
Cell-level caching #89

Open
JanPalasek opened this issue Aug 24, 2022 · 4 comments
Labels
enhancement New feature or request

Comments


JanPalasek commented Aug 24, 2022

Context

I work with notebooks that are expensive to compute.
If I change one code cell at the end of the notebook, I do not expect the entire cache to be invalidated and the whole notebook to be recomputed.
I want only the dependent cells to be re-executed.

Proposal

Assumption: Notebooks are executed from top to bottom (I come from Quarto). If we work directly with Jupyter notebooks, imho we don't need caching like this. Jupyter does it pretty well on its own. I don't know if this package attempts to cover this case as well.
I'd propose a cell-level cache.
We would remember each cell individually.
If a cell's source code (or its output) changes, we would recompute only the changed cell and all cells that come after it.
This would greatly improve performance when prototyping a notebook because we would only recompute dependent cells.
I assume there are like 100 problems that I don't see. If you see any, please fill me in. It's also possible that this is a problem specific to Quarto and I misjudged the scope of this project.
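Under the top-to-bottom assumption, the invalidation rule can be sketched like this (a hypothetical helper, not part of this project's API):

```python
def first_invalid_cell(old_sources, new_sources):
    """Return the index of the first cell whose source changed.

    Under top-to-bottom execution, that cell and every cell after it
    must be re-executed; everything before it can stay cached.
    """
    for i, (old, new) in enumerate(zip(old_sources, new_sources)):
        if old != new:
            return i
    # All common cells match; any cells appended at the end still need a run.
    return min(len(old_sources), len(new_sources))

# Only cell index 1 changed, so cells 1 and onward are re-executed:
first_invalid_cell(["a = 1", "b = 2", "c = a + b"],
                   ["a = 1", "b = 3", "c = a + b"])  # -> 1
```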

Tasks and updates

To be added later, if the proposed solution turns out to be viable.

@JanPalasek JanPalasek added the enhancement New feature or request label Aug 24, 2022
@chrisjsewell
Member

Heya. Well, the key problem (also for jupyter/nbclient#248) is: what would you cache?
You can't start execution halfway through a notebook unless you have cached the entire state of the kernel. Say you have three cells:

a = 1
b = 2
c = a + b

You can't run from cell 3 unless you've cached (and reloaded) the variables a and b.

I don't know of an easy way to do this robustly?


JanPalasek commented Aug 25, 2022

Yeah, sorry for the (possibly duplicate) issue. I looked into nbclient and it seemed like something that might need to be done there, though I'm not totally sure. I didn't see a method to skip execution of cached cells.

Yeah, that's true. I gave it some thought today, and it might be done by serializing the entire kernel state with dill. It has a function for that: dill.dump_module (previously dump_session). It supports serialization of all base objects except frame, generator, and traceback, and it works for pandas etc. However, if some object used in the notebook isn't supported by dill, it is always possible to fall back to the current implementation of caching.
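dill.dump_module would capture a whole module's namespace; the core idea can be illustrated more simply with dill.dumps/dill.loads on a dict of user variables (a sketch, assuming dill is installed):

```python
import dill  # pip install dill

# Variables defined by cells 1 and 2 of the a/b/c example above.
state = {"a": 1, "b": 2}

# Checkpoint: serialize the namespace after cell 2.
blob = dill.dumps(state)

# Later (e.g. a fresh kernel): restore the checkpoint and resume at cell 3.
restored = dill.loads(blob)
c = restored["a"] + restored["b"]
assert c == 3
```

In a real kernel, dill.dump_module on the user's session would play the role of dill.dumps here.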

To make the caching efficient, we could build something like a checkpoint system: a checkpoint would be made after the nth cell, serializing the entire state. Each checkpoint would have a hash computed from the source codes of all cells up to that cell. If any of those source codes changed, the checkpoint would be invalidated.
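The cumulative-hash scheme described above can be sketched with the standard library (hypothetical helper names, not an existing API):

```python
import hashlib

def checkpoint_keys(cell_sources):
    """Return one cumulative SHA-256 hex digest per cell.

    Checkpoint i is keyed by the hash of all cell sources up to and
    including cell i, so editing cell i invalidates checkpoints i..n.
    """
    h = hashlib.sha256()
    keys = []
    for src in cell_sources:
        h.update(src.encode("utf-8"))
        keys.append(h.hexdigest())
    return keys

old = checkpoint_keys(["a = 1", "b = 2", "c = a + b"])
new = checkpoint_keys(["a = 1", "b = 3", "c = a + b"])
# Checkpoint 0 is still valid; checkpoints 1 and 2 are invalidated.
assert old[0] == new[0] and old[1] != new[1] and old[2] != new[2]
```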

Further optimizations could be made to prevent the cache from being so memory-hungry, such as:

  • A developer usually modifies, say, the last 10 cells. Earlier checkpoints could be much sparser, saving memory.
  • We could measure which cells take most of the execution time (using a threshold or statistics from previous executions), place a checkpoint right after those cells, and drop some of the others.
  • Drop some of the checkpoints over time with a strategy like LFU (Least Frequently Used), which would locate the checkpoints that aren't used much and delete them.
  • ... ?
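The LFU option could look roughly like this (a minimal sketch with hypothetical names, ignoring ties and aging):

```python
class CheckpointStore:
    """Keep at most max_items checkpoints; evict the least-restored one."""

    def __init__(self, max_items):
        self.max_items = max_items
        self.blobs = {}  # checkpoint key -> serialized kernel state
        self.hits = {}   # checkpoint key -> number of restores

    def put(self, key, blob):
        if len(self.blobs) >= self.max_items:
            # Evict the checkpoint that was restored least often.
            victim = min(self.hits, key=self.hits.get)
            del self.blobs[victim], self.hits[victim]
        self.blobs[key] = blob
        self.hits[key] = 0

    def get(self, key):
        self.hits[key] += 1
        return self.blobs[key]

store = CheckpointStore(max_items=2)
store.put("ckpt1", b"state-1")
store.put("ckpt2", b"state-2")
store.get("ckpt2")              # ckpt2 restored once, ckpt1 never
store.put("ckpt3", b"state-3")  # over budget: evicts ckpt1, the least used
assert set(store.blobs) == {"ckpt2", "ckpt3"}
```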

The main things that imo need to be tested:

  • Dill and some libraries people tend to use in their notebooks / reports: pandas (already tested), numpy, scipy, tensorflow, matplotlib, plotly, etc.
  • Dill's performance for big objects, such as a large pandas dataframe or a large numpy array (large numpy arrays should be OK based on this SO post).

I'm very interested in your opinion about these suggestions. I could also potentially help with some of the tasks.

@JanPalasek
Author

@chrisjsewell Will you accept a PR if someone manages to come up with a good solution? (probably taking some inspiration from knitr)

@chrisjsewell
Member

Heya yeh definitely interested thanks
