Allow user to explicitly manage pystow cache (for downloaded sqlite dbs) #792

Closed
cmungall opened this issue Aug 15, 2024 · 5 comments · Fixed by #799

Comments

@cmungall
Collaborator

cmungall commented Aug 15, 2024

Copy from Slack:

@turbomam wrote:

I am writing about how great OAK is for a NMDC value set task. But I just realized that this
local/biome-info.txt:
$(RUN) runoak --input sqlite:obo:envo info .desc//p=i ENVO:00000428 > $@
does not retrieve forest biome or any of its children

UPDATE: My cached ~/.data/oaklib/envo.db was from March 9th, 2023!

Chris Mungall
Today at 11:35 AM
Answered separately but the issue here is that it’s easy for cached versions to become stale. There is some discussion here: cthoyt/pystow#54

I think OAK should more actively manage the cache for you, but I am open to ideas about how this should be done.

@cmungall
Collaborator Author

From @gouttegd:

My 2 cents, as someone having implemented caching features for a couple of projects (GrainyHead, Pebble):
Default behaviour should be that the cache is automatically refreshed if the cached data were last refreshed (or initially cached) more than X days ago (with X configurable by the user, with a default value of probably a few days to a few weeks; choosing the “best” or “least bad” default value can of course be tricky, and needs to take into account the volume of the cached data and their volatility).
It should always be possible to force a refresh of the cache at the demand of the user; that is, refresh the cache even if the last refresh occurred less than X days ago. Such a refresh should be triggered by an explicit option, and not by a workaround such as temporarily setting X to a very low value.
Conversely, it should always be possible to force the use of the cached data even if they have not been refreshed for more than X days. Reasons to do that include, for example, if you are working offline (or online but with a bad connection, or a metered connection with high download costs); or you are in a hurry and you would prefer, for today, to work on old data rather than waiting for the cache to be refreshed. As for the forceful refresh, the forceful “non-refresh” should be requested with an explicit option, rather than by a workaround such as temporarily setting X to a very large value.
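
The three behaviours above can be sketched as a single decision function. This is only an illustration, not code from OAK, pystow, or GrainyHead: `Mode`, `should_refresh`, and the 7-day default are hypothetical names and values chosen for the example, and the file's modification time is assumed to record the last refresh.

```python
import time
from enum import Enum
from pathlib import Path
from typing import Optional


class Mode(Enum):
    """Hypothetical refresh modes, one per behaviour described above."""
    AUTO = "auto"              # refresh only when older than max_age_days
    FORCE_REFRESH = "refresh"  # refresh even if the cached file is recent
    NO_REFRESH = "no-refresh"  # keep the cached file even if it is old


def should_refresh(path: Path, mode: Mode = Mode.AUTO,
                   max_age_days: float = 7.0,
                   now: Optional[float] = None) -> bool:
    """Return True if the cached file should be downloaded (again)."""
    if not path.exists():
        # Nothing cached yet: a download is needed whatever the mode.
        return True
    if mode is Mode.FORCE_REFRESH:
        return True
    if mode is Mode.NO_REFRESH:
        return False
    now = time.time() if now is None else now
    # Assumption: the file's mtime marks the last refresh.
    age_days = (now - path.stat().st_mtime) / 86400
    return age_days > max_age_days
```

Keeping the forced modes separate from `max_age_days` reflects the point made above: neither forcing nor suppressing a refresh requires temporarily changing X.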

In GrainyHead for example, the behaviours above are controlled by a single --caching option, which accepts either:
a numerical value indicating the number of days before cached data are considered stale;
the special value refresh to forcefully refresh the cache;
the special value no-refresh to always use cached data even if they are old (data would still be downloaded if they are not present at all in the cache, of course);
the special value reset to forcefully empty the cache (which of course would also force a refresh).
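
A value of this kind could be parsed roughly as follows. This is a sketch under assumptions, not GrainyHead's or OAK's implementation: `CachingSetting` and `parse_caching` are hypothetical names, and fractional day counts are assumed to be acceptable.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CachingSetting:
    """Hypothetical parsed form of a --caching value."""
    max_age_days: Optional[float] = None  # set only for numerical values
    force_refresh: bool = False
    never_refresh: bool = False
    reset: bool = False


def parse_caching(value: str) -> CachingSetting:
    """Parse a number of days or one of: refresh, no-refresh, reset."""
    v = value.strip().lower()
    if v == "refresh":
        return CachingSetting(force_refresh=True)
    if v == "no-refresh":
        return CachingSetting(never_refresh=True)
    if v == "reset":
        # Emptying the cache also means everything is downloaded again.
        return CachingSetting(force_refresh=True, reset=True)
    try:
        days = float(v)
    except ValueError:
        raise ValueError(f"invalid caching value: {value!r}") from None
    if not days >= 0:  # also rejects NaN
        raise ValueError(f"invalid caching value: {value!r}")
    return CachingSetting(max_age_days=days)
```

With the standard library, such a function can be passed as the `type=` argument of `argparse.ArgumentParser.add_argument`, so that invalid values are rejected when the command line is parsed.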

@gouttegd
Contributor

Is there an interest in allowing file-specific cache lifetimes?

That is, allowing users to say: “I want most SQLite DBs to be refreshed once every week, except uberon.db which should be refreshed once every month”?

This seems a bit “overkill” to me, but asking just in case.

@matentzn
Contributor

I personally like

  1. A flag to invalidate/force refresh (à la make -B)
  2. A dominant warning message when using an outdated cache
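
The second point could look something like the sketch below, which uses Python's standard `warnings` module. The function name `warn_if_stale`, the 7-day threshold, and the message wording are all hypothetical; the file's modification time is assumed to reflect when it was downloaded.

```python
import time
import warnings
from pathlib import Path
from typing import Optional


def warn_if_stale(path: Path, max_age_days: float = 7.0,
                  now: Optional[float] = None) -> bool:
    """Warn (and return True) if the cached file is older than max_age_days."""
    now = time.time() if now is None else now
    # Assumption: mtime is the time the file was downloaded.
    age_days = (now - path.stat().st_mtime) / 86400
    if age_days > max_age_days:
        warnings.warn(
            f"Using cached {path.name}, last updated {age_days:.0f} days ago; "
            "consider forcing a refresh.",
            UserWarning,
            stacklevel=2,
        )
        return True
    return False
```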

@cmungall
Collaborator Author

cmungall commented Aug 19, 2024 via email

@gouttegd
Contributor

This seems quite useful

OK, code-wise this is something that can easily be added to the infrastructure I propose in #799 (basically instead of having one global CachePolicy, the file cache can have a default policy and a map associating filename patterns to specific policies).
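
As a rough illustration of that idea (not the actual code in #799), a default lifetime plus a pattern-to-lifetime map might be looked up like this. `PolicyTable` and its methods are hypothetical names, lifetimes are simplified to a number of days, and patterns are assumed to be shell-style, matched with the standard `fnmatch` module.

```python
from fnmatch import fnmatchcase
from typing import Dict


class PolicyTable:
    """Hypothetical holder of a default lifetime plus per-pattern overrides."""

    def __init__(self, default_days: float) -> None:
        self.default_days = default_days
        # Insertion order is preserved, so the first matching pattern wins.
        self.overrides: Dict[str, float] = {}

    def add(self, pattern: str, days: float) -> None:
        """Associate a filename pattern with a specific lifetime in days."""
        self.overrides[pattern] = days

    def lifetime_for(self, filename: str) -> float:
        """Return the lifetime for a file, falling back to the default."""
        for pattern, days in self.overrides.items():
            if fnmatchcase(filename, pattern):
                return days
        return self.default_days


table = PolicyTable(default_days=7)
table.add("uberon.db", 14)
table.add("fb*.db", 30)
```

Here `table.lifetime_for("envo.db")` falls back to the default, while `uberon.db` and any file matching `fb*.db` get their own lifetimes.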

As for how the user would control that: I am reluctant to expose such a feature on the command line, and I don’t think it would be practical anyway (obviously the user wouldn’t want to state the file-specific policies on every call).

So, the default policy and the file-specific policies would only be configurable through a config file ($XDG_CONFIG_HOME/oaklib/cache.conf?), which could simply look like this:

# default policy is to refresh after 1 week
default=1w
# CL is more or less on a monthly release schedule, so refreshing after 2w is enough
cl.db=2w
# Ditto for Uberon
uberon.db=2w
# FlyBase ontologies are roughly on a 2-months release cycle
fb*.db=1m

(As seen in the last example, this would be using shell-type “glob” patterns, rather than regular expressions. I do think regexes would be overkill here.)

The --caching option on the command line, if used, would take precedence over any policy set forth in the configuration file.
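
A file in the format proposed above could be read along the following lines. This is a sketch, not the parser OAK uses: the function names are hypothetical, the `d` unit is an addition for the example, and `1m` is assumed to mean 30 days.

```python
from typing import Dict, Optional

# Assumed unit lengths in days; "m" is treated as 30 days for simplicity.
_UNIT_DAYS = {"d": 1, "w": 7, "m": 30}


def parse_duration(text: str) -> int:
    """Convert a value such as '1w' or '1m' into a number of days."""
    text = text.strip()
    number, unit = text[:-1], text[-1:].lower()
    if unit not in _UNIT_DAYS or not number.isdecimal():
        raise ValueError(f"invalid duration: {text!r}")
    return int(number) * _UNIT_DAYS[unit]


def parse_cache_conf(content: str) -> Dict[str, int]:
    """Parse 'pattern=duration' lines, skipping blanks and '#' comments."""
    policies: Dict[str, int] = {}
    for line in content.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"expected 'pattern=duration': {line!r}")
        policies[key.strip()] = parse_duration(value)
    return policies


def effective_days(cli_days: Optional[int], configured_days: int) -> int:
    """A value given on the command line overrides the configured one."""
    return configured_days if cli_days is None else cli_days
```

The result maps each key (`default` or a filename pattern) to a lifetime in days, and `effective_days` expresses the precedence rule: the configured value applies only when nothing was given on the command line.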
