Add cache management features #799

Merged
merged 9 commits into from
Aug 22, 2024

Conversation

gouttegd
Contributor

@gouttegd gouttegd commented Aug 18, 2024

This PR implements roughly what was proposed in this comment.

From a user’s perspective:

By default, whenever a SQLite DB is accessed through the sqlite:obo:... descriptor and the requested DB is already present in the Pystow cache, OAK checks whether the SQLite file is older than 7 days and, if it is, downloads that file again.

A new global command-line option, --caching, is available to alter that default behaviour. That option makes it possible to:

  • Change the duration after which a cached DB is considered “stale” and in need of refreshing; for example, with --caching=3w, a cached DB will be refreshed upon access if it is older than 3 weeks (the general syntax is ND, where N is a number and D can be s, d, w, m, or y to indicate that N is a number of seconds, days, weeks, months, or years, respectively).
  • Prevent OAK from refreshing a cached DB regardless of how old it is, with --caching=no-refresh.
  • Force OAK to refresh a cached DB regardless of how old it is, with --caching=refresh.
  • Force OAK to completely clear the cache (which will obviously trigger a refresh), with --caching=clear or --caching=reset.
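
To make the ND syntax above concrete, here is a minimal sketch of how such a value could be parsed. It is not the code from this PR: the function name `parse_caching_value` and the unit table are hypothetical, months and years are approximated as 30 and 365 days, and the sketch assumes that a unit letter is always given.

```python
from datetime import timedelta

# Hypothetical unit table (an assumption of this sketch, not taken from OAK):
# months and years are approximated as 30 and 365 days.
_UNITS = {
    "s": timedelta(seconds=1),
    "d": timedelta(days=1),
    "w": timedelta(weeks=1),
    "m": timedelta(days=30),
    "y": timedelta(days=365),
}

# The keyword values listed in the PR description.
_KEYWORDS = {"no-refresh", "refresh", "clear", "reset"}


def parse_caching_value(value):
    """Return ("keyword", name) or ("max-age", timedelta) for a --caching value."""
    value = value.strip()
    if value in _KEYWORDS:
        return ("keyword", value)
    number, unit = value[:-1], value[-1:]
    if unit not in _UNITS or not number.isdigit():
        raise ValueError(f"invalid caching value: {value!r}")
    return ("max-age", int(number) * _UNITS[unit])
```

With this sketch, `parse_caching_value("3w")` yields a maximum age of 21 days, and `parse_caching_value("reset")` yields the keyword unchanged.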

Implementation details:

Most of the new code is in the new module oaklib.utilities.caching and split in two classes:

  • CachePolicy represents the logic to determine whether a given file is in need of refreshing.
  • FileCache represents the file cache itself and is the main interface that the rest of OAK should interact with whenever it needs to access the cache (instead of using Pystow directly). It is merely a thin layer on top of Pystow that tries¹ to present the same interface (same methods) as a Pystow module, so that it can be used as a drop-in replacement.

(¹ I say “tries to” because it only implements the Pystow methods that were actually used somewhere in OAK code, and only with the parameters used in those cases.)

A new FILE_CACHE object (an instance of FileCache) is added to the globals in oaklib.constants (same place containing the PYSTOW_MODULE), where it can be used by any part of OAK that needs to interact with the cache. The llm_implementation and sqldb_implementation modules are amended to use that new global instead of PYSTOW_MODULE.
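
The "thin layer" idea can be illustrated with a small, self-contained sketch. This is not the FileCache class from this PR: the class name `SketchFileCache`, its constructor parameters, and the injected `downloader` callable are all hypothetical, and the real class delegates to Pystow rather than to a callable. The method name `ensure` is only loosely borrowed from Pystow, and its signature here is an assumption.

```python
import time
from pathlib import Path


class SketchFileCache:
    """Hypothetical stand-in for a cache layer; not OAK's FileCache."""

    def __init__(self, directory, max_age_seconds, downloader):
        self.directory = Path(directory)
        self.max_age_seconds = max_age_seconds
        # Hypothetical callable(url, destination_path) that fetches the file.
        self.downloader = downloader

    def ensure(self, name, url):
        """Return the cached path, downloading if the file is missing or stale."""
        path = self.directory / name
        if not path.exists() or self._is_stale(path):
            self.directory.mkdir(parents=True, exist_ok=True)
            self.downloader(url, path)
        return path

    def _is_stale(self, path):
        # Staleness is judged from the file's modification time.
        age = time.time() - path.stat().st_mtime
        return age > self.max_age_seconds
```

Callers only ever see `ensure`, so the refresh decision stays in one place, which is the point of routing all cache access through a single object.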

Finally, in the main entry point module:

  • the --caching option is added;
  • the cache-ls and cache-clear commands are re-written to use methods from the FileCache class.

closes #792

We add a cache management layer on top of Pystow. This takes the form of
two classes (both in `oaklib.utilities.caching`):

* one representing the cache management policy, i.e. the logic dictating
  whether a cached file (if present) should be refreshed or not;
* one representing the file cache itself.

The policy is set once by the main entry point method, using either a
default policy of refreshing cached data after 7 days, or another policy
explicitly selected by the user with the new `--caching` option.

The class that represents the file cache is the one that the rest of OAK
should interact with whenever an access to caching data is needed.
Ultimately, all calls to the Pystow module should be replaced by calls
to FileCache, the use of Pystow becoming an implementation detail
entirely encapsulated in FileCache.

Add new methods to the FileCache class to (1) get the list of files
present in the cache and (2) delete files in the cache.

Replace the implementations of the cache-ls and cache-clear commands to
use the new methods, so that the details of cache listing and clearing
remain encapsulated in FileCache.
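
As an illustration of listing and clearing a cache directory in pure Python, here is a minimal sketch. The function names `list_cached_files` and `clear_cached_files` and the optional glob `pattern` parameter are hypothetical; they are not the methods added to FileCache in this PR.

```python
from fnmatch import fnmatchcase
from pathlib import Path


def list_cached_files(directory, pattern="*"):
    """Return (name, size_in_bytes, mtime) tuples for matching files, sorted by name."""
    directory = Path(directory)
    if not directory.is_dir():
        return []
    entries = []
    for path in sorted(directory.iterdir()):
        if path.is_file() and fnmatchcase(path.name, pattern):
            stat = path.stat()
            entries.append((path.name, stat.st_size, stat.st_mtime))
    return entries


def clear_cached_files(directory, pattern="*"):
    """Delete matching files and return the names that were removed."""
    removed = []
    for name, _size, _mtime in list_cached_files(directory, pattern):
        (Path(directory) / name).unlink()
        removed.append(name)
    return removed
```

Because this relies only on `pathlib` and `fnmatch`, it behaves the same way on any platform Python runs on.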

As a side-effect, this automatically fixes the issue that cache listing
was only working on Unix-like systems, since the FileCache
implementation is pure Python and does not rely on the ls(1) Unix
command.

The intended difference between the REFRESH and RESET caching policies
is that, when a cache lookup is attempted, REFRESH should cause the file
that was looked up -- and only that file -- to be refreshed, leaving any
other file that may be present in the cache untouched. RESET, on the
other hand, should entirely clear the cache, so that not only the file
that was looked up should be refreshed, but any other file that may be
looked up in a subsequent call should be refreshed as well.
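
The difference can be modelled with a toy in-memory cache. This is only a sketch of the behaviour described above, not the code in this commit: the `ToyCache` class is hypothetical, "downloads" are merely recorded in a list, and clearing the cache once on the first lookup is an assumption of the sketch.

```python
class ToyCache:
    """Hypothetical in-memory model of the REFRESH and RESET policies."""

    def __init__(self, policy, files):
        self.policy = policy          # "refresh" or "reset"
        self.files = set(files)       # names currently present in the cache
        self.downloads = []           # names fetched so far, in order
        self._reset_done = False

    def lookup(self, name):
        # RESET: clear the whole cache once, on the first lookup (assumption).
        if self.policy == "reset" and not self._reset_done:
            self.files.clear()
            self._reset_done = True
        # REFRESH: always re-fetch the requested file, and only that file.
        if self.policy == "refresh" or name not in self.files:
            self.downloads.append(name)
            self.files.add(name)
        return name
```

Starting from a cache holding `a.db` and `b.db`, looking up `a.db` under "refresh" re-fetches `a.db` and leaves `b.db` in place, whereas under "reset" `b.db` is gone and will be fetched again the next time it is looked up.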

This commit implements the intended behaviour for the RESET policy.

In principle, we should never have to compare a timestamp representing a
future date when we check whether a cached file should be refreshed.
However, files with bogus mtime values and/or computers configured with
a bogus system time are certainly not uncommon, so encountering a
timestamp higher than the current time can (and will) definitely happen.

Under an "always refresh" policy, a refresh must be triggered even if
the cached file appears to be "newer than now", so we explicitly implement
that behaviour here.
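
A minimal sketch of such a decision function follows. The `Mode` enum and the `needs_refresh` function are hypothetical names, not the CachePolicy API. How a maximum-age policy should treat a future timestamp is a choice of this sketch: the file is simply treated as fresh.

```python
from enum import Enum


class Mode(Enum):
    """Hypothetical policy kinds used by this sketch."""
    ALWAYS = "always"
    NEVER = "never"
    MAX_AGE = "max-age"


def needs_refresh(mode, mtime, now, max_age=None):
    """Decide whether a cached file with timestamp `mtime` should be refreshed at `now`."""
    if mode is Mode.ALWAYS:
        # Checked first, so a file that looks "newer than now" is still refreshed.
        return True
    if mode is Mode.NEVER:
        return False
    # MAX_AGE: a negative age (mtime in the future) never exceeds max_age,
    # so such a file is treated as fresh (an assumption of this sketch).
    return (now - mtime) > max_age
```

Handling the "always" case before any arithmetic on timestamps is what makes the result independent of bogus mtime values or a misconfigured system clock.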

We also add a complete test fixture for the CachePolicy class.

In the SQLite tutorial, in the section that briefly mentions that
automatically downloaded SQLite files are cached in ``.data/oaklib``, we
describe in more detail how the cache works and how it can be
controlled using the `--caching` option.

@codecov-commenter

Codecov Report

Attention: Patch coverage is 81.10599% with 41 lines in your changes missing coverage. Please review.

Project coverage is 74.16%. Comparing base (ed29c27) to head (e1f08a3).

| Files | Patch % | Lines |
|---|---|---|
| src/oaklib/utilities/caching.py | 76.51% | 31 Missing ⚠️ |
| src/oaklib/cli.py | 30.76% | 9 Missing ⚠️ |
| src/oaklib/implementations/llm_implementation.py | 50.00% | 1 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #799      +/-   ##
==========================================
+ Coverage   74.01%   74.16%   +0.14%     
==========================================
  Files         282      284       +2     
  Lines       33533    33742     +209     
==========================================
+ Hits        24821    25026     +205     
- Misses       8712     8716       +4     


@matentzn
Contributor

This is so awesome, thank you!

Add a new section in the CLI reference documentation to explain how the
cache works and how it can be controlled using the `--caching` option.

Replace the previous, shorter documentation in the SQLite tutorial by a
simple mention of the cache with a link to the newly added reference
section.

@caufieldjh
Collaborator

Thanks @gouttegd , this is going to help to avoid a lot of confusion in the future.

This commit adds the possibility to configure the file cache to apply
pattern-specific caching policies. This is controlled by a configuration
file ($XDG_CONFIG_HOME/ontology-access-kit/cache.conf, under GNU/Linux)
containing "pattern=policy" pairs, where pattern is a shell-type
globbing pattern and policy is a string of the same type as expected by
the newly introduced --caching option.

@gouttegd
Contributor Author

PR updated with the feature discussed in this comment.

Briefly, in addition to the --caching option that can be specified on the command line, there is also the possibility of using a configuration file in which a user can not only set the default cache expiry lifetime (to be used when the --caching option is not used), but also set different cache expiry lifetimes for different cached files, for example:

# Refresh FBbt if older than 1 month
fbbt.db = 1m
# Refresh other FlyBase ontologies if older than 2 months
fb*.db = 2m
# Refresh Uberon if older than 2 weeks
uberon.db = 2w
# For any other file, refresh if older than 1 month
default = 1m
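
To show how "pattern = policy" lines like those above could be resolved for a given file name, here is a minimal sketch. It is not the parser from this PR: the function names `parse_cache_conf` and `policy_for` are hypothetical, and the rule that the first matching pattern wins is an assumption suggested by the ordering of the example (fbbt.db listed before fb*.db).

```python
from fnmatch import fnmatchcase


def parse_cache_conf(text):
    """Return (default_policy, [(pattern, policy), ...]) from 'pattern = policy' lines."""
    default = None
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        pattern, sep, policy = line.partition("=")
        if not sep:
            continue  # this sketch silently ignores lines without '='
        pattern, policy = pattern.strip(), policy.strip()
        if pattern == "default":
            default = policy
        else:
            rules.append((pattern, policy))
    return default, rules


def policy_for(filename, default, rules):
    """Return the policy of the first matching pattern (assumption), else the default."""
    for pattern, policy in rules:
        if fnmatchcase(filename, pattern):
            return policy
    return default
```

Applied to the example above, this sketch resolves fbbt.db to 1m, any other fb*.db file to 2m, uberon.db to 2w, and everything else to the default of 1m.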

The "user_config_dir" returned by the Appdirs package under macOS is not
in "~/Library/Preferences" but under "~/Library/Application Support"
(Appdirs documentation is not up to date).

Also, there is no need to mention the roaming directory under Windows,
as Appdirs will never use that directory unless we explicitly ask it to
do so (which we don't).

There is also no need for a show_default=True parameter with the
--caching option, since that option has _no_ default.

@cmungall cmungall merged commit ecfa132 into INCATools:main Aug 22, 2024
9 checks passed
@gouttegd gouttegd deleted the caching branch August 30, 2024 18:04
Development

Successfully merging this pull request may close these issues.

Allow user to explicitly manage pystow cache (for downloaded sqlite dbs)
5 participants