demo of what checkpointing plugins might look like #3535

benclifford · 2024-07-19T17:58:20Z

Description

this is most immediately in the context of issue #3534 - but @WardLT might also be interested

this very rough PR:

moves more checkpoint/memo out of the data flow kernel into the existing memoizer implementation class
makes the DFK use an abstract Memoizer class, with the existing implementation now in BasicMemoizer
adds a test, perhaps as a demo for issue #3534 (Method for overriding cached/checkpointed results), showing how a checkpoint/memo lookup can look at the args of the function it has been passed and decide either to ask an underlying BasicMemoizer to look up a result, or to return no memoized result without even asking the basic memoizer
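
For readers who want to picture the last item, here is a minimal, self-contained sketch of the idea. It does not use Parsl's real classes or signatures: the `Memoizer`, `DictMemoizer` and `ArgAwareMemoizer` classes, the method shapes, the `run` helper and the `ignore_cache` keyword are all assumptions made up for illustration.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Optional, Tuple

FLAG = "ignore_cache"  # hypothetical keyword a caller uses to skip memo lookup


def strip_flag(kwargs):
    """Drop the opt-out flag so it never becomes part of the memo key."""
    return {k: v for k, v in kwargs.items() if k != FLAG}


class Memoizer(ABC):
    """Hypothetical plugin interface; the caller only talks to this."""

    @abstractmethod
    def check_memo(self, func, args, kwargs) -> Optional[Any]:
        """Return a memoized result, or None if there is none."""

    @abstractmethod
    def update_memo(self, func, args, kwargs, result) -> None:
        """Record a result for later lookups."""


class DictMemoizer(Memoizer):
    """Stand-in for a basic memoizer: an in-memory table keyed on name and args."""

    def __init__(self) -> None:
        self._table: Dict[Tuple, Any] = {}

    @staticmethod
    def _key(func, args, kwargs):
        return (func.__qualname__, args, tuple(sorted(kwargs.items())))

    def check_memo(self, func, args, kwargs):
        return self._table.get(self._key(func, args, kwargs))

    def update_memo(self, func, args, kwargs, result):
        self._table[self._key(func, args, kwargs)] = result


class ArgAwareMemoizer(Memoizer):
    """Looks at the call's arguments before deciding whether to ask the inner memoizer."""

    def __init__(self, inner: Memoizer) -> None:
        self._inner = inner

    def check_memo(self, func, args, kwargs):
        if kwargs.get(FLAG, False):
            return None  # do not even ask the inner memoizer
        return self._inner.check_memo(func, args, strip_flag(kwargs))

    def update_memo(self, func, args, kwargs, result):
        self._inner.update_memo(func, args, strip_flag(kwargs), result)


def run(memoizer: Memoizer, func, *args, **kwargs):
    """Toy replacement for the kernel's lookup-then-execute path."""
    cached = memoizer.check_memo(func, args, kwargs)
    if cached is not None:
        return cached
    result = func(*args, **strip_flag(kwargs))
    memoizer.update_memo(func, args, kwargs, result)
    return result
```

Note that this sketch treats `None` as "no memo", so a function that legitimately returns `None` would be re-run every time; a real implementation would need a different way to signal a miss.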

This PR is intended for experimentation with this kind of API, but a lot of it drives towards a cleaner codebase, so most of it should find its way into the master branch

Type of change

Choose which options apply, and delete the ones which do not apply.

New feature

drewoldag

Overall, this looks great and seems to be exactly what we would want to be able to programmatically avoid using cached/memoized data. I left one small question spurred by a comment in test_memoize_plugin.py, but it's not critical. Thanks a bunch for putting this together.

drewoldag · 2024-07-22T20:35:04Z

parsl/tests/test_python_apps/test_memoize_plugin.py

+    # TODO: this .result() needs to be here, not in the loop
+    # because otherwise we race to complete... and then
+    # we might sometimes get a memoization before the loop
+    # and sometimes not...


Is this generally something that users should keep in mind when working with caching and memoization in Parsl? Namely that there could be race conditions that crop up for functions that are in a loop and return very quickly?

benclifford · 2024-07-23

This will happen if you launch two apps with the same parameters at "the same time", where "the same time" means without waiting for one of them to complete:

1. invoke my_app(7)
   there's no checkpoint/memo so we submit it to be executed, as task 1

2. invoke my_app(7)
   there's no checkpoint/memo so we submit it to be executed, as task 2

3. task 1 completes and its result is memoized as my_app 7

4. task 2 completes and its result is memoized as my_app 7, replacing the result from step 3.

Something that might be possible to implement to avoid this (again, low priority for me) is more like Haskell thunks:

1. invoke my_app(7)
   there's no checkpoint/memo so we submit it to be executed, as task 1. we memoise its unpopulated future as my_app 7.

2. invoke my_app(7)
   there's a memo future for it (with no result yet) - so use that for the result

3. task 1 completes, populating its result future.
   This population of the result future causes the 2nd invocation's memo future to be populated too, and that completes.

so my_app is only run once.
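
A minimal sketch of that thunk-like idea, using only the standard library rather than Parsl: the `FutureMemoizer` class, its `invoke` method and the `demo` helper are hypothetical names for illustration. As a simplification, the second invocation is handed the very same `Future` object instead of a separate future that gets populated later.

```python
import threading
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Dict, Tuple


class FutureMemoizer:
    """Memo table that stores futures, so an identical second invocation
    shares the in-flight task instead of submitting another one."""

    def __init__(self, executor: ThreadPoolExecutor) -> None:
        self._executor = executor
        self._table: Dict[Tuple[str, Tuple[Any, ...]], Future] = {}
        self._lock = threading.Lock()

    def invoke(self, func, *args) -> Future:
        key = (func.__qualname__, args)
        # The lock makes "look up, then submit and record" a single step,
        # so two callers cannot both miss and both submit.
        with self._lock:
            fut = self._table.get(key)
            if fut is None:
                fut = self._executor.submit(func, *args)
                self._table[key] = fut  # memoise the still-unpopulated future
            return fut


def demo():
    runs = []
    release = threading.Event()

    def my_app(x):
        runs.append(x)
        release.wait(timeout=5)  # hold the task open until both invocations are made
        return x * 2

    with ThreadPoolExecutor(max_workers=2) as pool:
        memo = FutureMemoizer(pool)
        first = memo.invoke(my_app, 7)   # submitted as a task
        second = memo.invoke(my_app, 7)  # finds the memoised future
        release.set()
        return first.result(), second.result(), len(runs)
```

One design consequence worth noting: in this sketch a failed future stays in the table, so every later invocation with the same arguments sees the same exception unless the entry is evicted.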

drewoldag · 2024-07-23

Oh, ok. I see what you mean. From a selfish perspective, I don't think the extra effort of memoizing a future is necessary for my work.

Thanks a bunch for the explanation.

benclifford · 2024-07-24T10:04:01Z

@drewoldag I'm interested to see any code you write that makes use of the interface - link it here if anything goes online. it doesn't need to look beautiful - i'm more interested in the realistic practical use.

benclifford · 2024-07-30T20:17:24Z

by chance a different checkpoint-related question/use case came up, and I've added a bit more to this demo to address that. the question was: can checkpointing optionally also checkpoint exceptions (so that they will be treated as permanent failures even across runs)? almost all of the machinery to do that is already there, including the file format having a space for exceptions (that was unused and should have been tidied away long ago). This PR instead now makes use of that file format and lets a user specify policies for which completed apps should be checkpointed to disk.

PfeifferMicha · 2024-08-01T13:50:12Z

As mentioned by @benclifford, I wanted to cache certain exceptions as "valid" results so that these tasks would not be rerun. I've made some preliminary tests with this PR and it seems to work as expected.

I used this custom Memoizer to cache tasks throwing SampleProcessingException or SampleValidationException:

class ExceptionSafeMemoizer(BasicMemoizer):
    def filter_for_checkpoint(self, app_fu):
        # The task record is available from app_fu.task_record.
        assert app_fu.task_record is not None

        # Checkpoint either if there's a result, or if the exception is a
        # SampleProcessingException or SampleValidationException.
        exp = app_fu.exception()
        if exp is None:
            # No exception occurred: cache the result.
            return True

        # Our internal pipeline exceptions are expected behavior, so cache them
        # and do not re-run the task in future. Any other (unexpected)
        # exception is _not_ cached.
        return isinstance(exp, (SampleProcessingException, SampleValidationException))

(Not sure about the line assert app_fu.task_record is not None, I copied that from the example.)

As mentioned on Slack, the only thing that was a bit inconvenient was that exceptions raised within the filter_for_checkpoint call, even plain code errors, were silently ignored (i.e. they were only written into parsl.log, but they were not communicated back to the main process called by the user and did not abort the program, as I'd expect them to).
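
One possible explanation for this kind of behaviour, offered as an assumption about where the filter might be invoked rather than something stated in this thread: in Python's standard `concurrent.futures`, an exception raised by a callback registered with `Future.add_done_callback` is logged and then ignored. The sketch below shows only that standard-library behaviour; `buggy_filter`, `ListHandler` and `demo` are hypothetical names, and no Parsl code is involved.

```python
import logging
from concurrent.futures import Future


class ListHandler(logging.Handler):
    """Collects log records in a list so they can be inspected."""

    def __init__(self) -> None:
        super().__init__()
        self.records = []

    def emit(self, record: logging.LogRecord) -> None:
        self.records.append(record)


def buggy_filter(fut: Future) -> None:
    # Stands in for a checkpoint filter that contains a plain code error.
    raise NameError("typo in filter code")


def demo():
    handler = ListHandler()
    logger = logging.getLogger("concurrent.futures")
    logger.addHandler(handler)
    try:
        fut: Future = Future()
        fut.add_done_callback(buggy_filter)
        # The callback raises while the result is being set, but the
        # exception is logged and swallowed; set_result returns normally.
        fut.set_result(42)
        return fut.result(), len(handler.records)
    finally:
        logger.removeHandler(handler)
```

If the filter were called from such a callback, the caller would see a normal result while the error appears only in the log, which matches the symptom described above.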

benclifford · 2024-08-21T13:16:08Z

another good example of checkpointing would be an out-of-memory store using sqlite3 - that would make a nice alternate implementation

this test checks that memoization works with all configs. test design: that's not necessarily true in future but it should be now. in future perhaps a memoizer option would be "NoMemoizer", which would not do *any* memoization, not even in-memory-only? but I think it's ok to not do that for now, and i think it's ok to require that a memoizer always does actually do memoization at a memory level (so you can't avoid it...)


goal: results should not (never? in weak small cache?) be stored in an in-memory memo table, so that memo table should not be present in this implementation. instead all memo questions go to the sqlite3 database. this drives some blurring between in-memory caching and disk-based checkpointing: the previous disk-based checkpoint model relied on repopulating the in-memory memo table cache...

i hit some thread problems when using one sqlite3 connection across threads, and the docs are unclear about what I can/cannot do, so i made this open the sqlite3 database on every access. that probably has quite a performance hit, but it's probably enough for basically validating the idea.
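
To make the "open the database on every access" approach concrete, here is a minimal sketch using only the standard library. It is not the code from this PR: the `SQLiteMemoStore` class, the table layout, the string keys and the use of pickle for values are all assumptions for illustration.

```python
import os
import pickle
import sqlite3
import tempfile
from contextlib import closing
from typing import Any, Tuple


class SQLiteMemoStore:
    """Memo store with no in-memory table: every lookup goes to sqlite3.

    A fresh connection is opened for each access, which avoids sharing
    one connection between threads at the cost of some performance.
    """

    def __init__(self, path: str) -> None:
        self._path = path
        # `closing` closes the connection; the inner `conn` context commits.
        with closing(sqlite3.connect(self._path)) as conn, conn:
            conn.execute(
                "CREATE TABLE IF NOT EXISTS memo (key TEXT PRIMARY KEY, result BLOB)"
            )

    def put(self, key: str, result: Any) -> None:
        with closing(sqlite3.connect(self._path)) as conn, conn:
            conn.execute(
                "INSERT OR REPLACE INTO memo (key, result) VALUES (?, ?)",
                (key, pickle.dumps(result)),
            )

    def lookup(self, key: str) -> Tuple[bool, Any]:
        """Return (found, value) so that a stored None is distinguishable from a miss."""
        with closing(sqlite3.connect(self._path)) as conn:
            row = conn.execute(
                "SELECT result FROM memo WHERE key = ?", (key,)
            ).fetchone()
        if row is None:
            return False, None
        return True, pickle.loads(row[0])


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "memo.sqlite3")
        SQLiteMemoStore(path).put("my_app-7", 49)
        # A second store object on the same file sees the entry, as a
        # later run reading an earlier run's checkpoint would.
        later = SQLiteMemoStore(path)
        return later.lookup("my_app-7"), later.lookup("my_app-8")
```

Because nothing is cached in memory, the same store serves both as the memo table and as the on-disk checkpoint, which is the blurring described above.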

benclifford mentioned this pull request Jul 19, 2024

Method for overriding cached/checkpointed results #3534

Open

drewoldag approved these changes Jul 22, 2024


benclifford force-pushed the benc-checkpoint-plugins branch from 12b2168 to 1846145 Compare July 30, 2024 20:04

benclifford force-pushed the benc-checkpoint-plugins branch 3 times, most recently from 7d4a026 to 2255dfe Compare August 22, 2024 15:19

benclifford added 13 commits August 22, 2024 15:21

delete duplicate/subtest

84a66b3

this test checks that checkpoint dir exists; checkpoint_1 checks that files exist within the checkpoint dir

remove unused checkpoint return value

fa4b7a5

make checkpoint call not use dfk state, in prep for moving into memoizer

b9b2a6f

loadchcekpoints in memoizer

4594fc9

dev

40c78fe

make memoizer into an interface class and impls

8ba7557

configurable memoizer instance

09eb41d

checkpoint exceptions

1745e7e

add a todo on checkpoint policy position

333c7eb

make hash does not need to be part of basic memoizer

e78e12d

and is more reusable when it isn't. this isn't the only way to make a hash though, and hashing isn't the only way to compare checkpoint entries for equality.

close method for api

9ff13d7

benclifford force-pushed the benc-checkpoint-plugins branch 4 times, most recently from 23ff9ce to 4726c81 Compare August 22, 2024 15:57


benclifford Jul 23, 2024

drewoldag Jul 23, 2024
