
[adag] Avoid deserialization during CompiledDAGRef's deallocation #47614

Open
jeffreyjeffreywang opened this issue Sep 11, 2024 · 11 comments
Labels
accelerated-dag · bug (Something that is supposed to be working; but isn't) · core (Issues that should be addressed in Ray Core) · triage (Needs triage, e.g. priority, bug/not-bug, and owning component)

Comments

@jeffreyjeffreywang
Contributor

What happened + What you expected to happen

Although we never call ray.get explicitly, ray.get is still invoked (and deserialization still happens) when the DAG ref is deallocated, because of the following code:

if not self._ray_get_called:
    self.get()

Versions / Dependencies

ray master

Reproduction script

import ray
from ray.dag import InputNode

@ray.remote
class Actor:  # minimal actor definition (assumed) for a runnable repro
    def __init__(self, value):
        self.value = value
    def inc(self, delta):
        self.value += delta
        return self.value

a = Actor.remote(0)
with InputNode() as inp:
    dag = a.inc.bind(inp)

compiled_dag = dag.experimental_compile()
ref = compiled_dag.execute(1)
compiled_dag.teardown()
# ref.get() is called upon ref's deallocation -- deserialization still happens

Issue Severity

None

@jeffreyjeffreywang jeffreyjeffreywang added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Sep 11, 2024
@rkooo567
Contributor

@jeffreyjeffreywang would you be interested in taking this?

@jeffreyjeffreywang
Contributor Author

jeffreyjeffreywang commented Sep 12, 2024

@rkooo567 Yup, definitely!

Hey @stephanie-wang @ruisearch42, I just wanted to clarify a few things before I proceed with this issue. You suggested releasing the value of a CompiledDAGRef if ray.get() isn't called by the user (#45951 (comment)). Could you please help me understand how the following code avoids leaking the execution result?

# If not yet, get the result and discard to avoid execution result leak.
if not self._ray_get_called:
    self.get()

With this code, another issue arises -- attempting to load a python library during program exit (when CompiledDAGRef is destructed) will fail. Please refer to #47305 (comment) for more context.

import ray
from ray.dag import InputNode, MultiOutputNode

@ray.remote
class Foo:  # minimal actor (assumed) for a runnable repro
    def __init__(self):
        self.v = 0
    def increment(self, delta):
        self.v += delta
        return self.v

@ray.remote
class Bar:  # minimal actor (assumed) for a runnable repro
    def __init__(self):
        self.v = 0
    def decrement(self, delta):
        self.v -= delta
        return self.v

foo = Foo.remote()
bar = Bar.remote()

with InputNode() as inp:
    dag = MultiOutputNode([foo.increment.bind(inp), bar.decrement.bind(inp)])

dag = dag.experimental_compile()

ref1 = dag.execute(1)
ref2 = dag.execute(1)

assert ref1.get() == [1, -1]

dag.teardown()
# Upon destruction, the DAG will be executed until the latest index.
# However, we attempt to import DAGContext during program exit.

I'm thinking about removing the custom destructor entirely but wanted to understand the implications before doing so.

@rkooo567
Contributor

With this code, another issue arises -- attempting to load a python library during program exit (when CompiledDAGRef is destructed) will fail. Please refer to #47305 (comment) for more context.

I think this is because Python cannot guarantee that all modules still exist by the time __del__ is called. So I agree that doing cleanup this way in a destructor is a bad idea, and it is something we should clean up.
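One common way to sidestep the "imports may fail in `__del__` at interpreter exit" problem is `weakref.finalize`, which registers cleanup that runs (via `atexit`) before module teardown and must not capture the object itself. A minimal sketch with illustrative names, not Ray's actual implementation:

```python
import weakref

released = []

class Ref:
    """Illustrative stand-in for an object needing cleanup (not CompiledDAGRef)."""
    def __init__(self, name):
        # The callback must not reference self, or the object is kept alive forever.
        # finalize runs on garbage collection or at exit, before modules are torn down.
        self._finalizer = weakref.finalize(self, released.append, name)

r = Ref("ref1")
del r  # on CPython the refcount drops to zero and the finalizer fires immediately
assert released == ["ref1"]
```

The key design point is that the cleanup callback carries everything it needs as arguments, so it never has to import a module or touch `self` during shutdown.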

@ruisearch42
Contributor

Could you please help me understand how the following code avoids execution result leak?

With this code, the Python object is retrieved and then immediately goes out of scope. If there are any native buffers underneath, they are released as well.

@jeffreyjeffreywang
Contributor Author

With this code, the python object is retrieved and then immediately goes out of scope. If there are any native buffers underneath, they will also be released.

Thank you @ruisearch42. Could you give me an example/repro of a case where native buffers are used and this destruction is therefore necessary? I'd like to verify whether the deserialization is actually needed.

If deserialization is necessary, we still need to solve the module import issue. We might want to move the deserialization (stepping through the remaining steps in the DAG) to teardown().
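The idea of moving release into teardown() could look something like this. This is purely a sketch with hypothetical names (RefSketch, CompiledDagSketch), not Ray's actual API: the compiled DAG owns the unconsumed results and drops them in teardown(), so the ref's destructor does nothing at interpreter exit.

```python
# Hypothetical sketch -- names and structure are assumptions, not Ray code.
class RefSketch:
    def __init__(self, dag, idx):
        self._dag = dag
        self._idx = idx

    def get(self):
        # Consuming a result removes it from the DAG's pending set.
        return self._dag._pending.pop(self._idx)

    def __del__(self):
        # Intentionally empty: no imports, no buffer access at shutdown.
        pass

class CompiledDagSketch:
    def __init__(self):
        self._pending = {}
        self._next_idx = 0

    def execute(self, value):
        self._pending[self._next_idx] = value
        ref = RefSketch(self, self._next_idx)
        self._next_idx += 1
        return ref

    def teardown(self):
        # Release every result the user never consumed.
        self._pending.clear()

dag = CompiledDagSketch()
ref1 = dag.execute(1)
ref2 = dag.execute(2)
assert ref1.get() == 1  # consumed normally
dag.teardown()          # releases ref2's unconsumed result
assert dag._pending == {}
```

The trade-off is that teardown() must run while the interpreter is still fully alive, which is exactly what the destructor approach cannot guarantee.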

@jeffreyjeffreywang
Contributor Author

This is a duplicate of #46909. I'll close both issues once this one is addressed.

@stephanie-wang
Contributor

Thanks, @jeffreyjeffreywang, for the great questions! The deserialization is necessary because the native buffer is reused for future data. If the reader does not explicitly read and release the buffer, then the buffer cannot be reused for future values. You can reproduce it by returning a numpy array as the DAG output; since numpy arrays are deserialized zero-copy, the buffer will be held until the np array in Python goes out of scope.

Note that you do not need to deserialize the data in order to release the buffer. We just need to make sure to call the ReadAcquire and ReadRelease methods on the buffer (but skip the python-based deserialization).
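The "release without deserializing" behavior can be illustrated with plain Python buffers. This is an analogy only and assumes nothing about Ray's channel internals: a zero-copy memoryview pins the underlying bytearray, the writer cannot reuse it until the view is released, and releasing the view deserializes nothing.

```python
# Analogy: memoryview as a zero-copy reader of a shared, reusable buffer.
buf = bytearray(b"payload!")
view = memoryview(buf)   # zero-copy "read acquire"

resize_blocked = False
try:
    buf.extend(b"x")     # the writer cannot reuse the buffer while a view is live
except BufferError:
    resize_blocked = True
assert resize_blocked

view.release()           # "read release" -- no bytes were copied or deserialized
buf.extend(b"x")         # now the buffer can be reused for the next value
assert bytes(buf) == b"payload!x"
```

This mirrors why the ref must acquire and release the underlying buffer even when the user never wants the deserialized Python value.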

We do a similar custom destructor for when ObjectRefs and actors go out of scope, so I think you can reuse a similar codepath to avoid the destruction ordering problem, see here.

@jeffreyjeffreywang
Contributor Author

Thank you, Stephanie, for the thorough explanation. I'll dig a bit deeper and publish a PR.

@rkooo567
Contributor

Thanks @jeffreyjeffreywang! Btw, are you in the OSS Ray Slack? We have a regular sync-up, and you are more than welcome to join!

@jeffreyjeffreywang
Contributor Author

@rkooo567 Yeah, I joined a couple of days ago. Thanks for the invite! I'll keep an eye out for the next sync-up and hopefully I'll be able to join! 😄

@rkooo567
Contributor

@anyscalesam can you make sure @jeffreyjeffreywang is invited to next sync?! Thank you!

@anyscalesam anyscalesam added the core Issues that should be addressed in Ray Core label Sep 16, 2024