
Learning about point tasks in a process #1746

Open · opensdh opened this issue Aug 27, 2024 · 9 comments

opensdh commented Aug 27, 2024

Using global variables and the usual thread synchronization, it is easy to take process-local notes as point tasks execute. However, the distribution of point tasks over (Unix) processes is in general known only to the relevant mapper instances, and it could be expensive to communicate it to the other processes where the point tasks end up running. As such, no process knows when all its assigned point tasks have executed (so as to finalize/transmit the collected notes). As a particular example, a process might end up with 0 point tasks, leaving no hint that the launch took place at all.

Could we have a means of learning from Legion (which already has to arrange for the right number of calls to take place) either how many point tasks from a particular (index) launch have run or will run in each process, or that, again per launch, no more point tasks will be started in each process? One form this interface might take would be a callback function (with access at least to the TaskArgument) called once per Legion::Runtime per task launch. It would either provide the count as an argument or (following the CPS idea) be invoked only after all relevant point tasks have run. In the usual (non-debugging) situation of one Runtime per process, this produces one call on each process, as desired, independent of sharding patterns.

A toy example might be to collect per-process statistics about a field for lightweight in-situ analysis:

// Toy sketch: register_launch_complete_callback is the proposed API, and the
// ellipses stand for the usual Legion boilerplate.
std::atomic<unsigned> max;

void bunny_task(...) {
  unsigned mine = 0;
  for (i...) {
    // update various fields, including bunny_field
    mine = std::max(mine, bunny_field[i]);
  }
  max.fetch_max(mine);  // C++26; otherwise a compare_exchange loop
}

// Proposed: called once per Legion::Runtime per task launch.
void callback(Legion::Task *) {
  if (max.exchange(0) > 75)
    std::cerr << getpid() << ": too many bunnies\n";
}

void caller_task() {
  run->execute_index_space(ctx, bunny_task_launcher);
}

void near_main() {
  Runtime::set_top_level_task_id(...);
  Runtime::register_launch_complete_callback(callback);  // proposed
  Runtime::start(...);
}

Obviously, as written, this works only if all the launches of bunny_task are serialized (though their point tasks can run in parallel!). Passing the Legion::Task* to the callback allows a more sophisticated (read: realistic) callback to perform separate bookkeeping for separate launches.

A very different CPS-like approach (which would be sufficient for our use cases) might be to equip Legion::Future with a then (and Legion::FutureMap with an all_then) that performs a callback (in the process that registered the callback, or else in all processes that collectively registered one) when the future is (all the futures are) ready without requiring a separate task launch or blocking.
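Concretely, usage of such a (hypothetical) interface might look like the following; neither then nor all_then exists today, and the callback signatures here are only guesses:

Legion::Future f = runtime->execute_task(ctx, launcher);
f.then([](const void *result, size_t size) {
  // runs in this process once f is ready, with no task launch and no blocking
});

Legion::FutureMap fm = runtime->execute_index_space(ctx, index_launcher);
fm.all_then([]() {
  // runs once all point futures are ready
});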


opensdh commented Aug 27, 2024

One interesting special case here is a task configured with set_idempotent: there the total number of point tasks is not known a priori, although it would seem that the same interface would suffice.

lightsighter (Contributor) commented:

> A very different CPS-like approach (which would be sufficient for our use cases) might be to equip Legion::Future with a then (and Legion::FutureMap with an all_then) that performs a callback (in the process that registered the callback, or else in all processes that collectively registered one) when the future is (all the futures are) ready without requiring a separate task launch or blocking.

How is that different from launching a task which only depends on that future? Tasks are already callbacks that you get when dependences are satisfied. If you don't want to send the task all the way through the pipeline, you can even make it a local_function_task since it has no region requirements, but that is an optimization and not required.
https://gitlab.com/StanfordLegion/legion/-/blob/master/runtime/legion.h?ref_type=heads#L1619-1627
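For example, a sketch of that pattern (BUNNY_CHECK_TASK_ID is just an illustrative task ID and f the future in question; add_future and the local_function_task launcher field are the existing API):

Legion::TaskLauncher check(BUNNY_CHECK_TASK_ID, Legion::TaskArgument());
check.add_future(f);               // the task runs only once f is ready
check.local_function_task = true;  // optional: skip the full task pipeline
runtime->execute_task(ctx, check);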


opensdh commented Aug 27, 2024

A local_function_task might be most of what we want here, but I'm not sure:

  1. We would want to be able to have it proceed only when an entire FutureMap was complete (ideally more efficiently than by adding every FutureMap::get_future result to TaskLauncher::futures; see the sketch after this list).
  2. Does "on a local processor where the parent task is executing" mean that the local function task runs on every such processor when the parent task is control-replicated, or just on an unspecified one of them?
  3. What exactly does "pure function with no side effects" mean here? Without any region requirements, of course such a task cannot alter any field values, so is this a statement about making other calls into Legion (e.g., executing further tasks) or what?
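For item 1, the workaround we would like to avoid looks roughly like this (a sketch; FOLLOW_TASK_ID is an illustrative task ID, fm the FutureMap from the index launch, and launch_domain its launch domain):

// Depend on every point future individually by adding each one to the launcher.
Legion::TaskLauncher follow(FOLLOW_TASK_ID, Legion::TaskArgument());
for (Legion::Domain::DomainPointIterator it(launch_domain); it; it++)
  follow.add_future(fm.get_future(it.p));
runtime->execute_task(ctx, follow);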

lightsighter (Contributor) commented:

> We would want to be able to have it proceed only when an entire FutureMap was complete (ideally more efficiently than by adding every FutureMap::get_future result to TaskLauncher::futures).

You can't do this at the moment, but it probably wouldn't be too hard to add. Do you really want it to wait on every future in the future map (all point tasks), or are just certain futures enough?

Does "on a local processor where the parent task is executing" mean that the local function task runs on every such processor when the parent task is control-replicated, or just on an unspecified one of them?

It will actually run on exactly the same processor as the parent task (when the parent is preempted and the futures have all completed).

> What exactly does "pure function with no side effects" mean here? Without any region requirements, of course such a task cannot alter any field values, so is this a statement about making other calls into Legion (e.g., executing further tasks) or what?

Mostly this means that you don't have any region requirements. You can have side effects, like calling printf, but it's up to you to manage them in that case. I think local function tasks can launch further sub-tasks, but since they can't have any region arguments, there is no way to pass privileges.
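For example, a local function task body with a side effect that the application manages itself might look like this (a sketch; report_task and the unsigned result type are illustrative):

void report_task(const Legion::Task *task,
                 const std::vector<Legion::PhysicalRegion> &regions,
                 Legion::Context ctx, Legion::Runtime *runtime) {
  // no region requirements: the only inputs are the futures
  unsigned count = task->futures[0].get_result<unsigned>();
  printf("bunny count: %u\n", count);  // side effect managed by the application
}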


opensdh commented Aug 29, 2024

> You can't do this at the moment, but it probably wouldn't be too hard to add. Do you really want it to wait on every future in the future map (all point tasks), or are just certain futures enough?

Well, for the particular use case we have in mind, it would be sufficient to wait on all the futures corresponding to point tasks dispatched by the same Runtime or by the same shard. I said just "all" because it seemed more likely to be generically useful.

> It will actually run on exactly the same processor as the parent task (when the parent is preempted and the futures have all completed).

So, to be sure, if the parent task is control-replicated and (as one would expect) all of its shards post the same local function task, that task runs once each on every processor that has been running (one shard of) the parent task?

> Mostly this means that you don't have any region requirements. You can have side effects, like calling printf, but it's up to you to manage them in that case. I think local function tasks can launch further sub-tasks, but since they can't have any region arguments, there is no way to pass privileges.

I suppose you could create entirely new logical regions?

lightsighter (Contributor) commented:

> Well, for the particular use case we have in mind, it would be sufficient to wait on all the futures corresponding to point tasks dispatched by the same Runtime or by the same shard. I said just "all" because it seemed more likely to be generically useful.

Right, that is an optimization, but you would probably want to express it as a dependence on the whole future map.

> So, to be sure, if the parent task is control-replicated and (as one would expect) all of its shards post the same local function task, that task runs once each on every processor that has been running (one shard of) the parent task?

Yes, we implicitly replicate local function tasks so we run a copy on every shard if the parent is control replicated. This is easy to do since the only inputs and outputs are futures and not regions.

> I suppose you could create entirely new logical regions?

You could, I suppose, but I think we might enforce that the selected variant be a leaf task variant, so you could make the new logical regions but then not populate them with any data.


opensdh commented Aug 30, 2024

So it sounds like TaskLauncher::local_function_task with a new TaskLauncher::future_maps or so would satisfy our requirements, so long as set_idempotent doesn't spoil it. Can we have that?
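Concretely, usage might look like this (a sketch; NOTIFY_TASK_ID and bunny_fm are illustrative, and future_maps is the member being requested, which does not exist today):

Legion::TaskLauncher notify(NOTIFY_TASK_ID, Legion::TaskArgument());
notify.local_function_task = true;       // existing member
notify.future_maps.push_back(bunny_fm);  // proposed: depend on the whole FutureMap
runtime->execute_task(ctx, notify);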

lightsighter (Contributor) commented:

> So it sounds like TaskLauncher::local_function_task with a new TaskLauncher::future_maps or so would satisfy our requirements, so long as set_idempotent doesn't spoil it. Can we have that?

What do you mean by "so long as set_idempotent doesn't spoil it?" What are you afraid that setting it idempotent would do?


opensdh commented Sep 3, 2024

I can imagine that idempotent task launches, since the runtime has permission to run them extra times, might complicate the accounting: might a Future(Map) returned from an idempotent task launch become ready (in the wait_all_results sense as appropriate) before all point tasks have completed if the value is known from those that have?
