Combining chunked+non-chunked packed_range APIs? #51

Open
roystgnr opened this issue Oct 9, 2020 · 4 comments

Comments

roystgnr (Member) commented Oct 9, 2020

An idea of mine from some send/receive packed_range discussion with @loganharbour late last night, slightly edited:

What if we always do chunked sends conceptually, but:

  1. We pack any metadata in with the buffer itself, so that we can use a Status to allocate the receive buffer and we don't have to send a size integer separately. To enable better computation+communication interleaving during nonblocking sends, we can avoid packing a total_buffer_size, and just pack a bool in each chunk that says whether subsequent chunks are expected.

  2. If a probe for the first buffer succeeds, we just let probes for the packed receive return true and we allow "semi" non-blocking receives in the API: when a PostWaitUnpackBuffer handles the first received chunk's buffer, it checks whether subsequent chunks are expected and handles those with a receive-and-unpack loop (blocking as a whole, even if we use non-blocking vector receives internally to let the receive and unpack steps overlap). See the sketch after this list for the wire format and receive loop.
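
To make this concrete, here is a rough sketch of the wire format and receive loop using plain MPI calls on int buffers. The names send_chunked/receive_chunked and the trailing-flag encoding are illustrative assumptions standing in for the real Packing<T> machinery, not existing TIMPI code:

```cpp
// Rough sketch only: plain MPI with int buffers, not TIMPI's Packing<T> code.
#include <algorithm>
#include <cstddef>
#include <vector>
#include <mpi.h>

using buffer_type = int;

// Sender: split the packed data into chunks of at most max_chunk_size
// entries, appending a 1/0 flag so the receiver knows whether more chunks
// follow.  No separate size or chunk-count message is ever sent.
void send_chunked(const std::vector<buffer_type> & packed,
                  int dest, int tag, MPI_Comm comm,
                  std::size_t max_chunk_size)
{
  std::size_t offset = 0;
  do
    {
      const std::size_t this_chunk =
        std::min(max_chunk_size, packed.size() - offset);
      std::vector<buffer_type> chunk(packed.begin() + offset,
                                     packed.begin() + offset + this_chunk);
      offset += this_chunk;
      chunk.push_back(offset < packed.size() ? 1 : 0); // metadata in the buffer
      MPI_Send(chunk.data(), int(chunk.size()), MPI_INT, dest, tag, comm);
    }
  while (offset < packed.size());
}

// Receiver: the first chunk's size comes from a probed Status, so no size
// message is needed; if its trailing flag says more chunks follow, fall into
// a receive-and-unpack loop (blocking as a whole: the "semi" non-blocking
// behavior described in item 2).
std::vector<buffer_type> receive_chunked(int src, int tag, MPI_Comm comm)
{
  std::vector<buffer_type> all_data;
  bool more_chunks = true;
  while (more_chunks)
    {
      MPI_Status status;
      MPI_Probe(src, tag, comm, &status);
      int count;
      MPI_Get_count(&status, MPI_INT, &count);
      std::vector<buffer_type> chunk(count);
      MPI_Recv(chunk.data(), count, MPI_INT, src, tag, comm, MPI_STATUS_IGNORE);
      more_chunks = (chunk.back() == 1);
      chunk.pop_back();
      // Real code would unpack this chunk here, overlapping with the next receive.
      all_data.insert(all_data.end(), chunk.begin(), chunk.end());
    }
  return all_data;
}
```

Since MPI guarantees message ordering between a given sender/receiver pair on a given tag, the loop sees the chunks in the order they were sent, so no chunk index needs to travel in the buffer either.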

Then:

We get all the benefits of the existing chunked case: User code that wants to interleave communication and computation as much as possible within a single send/receive pair can set a moderate max chunk size, and it'll work the way the chunked case does now, so it can begin processing earlier chunks while later chunks are still in transit.

We improve slightly on the existing chunked case's flexibility and performance: at least the first chunk of a receive can be non-blocking, and each subsequent chunk's receive can be started (non-blocking internally even if we're using the blocking user API) before the meat of the prior chunk gets unpacked, in cases where a probe for it succeeds.

We get all the benefits of the non-chunked case: User code can set a 2e9 max chunk size and it'll work the way the non-chunked case does now, except for one extra buffer entry and one extra if test. There still won't be the latency of waiting for a separate total_size message, and we'll still be able to allocate the incoming buffer based on a probed Status.

We improve slightly on the existing non-chunked case's robustness: If not everything fits in 2e9 entries, and we set a 2e9 max chunk size and try to do a non-blocking receive, then in some cases we end up stupidly twiddling our thumbs while waiting for that next chunk, which sucks ... but right now what we do in that case is timpi_error_msg("Non-blocking packed range sends cannot exceed " << std::numeric_limits<int>::max() << "in size"); so we're still moving in the right direction.

We improve a lot on the flexibility of performance testing: We can experiment with using smaller max chunk sizes to see if we can get some more communication/computation overlap, and this can even be benchmarked without recompiling because the difference between the chunked and the unchunked API is now just the value of an integer variable, not two different methods with different argument lists.
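
As a hypothetical illustration (reusing the send_chunked sketch above, which is not an existing TIMPI call), a benchmark sweep then reduces to varying one integer:

```cpp
// Hypothetical benchmark driver, building on the send_chunked() sketch above.
// Only the max_chunk_size value differs between the "chunked" and the
// effectively "unchunked" configurations.
#include <limits>

void sweep_chunk_sizes(const std::vector<buffer_type> & packed,
                       int dest, int tag, MPI_Comm comm)
{
  // The -1 leaves room for the trailing flag entry in the sketch's encoding.
  const std::size_t sizes[] =
    { std::size_t(1) << 16,                               // small chunks
      std::size_t(1) << 20,                               // medium chunks
      std::size_t(std::numeric_limits<int>::max()) - 1 }; // one big chunk

  for (const std::size_t max_chunk_size : sizes)
    {
      // ... start a timer here ...
      send_chunked(packed, dest, tag, comm, max_chunk_size);
      // ... wait for the matching receive_chunked() on rank dest, stop the timer ...
    }
}
```

Whether the limit ends up exposed as an extra argument, a Communicator setting, or something else is an open question; the point is just that no second method with a different argument list is needed.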

Publishing because I don't have much time to play with this right now and I don't want to forget about it when that changes, but also because I'd love to know if @friedmud can poke any holes in this idea.

friedmud (Member) commented Oct 9, 2020

The boolean will work, but I don't like the blocking part. I wouldn't want my non-blocking algorithm to secretly be blocking instead... I would rather get an error.

Can we think of a way to do this completely non-blocking?

friedmud (Member) commented Oct 9, 2020

BTW: 2e9 objects ought to be enough for anyone.... but seriously, I don't see hitting this limit.

As I was saying to @loganharbour, I expect that modern MPI implementations do the "optimal" thing when sending huge messages: they do their own internal chunking that makes sense for their transport type (Ethernet, InfiniBand, etc.). I don't think this is something we need to second-guess at this point.

friedmud (Member) commented Oct 9, 2020

Finally: if I'm sending something that huge (it would probably be more than 2GB!)... I really want that send/receive to be non-blocking! If it were secretly blocking I would be hella pissed!

roystgnr (Member, Author) commented Oct 9, 2020

> they do their own internal chunking

Of course they do. But they then reassemble those chunks before you're allowed to see any of them, which means you can't work on an initial chunk until after the final chunk has arrived and been copied into place.

> but seriously, I don't see hitting this limit.

In which case there's no problem with the proposed API here, right? Just pass INT_MAX (which we could leave as the default) if you want one chunk and you don't expect to hit that limit.

> Can we think of a way to do this completely non-blocking?

I do wish. But can you at least agree with me that you'd be less pissed about the wait() being potentially partially blocking (the receive() call itself would be non-blocking) than about the current outcome, the send crashing?
