
Measure renormalization effects on performance of reproducible reduction #2119

Closed
Tracked by #1558

gevtushenko opened this issue Jul 31, 2024 · 2 comments
gevtushenko commented Jul 31, 2024

The reduction agent uses even-share scheduling to assign multiple tiles to each thread block. This means that a given thread accumulates more than ITEMS_PER_THREAD values. With the current parameters, we exceed RFA endurance at an input size of 2^31 elements. To ensure functional correctness of the algorithm, we have to renormalize the accumulator. There are three options at hand:

  1. Renormalize after each ITEMS_PER_THREAD: this change shouldn't complicate the code much. Given that a renormalization is estimated to cost ~2 deposits, performing one renormalization every 16 items seems like a small overhead. Nevertheless, this approach has the downside of performing more renormalizations than needed.
  2. Introduce a deposit counter and renormalize when it reaches the endurance: this change complicates the kernel, but is work-optimal. If the first approach leads to a performance overhead of more than 3%, we should investigate this option.
  3. Get rid of even share: this deviates even further from the original device reduction. The idea is to assign at most one tile to each thread block and statically guarantee that the number of deposits never reaches the endurance. We should investigate this approach if the previous two introduce a performance overhead of more than 3%.
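The trade-off between options 1 and 2 can be sketched with a host-side model of the deposit counter. This is a minimal sketch, not the actual RFA code: `ReproAccum`, `ITEMS_PER_THREAD = 16`, and `ENDURANCE = 1024` are hypothetical stand-ins for the real binned accumulator state and tuning constants.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical constants; the real values live in the agent's tuning policy,
// and ENDURANCE stands in for the RFA deposit endurance.
constexpr std::size_t ITEMS_PER_THREAD = 16;
constexpr std::size_t ENDURANCE        = 1024;

// Minimal stand-in for a reproducible accumulator: we model only the deposit
// counter, not the actual binned floating-point state.
struct ReproAccum
{
  double      value    = 0.0;
  std::size_t deposits = 0;
  std::size_t renorms  = 0;

  void deposit(double x)
  {
    value += x; // placeholder for the real RFA deposit
    ++deposits;
  }

  void renormalize()
  {
    // Placeholder: the real renormalization costs ~2 deposits of work.
    deposits = 0;
    ++renorms;
  }
};

// Option 1: renormalize unconditionally after every ITEMS_PER_THREAD items.
std::size_t renorms_always(const double* in, std::size_t n)
{
  ReproAccum acc;
  for (std::size_t i = 0; i < n; ++i)
  {
    acc.deposit(in[i]);
    if (acc.deposits == ITEMS_PER_THREAD)
      acc.renormalize();
  }
  return acc.renorms;
}

// Option 2: track deposits and renormalize only when endurance is reached.
std::size_t renorms_counter(const double* in, std::size_t n)
{
  ReproAccum acc;
  for (std::size_t i = 0; i < n; ++i)
  {
    acc.deposit(in[i]);
    if (acc.deposits == ENDURANCE)
      acc.renormalize();
  }
  return acc.renorms;
}
```

Under this model, option 1 performs n / ITEMS_PER_THREAD renormalizations while option 2 performs only n / ENDURANCE, which is why option 2 is work-optimal despite the extra counter bookkeeping.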

The issue can be closed by NVBench results illustrating the first change from the list above that doesn't lead to a >3% performance regression compared to the code without renormalization.

@SAtacker

We see that approaches 1 and 2 both regress performance by 23-24%.

On an 80 GB H100 with HBM3:

  1. Always (option 1):
    • 1<<20: 145.011 GB/s
    • 1<<24: 951.908 GB/s
    • 1<<28: 2.805 TB/s

  2. Counter (option 2):
    • 1<<20: 145.483 GB/s before renorm, 144.665 GB/s after
    • 1<<24: 1.240 TB/s before renorm, 955.478 GB/s after
    • 1<<28: 2.872 TB/s before renorm, 2.800 TB/s after

The 1<<16 input size remains unaffected.

Conclusion: try option 3, which is closest to what Kate's implementation does.

@SAtacker

Final benchmarks can be viewed here: https://gist.github.com/SAtacker/e3410fdc01a42442645b4890c7d609a6
We found that the earlier comparison against Kate's many-kernel variant was not a proper one, because that variant never invoked renormalization. The benchmarks above use the same input and a proper renormalization mechanism.
We have proceeded with option 3 for now. We will investigate even-share performance in the future, and keep option 3 because it requires minimal changes to the code base and is about as optimal and simple as it can be.
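For reference, the static guarantee behind option 3 can be sketched as follows. This is a sketch under assumed tile parameters: `BLOCK_THREADS`, `ITEMS_PER_THREAD`, and `grid_size` are illustrative names, not the actual agent code.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical tile parameters; the real ones come from the agent's policy.
constexpr std::size_t BLOCK_THREADS    = 256;
constexpr std::size_t ITEMS_PER_THREAD = 16;
constexpr std::size_t TILE_SIZE        = BLOCK_THREADS * ITEMS_PER_THREAD;

// Option 3: launch one tile per thread block instead of using even share.
// Each thread then makes at most ITEMS_PER_THREAD deposits, so as long as
// ITEMS_PER_THREAD stays below the RFA endurance, no renormalization is
// needed inside the tile loop -- the guarantee holds at compile time.
constexpr std::size_t grid_size(std::size_t num_items)
{
  return (num_items + TILE_SIZE - 1) / TILE_SIZE; // ceiling division
}

static_assert(grid_size(1) == 1, "a partial tile still needs one block");
```

The cost of this scheme is a larger grid (one block per tile rather than a fixed even-share grid), which is the deviation from the original device reduction noted in the issue description.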
