
Measure renormalization effects on performance of reproducible reduction #2119

Closed
Tracked by #1558

gevtushenko opened this issue Jul 31, 2024 · 2 comments
gevtushenko commented Jul 31, 2024

The reduction agent uses even-share scheduling to assign multiple tiles to each thread block. This means that a given thread accumulates more than ITEMS_PER_THREAD values. With the current parameters, we exceed RFA endurance at an input size of 2^31 elements. To ensure functional correctness of the algorithm, we have to renormalize the accumulator. There are three options at hand:

  1. Renormalize after each ITEMS_PER_THREAD: this change shouldn't complicate the code much. Given that a renormalization is estimated to cost ~2 deposits, performing one renormalization every 16 items seems like a small overhead. Nevertheless, this approach has the downside of performing more renormalizations than needed.
  2. Introduce a deposit counter and renormalize when it reaches the endurance: this change complicates the kernel, but is work-optimal. If the first approach leads to a performance overhead of more than 3%, we should investigate this option.
  3. Get rid of even share: this deviates even further from the original device reduction. The idea is to assign at most one tile to each thread block and statically guarantee that the number of deposits never reaches the endurance. We should investigate this approach if the previous two introduce a performance overhead of more than 3%.
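The trade-off between options 1 and 2 can be sketched with a host-side model of the deposit counter. This is a minimal sketch, not the actual RFA code: `ReproAccum`, `ITEMS_PER_THREAD = 16`, and `ENDURANCE = 1024` are hypothetical stand-ins for the real binned accumulator state and tuning constants.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical constants; the real values live in the agent's tuning policy,
// and ENDURANCE stands in for the RFA deposit endurance.
constexpr std::size_t ITEMS_PER_THREAD = 16;
constexpr std::size_t ENDURANCE        = 1024;

// Minimal stand-in for a reproducible accumulator: we model only the deposit
// counter, not the actual binned floating-point state.
struct ReproAccum
{
  double      value    = 0.0;
  std::size_t deposits = 0;
  std::size_t renorms  = 0;

  void deposit(double x)
  {
    value += x; // placeholder for the real RFA deposit
    ++deposits;
  }

  void renormalize()
  {
    // Placeholder: the real renormalization costs ~2 deposits of work.
    deposits = 0;
    ++renorms;
  }
};

// Option 1: renormalize unconditionally after every ITEMS_PER_THREAD items.
std::size_t renorms_always(const double* in, std::size_t n)
{
  ReproAccum acc;
  for (std::size_t i = 0; i < n; ++i)
  {
    acc.deposit(in[i]);
    if (acc.deposits == ITEMS_PER_THREAD)
      acc.renormalize();
  }
  return acc.renorms;
}

// Option 2: track deposits and renormalize only when endurance is reached.
std::size_t renorms_counter(const double* in, std::size_t n)
{
  ReproAccum acc;
  for (std::size_t i = 0; i < n; ++i)
  {
    acc.deposit(in[i]);
    if (acc.deposits == ENDURANCE)
      acc.renormalize();
  }
  return acc.renorms;
}
```

Under this model, option 1 performs n / ITEMS_PER_THREAD renormalizations while option 2 performs only n / ENDURANCE, which is why option 2 is work-optimal despite the extra counter bookkeeping.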

The issue can be closed by NVBench results illustrating the first change from the list above that doesn't lead to a >3% performance regression compared to the code without renormalization.

@SAtacker

We see that approaches 1 and 2 both regress performance by 23-24%.

On an 80 GB H100 with HBM3:

  1. Always (option 1):
    • 1<<20: 145.011 GB/s
    • 1<<24: 951.908 GB/s
    • 1<<28: 2.805 TB/s

  2. Counter (option 2):
    • 1<<20: 145.483 GB/s before renorm, 144.665 GB/s after
    • 1<<24: 1.240 TB/s before renorm, 955.478 GB/s after
    • 1<<28: 2.872 TB/s before renorm, 2.800 TB/s after

The 1<<16 input size remains unaffected.

Conclusion: try option 3, which is closest to what Kate's implementation does.

@SAtacker

Final benchmarks can be viewed here: https://gist.github.com/SAtacker/e3410fdc01a42442645b4890c7d609a6
We found that the earlier comparison against Kate's many-kernel variant was not a proper one, because that variant never invoked renormalization. The benchmarks above use the same input and a proper renormalization mechanism.
We have proceeded with option 3 for now. We will investigate even-share performance in the future, and keep option 3 because it requires minimal changes to the code base and is about as optimal and simple as it can be.
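For reference, the static guarantee behind option 3 can be sketched as follows. This is a sketch under assumed tile parameters: `BLOCK_THREADS`, `ITEMS_PER_THREAD`, and `grid_size` are illustrative names, not the actual agent code.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical tile parameters; the real ones come from the agent's policy.
constexpr std::size_t BLOCK_THREADS    = 256;
constexpr std::size_t ITEMS_PER_THREAD = 16;
constexpr std::size_t TILE_SIZE        = BLOCK_THREADS * ITEMS_PER_THREAD;

// Option 3: launch one tile per thread block instead of using even share.
// Each thread then makes at most ITEMS_PER_THREAD deposits, so as long as
// ITEMS_PER_THREAD stays below the RFA endurance, no renormalization is
// needed inside the tile loop -- the guarantee holds at compile time.
constexpr std::size_t grid_size(std::size_t num_items)
{
  return (num_items + TILE_SIZE - 1) / TILE_SIZE; // ceiling division
}

static_assert(grid_size(1) == 1, "a partial tile still needs one block");
```

The cost of this scheme is a larger grid (one block per tile rather than a fixed even-share grid), which is the deviation from the original device reduction noted in the issue description.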
