The reduction agent uses even share to assign multiple tiles to thread blocks. This means that a given thread accumulates more than `ITEMS_PER_THREAD` values. With the current parameters, we exceed the RFA endurance at input sizes of 2^31 elements. To ensure functional correctness of the algorithm, we have to renormalize the accumulator. There are three options at hand:
1. Renormalize after each `ITEMS_PER_THREAD` items: this change shouldn't complicate the code much. Given the estimate that a renormalization costs ~2 deposits, performing one renormalization every 16 items seems like a small overhead. Nevertheless, this approach has the downside of performing more renormalizations than needed (see the first sketch after this list).
2. Introduce a deposit counter and renormalize when it reaches the endurance: this change complicates the kernel, but is work-optimal. If the first approach leads to a performance overhead of more than 3%, we should investigate this option (also covered in the first sketch below).
3. Get rid of even share: this deviates even further from the original device reduction. The idea is to assign at most one tile to each thread block and statically guarantee that the number of deposits doesn't reach the endurance. We should investigate this approach if the previous two introduce a performance overhead of more than 3% (see the second sketch below).
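A minimal host-side sketch of the first two options. `toy_rfa`, its `endurance` value, `deposit`, and `renorm` are hypothetical stand-ins, not names from the actual RFA implementation; the point is only to contrast the two renormalization schedules:

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Toy stand-in for the reproducible accumulator; the real RFA type, its
// endurance constant, and its renormalization routine are library-specific.
struct toy_rfa
{
  static constexpr int endurance = 254; // assumed deposit budget, not the real value
  double sum = 0.0;

  void deposit(double x) { sum += x; }  // stands in for a binned RFA deposit
  void renorm() { /* a real RFA rescales its bins here; costs ~2 deposits */ }
};

// Option 1: renormalize unconditionally after every ItemsPerThread items.
// Simple, but renormalizes far more often than the endurance requires.
template <int ItemsPerThread>
double reduce_renorm_per_tile(const std::vector<double>& in)
{
  toy_rfa acc;
  for (std::size_t i = 0; i < in.size(); ++i)
  {
    acc.deposit(in[i]);
    if ((i + 1) % ItemsPerThread == 0)
      acc.renorm();
  }
  return acc.sum;
}

// Option 2: count deposits and renormalize only when the endurance budget
// is exhausted; work-optimal, at the cost of threading a counter through
// the kernel.
double reduce_renorm_on_endurance(const std::vector<double>& in)
{
  toy_rfa acc;
  int deposits = 0;
  for (double x : in)
  {
    acc.deposit(x);
    if (++deposits == toy_rfa::endurance)
    {
      acc.renorm();
      deposits = 0;
    }
  }
  return acc.sum;
}

int main()
{
  std::vector<double> in(1 << 20, 1.0);
  std::printf("option 1: %g\n", reduce_renorm_per_tile<16>(in));
  std::printf("option 2: %g\n", reduce_renorm_on_endurance(in));
}
```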
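And a sketch of the third option: with at most one tile per thread block, each thread performs at most `ITEMS_PER_THREAD` deposits, so staying within the endurance becomes a compile-time property and no runtime renormalization is needed in the tile loop. Again, `toy_rfa` and its `endurance` value are illustrative assumptions:

```cpp
#include <cstdio>

// Minimal stub of the reproducible accumulator (illustrative only).
struct toy_rfa
{
  static constexpr int endurance = 254; // assumed deposit budget
  double sum = 0.0;
  void deposit(double x) { sum += x; }
};

// One tile per block: the deposit count per thread is bounded by
// ItemsPerThread, so the endurance bound can be checked statically.
template <int ItemsPerThread>
double consume_single_tile(const double (&tile)[ItemsPerThread])
{
  static_assert(ItemsPerThread <= toy_rfa::endurance,
                "one tile per block must keep deposits within RFA endurance");
  toy_rfa acc;
  for (double x : tile)
    acc.deposit(x); // at most ItemsPerThread deposits, by construction
  return acc.sum;
}

int main()
{
  double tile[16] = {};
  std::printf("%g\n", consume_single_tile(tile));
}
```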
The issue can be closed by NVBench results illustrating the first change from the list above that doesn't lead to a >3% performance regression compared to the code without renormalization.
Final benchmarks can be viewed here: https://gist.github.com/SAtacker/e3410fdc01a42442645b4890c7d609a6
We found out that it wasn't a proper comparison against Kate's many-kernel version, because it never invoked renormalization. The benchmarks above use the same input and a proper renormalization mechanism.
We have proceeded with option 3 for now. We will investigate even-share performance in the future; we keep option 3 because it requires minimal changes to the code base and is as optimal and simple as it can be.