
multi-gpu support for Ampere #6

Open

lu-jialin opened this issue Jul 29, 2022 · 3 comments

@lu-jialin

Hi, I saw that in the multi_gpu branch, the GPU architecture is statically pinned to Volta (sm_70):

$ cat benchmark_multi_gpu/Makefile|grep -m1 -C1 -- -gencode
NVCCGENCODE = \
              -gencode arch=compute_70,code=sm_70
            #   -gencode arch=compute_61,code=sm_61

Does the multi-GPU code support Ampere?

@sleeepyjack
Owner

sleeepyjack commented Aug 1, 2022

You can specify additional target architectures, e.g., compute_80 and compute_86 for Ampere. However, the default setting is optimized for a DGX-1 (Volta) system, hence the default compute_70 target. More specifically, the NVLink communication library that we use (gossip) selects a transfer plan which is optimized for the NVLink topology of that system. I would advise you to create such a plan for your target system using the provided generator scripts. This should give you the best performance.
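For example, a sketch of the corresponding Makefile change (compute_80 targets the A100; compute_86 targets GA10x consumer cards):

NVCCGENCODE = \
              -gencode arch=compute_70,code=sm_70 \
              -gencode arch=compute_80,code=sm_80 \
              -gencode arch=compute_86,code=sm_86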

CC @Funatiq for visibility since he is the PIC for gossip

@lu-jialin
Author

lu-jialin commented Aug 2, 2022

In fact, if I change the architecture in the Makefile to compute_80 without any other changes, benchmark_multi_gpu/single_value_benchmark.out hangs indefinitely. My environment:

  • x86_64 machine with 2× NVIDIA A100-40GB GPUs
  • glibc-2.27
  • gcc-7.5.0
  • cuda-11.7

Backtrace obtained from cuda-gdb's where command:

$ sudo /usr/local/cuda/bin/cuda-gdb -p `nvidia-smi|tail -n2|head -n1|tr -s ' '|cut -d' ' -f5` -ex 'where'
NVIDIA (R) CUDA Debugger
11.7 release
Portions Copyright (C) 2007-2022 NVIDIA Corporation
GNU gdb (GDB) 10.2

... ...

Thread 1 "single_value_be" received signal SIGURG, Urgent I/O condition.
0x00007fdef5feebec in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#0  0x00007fdef5feebec in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#1  0x00007fdef61ee292 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#2  0x00007fdef61eeda9 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3  0x00007fdef632dbe2 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#4  0x00007fdef5f9ea93 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#5  0x00007fdef5f9ef81 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#6  0x00007fdef5f9fef8 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#7  0x00007fdef61610c1 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#8  0x00005652212f45d9 in __cudart606 ()
#9  0x00005652212c333d in __cudart743 ()
#10 0x00005652213197b5 in cudaMemcpyAsync ()
#11 0x000056522129d7fb in warpcore::SingleValueHashTable, warpcore::hashers::MurmurHash, 8ul>, warpcore::storage::key_value::AoSStore, 2048ul>::size (stream=0x0, 
    this=0x565223b18a70) at ../include/single_value_hash_table.cuh:584
#12 warpcore::SingleValueHashTable, warpcore::hashers::MurmurHash, 8ul>, warpcore::storage::key_value::AoSStore, 2048ul>::load_factor (stream=, 
    this=0x565223b18a70) at ../include/single_value_hash_table.cuh:603
#13 single_value_benchmark, warpcore::hashers::MurmurHash, 8ul>, warpcore::storage::key_value::AoSStore, 2048ul> > (
    multi_split_overhead_factor=1.5, thermal_backoff=..., iters=5 '\005', load_factors=..., print_headers=true, 
    transfer_plan=..., dev_ids=..., input_sizes=...) at src/single_value_benchmark.cu:222
#14 main (argc=1, argv=0x7ffcd4d1e218) at src/single_value_benchmark.cu:325

I will try the generator scripts in gossip later. Thanks for the reply.

lu-jialin reopened this Aug 11, 2022

@lu-jialin
Author

lu-jialin commented Aug 11, 2022

I found that the multi-GPU program hangs indefinitely because of a GPU memory error.
In include/single_value_hash_table.cuh:542

cudaMemsetAsync(tmp, 0, sizeof(index_t), stream);

the memory is invalid at this point, and the program does not check the CUDA error here.
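A minimal sketch of the kind of check that would surface the failure instead of a silent hang (cudaMemsetAsync returns a cudaError_t; the repo's CUERR macro presumably does something similar):

// assumes <iostream> and <cstdlib> are included
cudaError_t err = cudaMemsetAsync(tmp, 0, sizeof(index_t), stream);
if(err != cudaSuccess) {
    std::cerr << "CUDA error: " << cudaGetErrorString(err) << std::endl;
    std::exit(1);
}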

And if I insert a cudaSetDevice call at benchmark_multi_gpu/src/single_value_benchmark.cu:222, so that

            for(uint32_t i = 0; i < num_gpus; ++i) {
                actual_load.emplace_back(hash_table[i].load_factor());
                status.emplace_back(hash_table[i].pop_status());
            }

becomes

            for(uint32_t i = 0; i < num_gpus; ++i) {
                cudaSetDevice(dev_ids[i]); CUERR
                actual_load.emplace_back(hash_table[i].load_factor());
                status.emplace_back(hash_table[i].pop_status());
            }

it works for me. It seems this establishes the correct association between the memory and its GPU?

Why it works on Volta without cudaSetDevice is still unknown, though.
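To illustrate the general rule with a standalone sketch (not code from the repo): memory and streams are bound to the device that was current when they were created, so every later runtime call that uses them should be issued with that same device current:

#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

int main() {
    int num_gpus = 0;
    cudaGetDeviceCount(&num_gpus);

    std::vector<int*> buffers(num_gpus);
    std::vector<cudaStream_t> streams(num_gpus);

    // create one buffer and one stream per GPU
    for(int i = 0; i < num_gpus; ++i) {
        cudaSetDevice(i);
        cudaMalloc(&buffers[i], sizeof(int));
        cudaStreamCreate(&streams[i]);
    }

    // later uses must make the owning device current again;
    // omitting this cudaSetDevice reproduces the class of bug described above
    for(int i = 0; i < num_gpus; ++i) {
        cudaSetDevice(i);
        cudaMemsetAsync(buffers[i], 0, sizeof(int), streams[i]);
    }

    // synchronize and clean up per device
    for(int i = 0; i < num_gpus; ++i) {
        cudaSetDevice(i);
        cudaStreamSynchronize(streams[i]);
        cudaStreamDestroy(streams[i]);
        cudaFree(buffers[i]);
    }
    printf("done\n");
    return 0;
}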
