
multi-gpu support for Ampere #6

Open

lu-jialin opened this issue Jul 29, 2022 · 3 comments

@lu-jialin

Hi, I saw that in the multi_gpu branch, the GPU architecture is statically pinned to Volta (sm_70):

$ cat benchmark_multi_gpu/Makefile|grep -m1 -C1 -- -gencode
NVCCGENCODE = \
              -gencode arch=compute_70,code=sm_70
            #   -gencode arch=compute_61,code=sm_61

Does the multi-GPU code support Ampere?

@sleeepyjack
Owner

sleeepyjack commented Aug 1, 2022

You can specify additional target architectures, e.g., compute_80 and compute_86 for Ampere. However, the default setting is optimized for a DGX-1 (Volta) system, hence the default compute_70 target. More specifically, the NVLink communication library that we use (gossip) selects a transfer plan which is optimized for the NVLink topology of that system. I would advise you to create such a plan for your target system using the provided generator scripts. This should give you the best performance.
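For example, a sketch of the corresponding Makefile change (compute_80 targets the A100; compute_86 targets GA10x consumer cards):

NVCCGENCODE = \
              -gencode arch=compute_70,code=sm_70 \
              -gencode arch=compute_80,code=sm_80 \
              -gencode arch=compute_86,code=sm_86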

CC @Funatiq for visibility since he is the PIC for gossip

@lu-jialin
Author

lu-jialin commented Aug 2, 2022

In fact, if I change the architecture in the Makefile to compute_80 without any other changes, benchmark_multi_gpu/single_value_benchmark.out hangs indefinitely. My environment:

  • x86_64 machine with 2× NVIDIA A100-40GB GPUs
  • glibc-2.27
  • gcc-7.5.0
  • cuda-11.7

Backtrace obtained from cuda-gdb's where command:

$ sudo /usr/local/cuda/bin/cuda-gdb -p `nvidia-smi|tail -n2|head -n1|tr -s ' '|cut -d' ' -f5` -ex 'where'
NVIDIA (R) CUDA Debugger
11.7 release
Portions Copyright (C) 2007-2022 NVIDIA Corporation
GNU gdb (GDB) 10.2

... ...

Thread 1 "single_value_be" received signal SIGURG, Urgent I/O condition.
0x00007fdef5feebec in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#0  0x00007fdef5feebec in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#1  0x00007fdef61ee292 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#2  0x00007fdef61eeda9 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3  0x00007fdef632dbe2 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#4  0x00007fdef5f9ea93 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#5  0x00007fdef5f9ef81 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#6  0x00007fdef5f9fef8 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#7  0x00007fdef61610c1 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#8  0x00005652212f45d9 in __cudart606 ()
#9  0x00005652212c333d in __cudart743 ()
#10 0x00005652213197b5 in cudaMemcpyAsync ()
#11 0x000056522129d7fb in warpcore::SingleValueHashTable, warpcore::hashers::MurmurHash, 8ul>, warpcore::storage::key_value::AoSStore, 2048ul>::size (stream=0x0, 
    this=0x565223b18a70) at ../include/single_value_hash_table.cuh:584
#12 warpcore::SingleValueHashTable, warpcore::hashers::MurmurHash, 8ul>, warpcore::storage::key_value::AoSStore, 2048ul>::load_factor (stream=, 
    this=0x565223b18a70) at ../include/single_value_hash_table.cuh:603
#13 single_value_benchmark, warpcore::hashers::MurmurHash, 8ul>, warpcore::storage::key_value::AoSStore, 2048ul> > (
    multi_split_overhead_factor=1.5, thermal_backoff=..., iters=5 '\005', load_factors=..., print_headers=true, 
    transfer_plan=..., dev_ids=..., input_sizes=...) at src/single_value_benchmark.cu:222
#14 main (argc=1, argv=0x7ffcd4d1e218) at src/single_value_benchmark.cu:325

I will try the generator scripts in gossip later. Thanks for the reply.

lu-jialin reopened this Aug 11, 2022

@lu-jialin
Author

lu-jialin commented Aug 11, 2022

I found that the multi-GPU program hangs indefinitely because of a GPU memory error.
In include/single_value_hash_table.cuh:542

cudaMemsetAsync(tmp, 0, sizeof(index_t), stream);

the memory is invalid at this point, and the program does not check the CUDA error here.
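A minimal sketch of the kind of check that would surface the failure instead of a silent hang (cudaMemsetAsync returns a cudaError_t; the repo's CUERR macro presumably does something similar):

// assumes <iostream> and <cstdlib> are included
cudaError_t err = cudaMemsetAsync(tmp, 0, sizeof(index_t), stream);
if(err != cudaSuccess) {
    std::cerr << "CUDA error: " << cudaGetErrorString(err) << std::endl;
    std::exit(1);
}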

And if I insert a cudaSetDevice call at benchmark_multi_gpu/src/single_value_benchmark.cu:222, so that

            for(uint32_t i = 0; i < num_gpus; ++i) {
                actual_load.emplace_back(hash_table[i].load_factor());
                status.emplace_back(hash_table[i].pop_status());
            }

becomes

            for(uint32_t i = 0; i < num_gpus; ++i) {
                cudaSetDevice(dev_ids[i]); CUERR
                actual_load.emplace_back(hash_table[i].load_factor());
                status.emplace_back(hash_table[i].pop_status());
            }

it works for me. It seems this establishes the correct association between the memory and its GPU?

Why it works on Volta without cudaSetDevice is still unknown, though.
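To illustrate the general rule with a standalone sketch (not code from the repo): memory and streams are bound to the device that was current when they were created, so every later runtime call that uses them should be issued with that same device current:

#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

int main() {
    int num_gpus = 0;
    cudaGetDeviceCount(&num_gpus);

    std::vector<int*> buffers(num_gpus);
    std::vector<cudaStream_t> streams(num_gpus);

    // create one buffer and one stream per GPU
    for(int i = 0; i < num_gpus; ++i) {
        cudaSetDevice(i);
        cudaMalloc(&buffers[i], sizeof(int));
        cudaStreamCreate(&streams[i]);
    }

    // later uses must make the owning device current again;
    // omitting this cudaSetDevice reproduces the class of bug described above
    for(int i = 0; i < num_gpus; ++i) {
        cudaSetDevice(i);
        cudaMemsetAsync(buffers[i], 0, sizeof(int), streams[i]);
    }

    // synchronize and clean up per device
    for(int i = 0; i < num_gpus; ++i) {
        cudaSetDevice(i);
        cudaStreamSynchronize(streams[i]);
        cudaStreamDestroy(streams[i]);
        cudaFree(buffers[i]);
    }
    printf("done\n");
    return 0;
}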
