ddp-gpu-logs-pipeline-test-torch-gpu-pipeline-phnvq-flow-d7m2x-torch-ddp-1-1775013970.txt
/usr/local/lib/python3.11/site-packages/torch/nn/modules/transformer.py:20: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at ../torch/csrc/utils/tensor_numpy.cpp:84.)
  device: torch.device = torch.device(torch._C._get_default_device()),  # torch.device('cpu'),
Torch distributed launch config settings: {'min_nodes': 3, 'max_nodes': 3, 'node_rank': 1, 'nproc_per_node': 1, 'start_method': 'fork', 'rdzv_backend': 'static', 'rdzv_endpoint_url': 'torch-ddp-0-d6370c62-661b-4ab3-b6ce-4c8659917f04.argo.svc.cluster.local', 'rdzv_endpoint_port': 29200, 'run_id': '1', 'role': '', 'max_restarts': 3, 'monitor_interval': 30.0, 'redirects': '0', 'tee': '3', 'log_dir': None, 'log_line_prefix_template': None}
[2024-07-13 10:36:28,930] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -9) local_rank: 0 (pid: 16) of fn: torch_ddp (start_method: fork)
[2024-07-13 10:36:28,930] torch.distributed.elastic.multiprocessing.api: [ERROR] Traceback (most recent call last):
[2024-07-13 10:36:28,930] torch.distributed.elastic.multiprocessing.api: [ERROR]   File "/usr/local/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 441, in _poll
[2024-07-13 10:36:28,930] torch.distributed.elastic.multiprocessing.api: [ERROR]     self._pc.join(-1)
[2024-07-13 10:36:28,930] torch.distributed.elastic.multiprocessing.api: [ERROR]   File "/usr/local/lib/python3.11/site-packages/torch/multiprocessing/spawn.py", line 140, in join
[2024-07-13 10:36:28,930] torch.distributed.elastic.multiprocessing.api: [ERROR]     raise ProcessExitedException(
[2024-07-13 10:36:28,930] torch.distributed.elastic.multiprocessing.api: [ERROR] torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGKILL
[2024-07-13 10:36:59,149] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -9) local_rank: 0 (pid: 23) of fn: torch_ddp (start_method: fork)
[2024-07-13 10:36:59,149] torch.distributed.elastic.multiprocessing.api: [ERROR] Traceback (most recent call last):
[2024-07-13 10:36:59,149] torch.distributed.elastic.multiprocessing.api: [ERROR]   File "/usr/local/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 441, in _poll
[2024-07-13 10:36:59,149] torch.distributed.elastic.multiprocessing.api: [ERROR]     self._pc.join(-1)
[2024-07-13 10:36:59,149] torch.distributed.elastic.multiprocessing.api: [ERROR]   File "/usr/local/lib/python3.11/site-packages/torch/multiprocessing/spawn.py", line 140, in join
[2024-07-13 10:36:59,149] torch.distributed.elastic.multiprocessing.api: [ERROR]     raise ProcessExitedException(
[2024-07-13 10:36:59,149] torch.distributed.elastic.multiprocessing.api: [ERROR] torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGKILL
[2024-07-13 10:37:29,349] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -9) local_rank: 0 (pid: 30) of fn: torch_ddp (start_method: fork)
[2024-07-13 10:37:29,349] torch.distributed.elastic.multiprocessing.api: [ERROR] Traceback (most recent call last):
[2024-07-13 10:37:29,349] torch.distributed.elastic.multiprocessing.api: [ERROR]   File "/usr/local/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 441, in _poll
[2024-07-13 10:37:29,349] torch.distributed.elastic.multiprocessing.api: [ERROR]     self._pc.join(-1)
[2024-07-13 10:37:29,349] torch.distributed.elastic.multiprocessing.api: [ERROR]   File "/usr/local/lib/python3.11/site-packages/torch/multiprocessing/spawn.py", line 140, in join
[2024-07-13 10:37:29,349] torch.distributed.elastic.multiprocessing.api: [ERROR]     raise ProcessExitedException(
[2024-07-13 10:37:29,349] torch.distributed.elastic.multiprocessing.api: [ERROR] torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGKILL
[2024-07-13 10:37:59,547] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -9) local_rank: 0 (pid: 37) of fn: torch_ddp (start_method: fork)
[2024-07-13 10:37:59,547] torch.distributed.elastic.multiprocessing.api: [ERROR] Traceback (most recent call last):
[2024-07-13 10:37:59,547] torch.distributed.elastic.multiprocessing.api: [ERROR]   File "/usr/local/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 441, in _poll
[2024-07-13 10:37:59,547] torch.distributed.elastic.multiprocessing.api: [ERROR]     self._pc.join(-1)
[2024-07-13 10:37:59,547] torch.distributed.elastic.multiprocessing.api: [ERROR]   File "/usr/local/lib/python3.11/site-packages/torch/multiprocessing/spawn.py", line 140, in join
[2024-07-13 10:37:59,547] torch.distributed.elastic.multiprocessing.api: [ERROR]     raise ProcessExitedException(
[2024-07-13 10:37:59,547] torch.distributed.elastic.multiprocessing.api: [ERROR] torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGKILL
Traceback (most recent call last):
  File "/argo/staging/script", line 61, in <module>
    torch_distributed_function(n_iter,n_seconds_sleep,duration)
  File "/src/bettmensch-ai/sdk/bettmensch_ai/torch_utils.py", line 124, in wrapper
    elastic_launch(
  File "/usr/local/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
torch_ddp FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-07-13_10:37:59
  host      : pipeline-test-torch-gpu-pipeline-phnvq-flow-d7m2x-torch-ddp-1-1
  rank      : 1 (local_rank: 0)
  exitcode  : -9 (pid: 37)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 37
============================================================
time="2024-07-13T10:38:03.617Z" level=info msg="sub-process exited" argo=true error="<nil>"
time="2024-07-13T10:38:03.618Z" level=error msg="cannot save parameter duration" argo=true error="open duration: no such file or directory"
Error: exit status 1