TorchRec DLRM Failed to initialize NumPy: _ARRAY_API not found #760

Open
rvernica opened this issue Aug 5, 2024 · 0 comments

rvernica commented Aug 5, 2024

Unable to run TorchRec DLRM using the provided Dockerfile and requirements.txt: the image builds, but both workers crash after `Failed to initialize NumPy: _ARRAY_API not found`. I'm using the latest revision of the master branch.

> cp Dockerfile Dockerfile.torchx
> torchx run -s local_docker dist.ddp -j 1x2 --script dlrm_main.py
torchx 2024-08-05 13:26:15 INFO     Tracker configurations: {}
torchx 2024-08-05 13:26:15 INFO     Checking for changes in workspace `file:///proj/java-gpu/training/recommendation_v2/torchrec_dlrm`...
torchx 2024-08-05 13:26:15 INFO     To disable workspaces pass: --workspace="" from CLI or workspace=None programmatically.
torchx 2024-08-05 13:26:15 INFO     Workspace `file:///proj/java-gpu/training/recommendation_v2/torchrec_dlrm` resolved to filesystem path `/proj/java-gpu/training/recommendation_v2/torchrec_dlrm`
torchx 2024-08-05 13:26:16 INFO     Building workspace docker image (this may take a while)...
torchx 2024-08-05 13:26:16 INFO     Step 1/7 : ARG FROM_IMAGE_NAME=pytorch/pytorch:1.13.1-cuda11.6-cudnn8-runtime
torchx 2024-08-05 13:26:16 INFO     Step 2/7 : FROM ${FROM_IMAGE_NAME}
torchx 2024-08-05 13:26:16 INFO      ---> 71eb2d092138
torchx 2024-08-05 13:26:16 INFO     Step 3/7 : RUN apt-get -y update &&     apt-get -y install git
torchx 2024-08-05 13:26:16 INFO      ---> Using cache
torchx 2024-08-05 13:26:16 INFO      ---> 45eded198de2
torchx 2024-08-05 13:26:16 INFO     Step 4/7 : WORKDIR /workspace/torchrec_dlrm
torchx 2024-08-05 13:26:16 INFO      ---> Using cache
torchx 2024-08-05 13:26:16 INFO      ---> 1b41a30dcd79
torchx 2024-08-05 13:26:16 INFO     Step 5/7 : COPY . .
torchx 2024-08-05 13:26:16 INFO      ---> ae30b5f5e5a1
torchx 2024-08-05 13:26:16 INFO     Step 6/7 : RUN pip install --no-cache-dir -r requirements.txt
torchx 2024-08-05 13:26:16 INFO      ---> Running in 3ef0c644fc38
...
torchx 2024-08-05 13:27:02 INFO      ---> Removed intermediate container 3ef0c644fc38
torchx 2024-08-05 13:27:02 INFO      ---> addfe3ce01cb
torchx 2024-08-05 13:27:02 INFO     Step 7/7 : LABEL torchx.pytorch.org/version=0.7.0
torchx 2024-08-05 13:27:02 INFO      ---> Running in 4e254643ce54
torchx 2024-08-05 13:27:02 INFO      ---> Removed intermediate container 4e254643ce54
torchx 2024-08-05 13:27:02 INFO      ---> 861ee2a4e5d3
torchx 2024-08-05 13:27:02 INFO     [Warning] One or more build-args [IMAGE WORKSPACE] were not consumed
torchx 2024-08-05 13:27:02 INFO     Successfully built 861ee2a4e5d3
torchx 2024-08-05 13:27:02 INFO     Built new image `sha256:861ee2a4e5d33dca93d9fe8847feccd4028d2e27c8f281654307aeec203452bd` based on original image `ghcr.io/pytorch/torchx:0.7.0` and changes in workspace `file:///proj/java-gpu/training/recommendation_v2/torchrec_dlrm` for role[0]=dlrm_main.
local_docker://torchx/dlrm_main-sbz7tbpcb2sqvd
torchx 2024-08-05 13:27:03 INFO     Waiting for the app to finish...
dlrm_main/0 WARNING:torch.distributed.run:
dlrm_main/0 *****************************************
dlrm_main/0 Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
dlrm_main/0 *****************************************
dlrm_main/0 [0]:
dlrm_main/0 [0]:A module that was compiled using NumPy 1.x cannot be run in
dlrm_main/0 [0]:NumPy 2.0.1 as it may crash. To support both 1.x and 2.x
dlrm_main/0 [0]:versions of NumPy, modules must be compiled with NumPy 2.0.
dlrm_main/0 [0]:Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.
dlrm_main/0 [0]:
dlrm_main/0 [0]:If you are a user of the module, the easiest solution will be to
dlrm_main/0 [0]:downgrade to 'numpy<2' or try to upgrade the affected module.
dlrm_main/0 [0]:We expect that some modules will need time to support NumPy 2.
dlrm_main/0 [0]:
dlrm_main/0 [0]:Traceback (most recent call last):  File "/workspace/torchrec_dlrm/dlrm_main.py", line 19, in <module>
dlrm_main/0 [0]:    import torchmetrics as metrics
dlrm_main/0 [0]:  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/__init__.py", line 14, in <module>
dlrm_main/0 [0]:    from torchmetrics import functional  # noqa: E402
dlrm_main/0 [0]:  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/functional/__init__.py", line 14, in <module>
dlrm_main/0 [0]:    from torchmetrics.functional.audio.pit import permutation_invariant_training, pit_permutate
dlrm_main/0 [0]:  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/functional/audio/__init__.py", line 14, in <module>
dlrm_main/0 [0]:    from torchmetrics.functional.audio.pit import permutation_invariant_training, pit_permutate  # noqa: F401
dlrm_main/0 [0]:  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/functional/audio/pit.py", line 21, in <module>
dlrm_main/0 [0]:    from torchmetrics.utilities.imports import _SCIPY_AVAILABLE
dlrm_main/0 [0]:  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/utilities/__init__.py", line 1, in <module>
dlrm_main/0 [0]:    from torchmetrics.utilities.checks import check_forward_full_state_property  # noqa: F401
dlrm_main/0 [0]:  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/utilities/checks.py", line 22, in <module>
dlrm_main/0 [0]:    from torchmetrics.utilities.data import select_topk, to_onehot
dlrm_main/0 [0]:  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/utilities/data.py", line 19, in <module>
dlrm_main/0 [0]:    from torchmetrics.utilities.imports import _TORCH_GREATER_EQUAL_1_12
dlrm_main/0 [0]:  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/utilities/imports.py", line 113, in <module>
dlrm_main/0 [0]:    _TORCHVISION_GREATER_EQUAL_0_8: Optional[bool] = _compare_version("torchvision", operator.ge, "0.8.0")
dlrm_main/0 [0]:  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/utilities/imports.py", line 79, in _compare_version
dlrm_main/0 [0]:    if not _module_available(package):
dlrm_main/0 [0]:  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/utilities/imports.py", line 60, in _module_available
dlrm_main/0 [0]:    module = import_module(module_names[0])
dlrm_main/0 [0]:  File "/opt/conda/lib/python3.10/importlib/__init__.py", line 126, in import_module
dlrm_main/0 [0]:    return _bootstrap._gcd_import(name[level:], package, level)
dlrm_main/0 [0]:  File "/opt/conda/lib/python3.10/site-packages/torchvision/__init__.py", line 5, in <module>
dlrm_main/0 [0]:    from torchvision import datasets, io, models, ops, transforms, utils
dlrm_main/0 [0]:  File "/opt/conda/lib/python3.10/site-packages/torchvision/models/__init__.py", line 17, in <module>
dlrm_main/0 [0]:    from . import detection, optical_flow, quantization, segmentation, video
dlrm_main/0 [0]:  File "/opt/conda/lib/python3.10/site-packages/torchvision/models/detection/__init__.py", line 1, in <module>
dlrm_main/0 [0]:    from .faster_rcnn import *
dlrm_main/0 [0]:  File "/opt/conda/lib/python3.10/site-packages/torchvision/models/detection/faster_rcnn.py", line 16, in <module>
dlrm_main/0 [0]:    from .anchor_utils import AnchorGenerator
dlrm_main/0 [0]:  File "/opt/conda/lib/python3.10/site-packages/torchvision/models/detection/anchor_utils.py", line 10, in <module>
dlrm_main/0 [0]:    class AnchorGenerator(nn.Module):
dlrm_main/0 [0]:  File "/opt/conda/lib/python3.10/site-packages/torchvision/models/detection/anchor_utils.py", line 63, in AnchorGenerator
dlrm_main/0 [0]:    device: torch.device = torch.device("cpu"),
dlrm_main/0 [0]:/opt/conda/lib/python3.10/site-packages/torchvision/models/detection/anchor_utils.py:63: UserWarning: Failed to initialize NumPy: _ARRAY_API not found (Triggered internally at /opt/conda/conda-bld/pytorch_1670525552843/work/torch/csrc/utils/tensor_numpy.cpp:77.)
dlrm_main/0 [0]:  device: torch.device = torch.device("cpu"),
dlrm_main/0 [1]:
dlrm_main/0 [1]:A module that was compiled using NumPy 1.x cannot be run in
dlrm_main/0 [1]:NumPy 2.0.1 as it may crash. To support both 1.x and 2.x
dlrm_main/0 [1]:versions of NumPy, modules must be compiled with NumPy 2.0.
dlrm_main/0 [1]:Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.
dlrm_main/0 [1]:
dlrm_main/0 [1]:If you are a user of the module, the easiest solution will be to
dlrm_main/0 [1]:downgrade to 'numpy<2' or try to upgrade the affected module.
dlrm_main/0 [1]:We expect that some modules will need time to support NumPy 2.
dlrm_main/0 [1]:
dlrm_main/0 [1]:Traceback (most recent call last):  File "/workspace/torchrec_dlrm/dlrm_main.py", line 19, in <module>
dlrm_main/0 [1]:    import torchmetrics as metrics
dlrm_main/0 [1]:  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/__init__.py", line 14, in <module>
dlrm_main/0 [1]:    from torchmetrics import functional  # noqa: E402
dlrm_main/0 [1]:  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/functional/__init__.py", line 14, in <module>
dlrm_main/0 [1]:    from torchmetrics.functional.audio.pit import permutation_invariant_training, pit_permutate
dlrm_main/0 [1]:  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/functional/audio/__init__.py", line 14, in <module>
dlrm_main/0 [1]:    from torchmetrics.functional.audio.pit import permutation_invariant_training, pit_permutate  # noqa: F401
dlrm_main/0 [1]:  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/functional/audio/pit.py", line 21, in <module>
dlrm_main/0 [1]:    from torchmetrics.utilities.imports import _SCIPY_AVAILABLE
dlrm_main/0 [1]:  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/utilities/__init__.py", line 1, in <module>
dlrm_main/0 [1]:    from torchmetrics.utilities.checks import check_forward_full_state_property  # noqa: F401
dlrm_main/0 [1]:  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/utilities/checks.py", line 22, in <module>
dlrm_main/0 [1]:    from torchmetrics.utilities.data import select_topk, to_onehot
dlrm_main/0 [1]:  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/utilities/data.py", line 19, in <module>
dlrm_main/0 [1]:    from torchmetrics.utilities.imports import _TORCH_GREATER_EQUAL_1_12
dlrm_main/0 [1]:  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/utilities/imports.py", line 113, in <module>
dlrm_main/0 [1]:    _TORCHVISION_GREATER_EQUAL_0_8: Optional[bool] = _compare_version("torchvision", operator.ge, "0.8.0")
dlrm_main/0 [1]:  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/utilities/imports.py", line 79, in _compare_version
dlrm_main/0 [1]:    if not _module_available(package):
dlrm_main/0 [1]:  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/utilities/imports.py", line 60, in _module_available
dlrm_main/0 [1]:    module = import_module(module_names[0])
dlrm_main/0 [1]:  File "/opt/conda/lib/python3.10/importlib/__init__.py", line 126, in import_module
dlrm_main/0 [1]:    return _bootstrap._gcd_import(name[level:], package, level)
dlrm_main/0 [1]:  File "/opt/conda/lib/python3.10/site-packages/torchvision/__init__.py", line 5, in <module>
dlrm_main/0 [1]:    from torchvision import datasets, io, models, ops, transforms, utils
dlrm_main/0 [1]:  File "/opt/conda/lib/python3.10/site-packages/torchvision/models/__init__.py", line 17, in <module>
dlrm_main/0 [1]:    from . import detection, optical_flow, quantization, segmentation, video
dlrm_main/0 [1]:  File "/opt/conda/lib/python3.10/site-packages/torchvision/models/detection/__init__.py", line 1, in <module>
dlrm_main/0 [1]:    from .faster_rcnn import *
dlrm_main/0 [1]:  File "/opt/conda/lib/python3.10/site-packages/torchvision/models/detection/faster_rcnn.py", line 16, in <module>
dlrm_main/0 [1]:    from .anchor_utils import AnchorGenerator
dlrm_main/0 [1]:  File "/opt/conda/lib/python3.10/site-packages/torchvision/models/detection/anchor_utils.py", line 10, in <module>
dlrm_main/0 [1]:    class AnchorGenerator(nn.Module):
dlrm_main/0 [1]:  File "/opt/conda/lib/python3.10/site-packages/torchvision/models/detection/anchor_utils.py", line 63, in AnchorGenerator
dlrm_main/0 [1]:    device: torch.device = torch.device("cpu"),
dlrm_main/0 [1]:/opt/conda/lib/python3.10/site-packages/torchvision/models/detection/anchor_utils.py:63: UserWarning: Failed to initialize NumPy: _ARRAY_API not found (Triggered internally at /opt/conda/conda-bld/pytorch_1670525552843/work/torch/csrc/utils/tensor_numpy.cpp:77.)
dlrm_main/0 [1]:  device: torch.device = torch.device("cpu"),
dlrm_main/0 [1]:Traceback (most recent call last):
dlrm_main/0 [1]:  File "/workspace/torchrec_dlrm/dlrm_main.py", line 939, in <module>
dlrm_main/0 [1]:    main(sys.argv[1:])
dlrm_main/0 [1]:  File "/workspace/torchrec_dlrm/dlrm_main.py", line 813, in main
dlrm_main/0 [1]:    plan = planner.collective_plan(
dlrm_main/0 [1]:  File "/opt/conda/lib/python3.10/site-packages/torchrec/distributed/planner/planners.py", line 177, in collective_plan
dlrm_main/0 [1]:    return invoke_on_rank_and_broadcast_result(
dlrm_main/0 [1]:  File "/opt/conda/lib/python3.10/site-packages/torchrec/distributed/collective_utils.py", line 58, in invoke_on_rank_and_broadcast_result
dlrm_main/0 [1]:    dist.broadcast_object_list(object_list, rank, group=pg)
dlrm_main/0 [1]:  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2106, in broadcast_object_list
dlrm_main/0 [1]:    object_list[i] = _tensor_to_object(obj_view, obj_size)
dlrm_main/0 [1]:  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1803, in _tensor_to_object
dlrm_main/0 [1]:    buf = tensor.numpy().tobytes()[:tensor_size]
dlrm_main/0 [1]:RuntimeError: Numpy is not available
dlrm_main/0 [0]:Traceback (most recent call last):
dlrm_main/0 [0]:  File "/workspace/torchrec_dlrm/dlrm_main.py", line 939, in <module>
dlrm_main/0 [0]:    main(sys.argv[1:])
dlrm_main/0 [0]:  File "/workspace/torchrec_dlrm/dlrm_main.py", line 817, in main
dlrm_main/0 [0]:    model = DistributedModelParallel(
dlrm_main/0 [0]:  File "/opt/conda/lib/python3.10/site-packages/torchrec/distributed/model_parallel.py", line 232, in __init__
dlrm_main/0 [0]:    self.init_data_parallel()
dlrm_main/0 [0]:  File "/opt/conda/lib/python3.10/site-packages/torchrec/distributed/model_parallel.py", line 266, in init_data_parallel
dlrm_main/0 [0]:    self._data_parallel_wrapper.wrap(self, self._env, self.device)
dlrm_main/0 [0]:  File "/opt/conda/lib/python3.10/site-packages/torchrec/distributed/model_parallel.py", line 97, in wrap
dlrm_main/0 [0]:    DistributedDataParallel(
dlrm_main/0 [0]:  File "/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 655, in __init__
dlrm_main/0 [0]:    _verify_param_shape_across_processes(self.process_group, parameters)
dlrm_main/0 [0]:  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/utils.py", line 112, in _verify_param_shape_across_processes
dlrm_main/0 [0]:    return dist._verify_params_across_processes(process_group, tensors, logger)
dlrm_main/0 [0]:RuntimeError: [/opt/conda/conda-bld/pytorch_1670525552843/work/third_party/gloo/gloo/transport/tcp/pair.cc:598] Connection closed by peer [172.20.0.2]:54499
dlrm_main/0 ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 25) of binary: /opt/conda/bin/python
dlrm_main/0 [0]:libcuda.so.1: cannot open shared object file: No such file or directory
dlrm_main/0 [1]:libcuda.so.1: cannot open shared object file: No such file or directory
dlrm_main/0 [0]::::MLLOG {"namespace": "", "time_ms": 1722889625323, "event_type": "POINT_IN_TIME", "key": "cache_clear", "value": true, "metadata": {"file": "/workspace/torchrec_dlrm/dlrm_main.py", "lineno": 660}}
dlrm_main/0 [0]::::MLLOG {"namespace": "", "time_ms": 1722889625376, "event_type": "INTERVAL_START", "key": "init_start", "value": null, "metadata": {"file": "/workspace/torchrec_dlrm/dlrm_main.py", "lineno": 661}}
dlrm_main/0 [1]::::MLLOG {"namespace": "", "time_ms": 1722889625323, "event_type": "POINT_IN_TIME", "key": "cache_clear", "value": true, "metadata": {"file": "/workspace/torchrec_dlrm/dlrm_main.py", "lineno": 660}}
dlrm_main/0 [1]::::MLLOG {"namespace": "", "time_ms": 1722889625376, "event_type": "INTERVAL_START", "key": "init_start", "value": null, "metadata": {"file": "/workspace/torchrec_dlrm/dlrm_main.py", "lineno": 661}}
dlrm_main/0 [0]:{'adagrad': False,
dlrm_main/0 [0]: 'allow_tf32': False,
dlrm_main/0 [0]: 'batch_size': 32,
dlrm_main/0 [0]: 'collect_multi_hot_freqs_stats': False,
dlrm_main/0 [0]: 'dataset_name': 'criteo_1t',
dlrm_main/0 [0]: 'dcn_low_rank_dim': 512,
dlrm_main/0 Traceback (most recent call last):
dlrm_main/0   File "/opt/conda/bin/torchrun", line 33, in <module>
dlrm_main/0     sys.exit(load_entry_point('torch==1.13.1', 'console_scripts', 'torchrun')())
dlrm_main/0   File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
dlrm_main/0 [0]: 'dcn_num_layers': 3,
dlrm_main/0 [0]: 'dense_arch_layer_sizes': [512, 256, 64],
dlrm_main/0 [0]: 'drop_last_training_batch': False,
dlrm_main/0 [0]: 'embedding_dim': 64,
dlrm_main/0 [0]: 'epochs': 1,
dlrm_main/0 [0]: 'evaluate_on_epoch_end': False,
dlrm_main/0 [0]: 'evaluate_on_training_end': False,
dlrm_main/0 [0]: 'in_memory_binary_criteo_path': None,
dlrm_main/0 [0]: 'interaction_branch1_layer_sizes': [2048, 2048],
dlrm_main/0 [0]: 'interaction_branch2_layer_sizes': [2048, 2048],
dlrm_main/0 [0]: 'interaction_type': <InteractionType.ORIGINAL: 'original'>,
dlrm_main/0 [0]: 'learning_rate': 15.0,
dlrm_main/0 [0]: 'limit_test_batches': None,
dlrm_main/0 [0]: 'limit_train_batches': None,
dlrm_main/0 [0]: 'limit_val_batches': None,
dlrm_main/0 [0]: 'lr_decay_start': 0,
dlrm_main/0 [0]: 'lr_decay_steps': 0,
dlrm_main/0 [0]: 'lr_warmup_steps': 0,
dlrm_main/0 [0]: 'mmap_mode': False,
dlrm_main/0 [0]: 'multi_hot_distribution_type': None,
dlrm_main/0 [0]: 'multi_hot_sizes': None,
dlrm_main/0 [0]: 'num_embeddings': 100000,
dlrm_main/0 [0]: 'num_embeddings_per_feature': None,
dlrm_main/0 [0]: 'over_arch_layer_sizes': [512, 512, 256, 1],
dlrm_main/0 [0]: 'pin_memory': False,
dlrm_main/0 [0]: 'print_lr': False,
dlrm_main/0 [0]: 'print_progress': False,
dlrm_main/0 [0]: 'print_sharding_plan': False,
dlrm_main/0 [0]: 'seed': None,
dlrm_main/0 [0]: 'shuffle_batches': False,
dlrm_main/0 [0]: 'shuffle_training_set': False,
dlrm_main/0 [0]: 'synthetic_multi_hot_criteo_path': None,
dlrm_main/0 [0]: 'test_batch_size': None,
dlrm_main/0 [0]: 'validation_auroc': None,
dlrm_main/0 [0]: 'validation_freq_within_epoch': None}
dlrm_main/0 [0]::::MLLOG {"namespace": "", "time_ms": 1722889626392, "event_type": "POINT_IN_TIME", "key": "submission_benchmark", "value": "dlrm_dcnv2", "metadata": {"file": "/workspace/torchrec_dlrm/mlperf_logging_utils.py", "lineno": 7}}
dlrm_main/0 [0]::::MLLOG {"namespace": "", "time_ms": 1722889626392, "event_type": "POINT_IN_TIME", "key": "submission_org", "value": "reference_implementation", "metadata": {"file": "/workspace/torchrec_dlrm/mlperf_logging_utils.py", "lineno": 11}}
dlrm_main/0 [0]::::MLLOG {"namespace": "", "time_ms": 1722889626392, "event_type": "POINT_IN_TIME", "key": "submission_division", "value": "closed", "metadata": {"file": "/workspace/torchrec_dlrm/mlperf_logging_utils.py", "lineno": 15}}
dlrm_main/0 [0]::::MLLOG {"namespace": "", "time_ms": 1722889626392, "event_type": "POINT_IN_TIME", "key": "submission_status", "value": "onprem", "metadata": {"file": "/workspace/torchrec_dlrm/mlperf_logging_utils.py", "lineno": 19}}
dlrm_main/0 [0]::::MLLOG {"namespace": "", "time_ms": 1722889626393, "event_type": "POINT_IN_TIME", "key": "submission_platform", "value": "reference_implementation", "metadata": {"file": "/workspace/torchrec_dlrm/mlperf_logging_utils.py", "lineno": 23}}
dlrm_main/0 [0]::::MLLOG {"namespace": "", "time_ms": 1722889626393, "event_type": "POINT_IN_TIME", "key": "global_batch_size", "value": 64, "metadata": {"file": "/workspace/torchrec_dlrm/dlrm_main.py", "lineno": 705}}
dlrm_main/0 [0]::::MLLOG {"namespace": "", "time_ms": 1722889626393, "event_type": "POINT_IN_TIME", "key": "gradient_accumulation_steps", "value": 1, "metadata": {"file": "/workspace/torchrec_dlrm/dlrm_main.py", "lineno": 709}}
dlrm_main/0     return f(*args, **kwargs)
dlrm_main/0   File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 762, in main
dlrm_main/0 [0]::::MLLOG {"namespace": "", "time_ms": 1722889626393, "event_type": "POINT_IN_TIME", "key": "seed", "value": null, "metadata": {"file": "/workspace/torchrec_dlrm/dlrm_main.py", "lineno": 713}}
dlrm_main/0     run(args)
dlrm_main/0   File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 753, in run
dlrm_main/0     elastic_launch(
dlrm_main/0   File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
dlrm_main/0     return launch_agent(self._config, self._entrypoint, list(args))
dlrm_main/0   File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
dlrm_main/0     raise ChildFailedError(
dlrm_main/0 torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
dlrm_main/0 ============================================================
dlrm_main/0 dlrm_main.py FAILED
dlrm_main/0 ------------------------------------------------------------
dlrm_main/0 Failures:
dlrm_main/0 [1]:
dlrm_main/0   time      : 2024-08-05_20:27:09
dlrm_main/0   host      : dlrm_main-sbz7tbpcb2sqvd-dlrm_main-0
dlrm_main/0   rank      : 1 (local_rank: 1)
dlrm_main/0   exitcode  : 1 (pid: 26)
dlrm_main/0   error_file: <N/A>
dlrm_main/0   traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
dlrm_main/0 ------------------------------------------------------------
dlrm_main/0 Root Cause (first observed failure):
dlrm_main/0 [0]:
dlrm_main/0   time      : 2024-08-05_20:27:09
dlrm_main/0   host      : dlrm_main-sbz7tbpcb2sqvd-dlrm_main-0
dlrm_main/0   rank      : 0 (local_rank: 0)
dlrm_main/0   exitcode  : 1 (pid: 25)
dlrm_main/0   error_file: <N/A>
dlrm_main/0   traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
dlrm_main/0 ============================================================
torchx 2024-08-05 13:27:10 INFO     Job finished: FAILED
torchx 2024-08-05 13:27:10 ERROR    AppStatus:
  msg: <NONE>
  num_restarts: -1
  roles:
  - replicas:
    - hostname: dlrm_main-sbz7tbpcb2sqvd-dlrm_main-0
      id: 0
      role: dlrm_main
      state: !!python/object/apply:torchx.specs.api.AppState
      - 5
      structured_error_msg: <NONE>
    role: dlrm_main
  state: FAILED (5)
  structured_error_msg: <NONE>
  ui_url: null
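
The NumPy warning in the log points at the likely cause: the base image `pytorch/pytorch:1.13.1-cuda11.6-cudnn8-runtime` ships torch/torchvision builds compiled against NumPy 1.x, while the `pip install -r requirements.txt` step pulls in NumPy 2.0.1. A minimal, unverified sketch of the workaround the warning itself suggests (keep NumPy below 2.0) would be to add the constraint to the install step in the Dockerfile:

```dockerfile
# Unverified workaround sketch: pin NumPy to 1.x so it matches the NumPy ABI
# that the torch 1.13.1 / torchvision wheels in the base image were built
# against. pip resolves the "numpy<2" constraint together with requirements.txt.
RUN pip install --no-cache-dir "numpy<2" -r requirements.txt
```

Adding a `numpy<2` line to requirements.txt should have the same effect; neither variant has been verified against the MLPerf reference setup here.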