Trying to run inference with the FP8 version of the Llama 3.1 405B model (`Meta-Llama3.1-405B-Instruct`). The model was downloaded with `llama download --source huggingface --model-id Meta-Llama3.1-405B-Instruct --hf-token TOKEN`. However, the command `llama distribution start --name local-llama-405b --port 5000 --disable-ipv6` gave the following error:
```
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-08-25_04:55:10
  host      : node007
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 3088723)
  error_file: /tmp/torchelastic_kvox4nb5/ee89349c-cc4c-43c4-9796-1ceeb2986a3b_ugr5p160/attempt_0/0/error.json
  traceback : Traceback (most recent call last):
    File "/home/ubuntu/miniforge3/envs/local-llama-405b/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
      return f(*args, **kwargs)
    File "/home/ubuntu/taoz/llama-stack/llama_toolchain/inference/meta_reference/parallel_utils.py", line 131, in worker_process_entrypoint
      model = init_model_cb()
    File "/home/ubuntu/taoz/llama-stack/llama_toolchain/inference/meta_reference/model_parallel.py", line 48, in init_model_cb
      llama = Llama.build(config)
    File "/home/ubuntu/taoz/llama-stack/llama_toolchain/inference/meta_reference/generation.py", line 100, in build
      assert len(checkpoints) > 0, f"no checkpoint files found in {ckpt_dir}"
  AssertionError: no checkpoint files found in /home/ubuntu/.llama/checkpoints/Meta-Llama3.1-405B-Instruct/original
```
Under the `original` folder, the `consolidated.xx` entries are folders instead of files; I think that is probably why they were not found.