
Different perplexity when fine-tuning with parallel_model vs 1 gpu #86

Open · randy-ac opened this issue Aug 20, 2024 · 5 comments
Labels: question (Further information is requested)

randy-ac commented Aug 20, 2024

Hello,

We have noticed some unexpected behavior when fine-tuning a Llama 3 model on 1 GPU versus fine-tuning the same model on the same dataset with 2 GPUs in parallel mode. See the attached tensorboard graphs (red = run with parallel mode). The minimum validation perplexity differs between the two runs.

As you can see from the configs pasted below, the only parameters that differ between the two runs are world_size, gpu_ranks, and parallel_mode.

Could you please advise?

[Tensorboard screenshot: validation perplexity curves for the two runs (red = parallel mode)]

Configs for run with 1 GPU

# General settings
seed: 1234
share_vocab: true
save_data: "/nas-labs/LM/randy_LLM_exp/domain_classification_eole/llama3_eole/llama3-8b-instruct-1gpu-eole-issue"
src_vocab: "${EOLE_MODEL_DIR}/llama3-8b-instruct/vocab.txt" # size
src_vocab_size: 128256
tgt_vocab_size: 128256

overwrite: true

report_every: 10

n_sample: 0

tensorboard: true
tensorboard_log_dir: /nas-labs/LM/randy_LLM_exp/domain_classification_eole/llama3_eole/llama3-8b-instruct-1gpu-eole-issue/logs/


transforms: [insert_mask_before_placeholder, onmt_tokenize, filtertoolong]

transforms_configs:
    insert_mask_before_placeholder:
        response_patterns: ["<|start_header_id|>assistant<|end_header_id|>⦅newline⦆⦅newline⦆"]
    onmt_tokenize:
        src_subword_type: bpe
        src_subword_model: "${EOLE_MODEL_DIR}/llama3-8b-instruct/bpe.model"
        tgt_subword_type: bpe
        tgt_subword_model: "${EOLE_MODEL_DIR}/llama3-8b-instruct/bpe.model"
        gpt2_pretok: true
    filtertoolong:
        src_seq_length: 2048
        tgt_seq_length: 2048


data:
    new_synth_dataset:
        path_src: "/nas-labs/LM/randy_LLM_exp/new_synthetic_dataset/domain_subdomain_dataset_llama_instruct_no_hierarchy/synthetic-dataset-with-roles_train.shuffle"
        weight: 1 
    valid:
        path_src: "/nas-labs/LM/randy_LLM_exp/new_synthetic_dataset/domain_subdomain_dataset_llama_instruct_no_hierarchy/synthetic-dataset-with-roles_dev.shuffle"


skip_empty_level: silent # silently ignore empty lines in the data


training:
    world_size: 1
    gpu_ranks: [0]

    zero_out_prompt_loss: true

    train_steps: 1000
    valid_steps: 50

    dropout_steps: [0]
    dropout: [0.0]
    attention_dropout: [0.0]

    bucket_size: 10
    num_workers: 1
    batch_type: "sents"
    batch_size: 1
    valid_batch_size: 1
    batch_size_multiple: 1


    compute_dtype: fp16
    apex_opt_level: ""
    optim: "fusedadam"
    learning_rate: 2e-05
    warmup_steps: 100
    decay_method: "none"
    adam_beta2: 0.998
    accum_count: [8, 16]
    accum_steps: [0, 100]
    max_grad_norm: 0
    label_smoothing: 0.0
    param_init: 0
    param_init_glorot: true
    normalization: "tokens"

    train_from: "${EOLE_MODEL_DIR}/llama3-8b-instruct"
    model_path: "/nas-labs/LM/randy_LLM_exp/domain_classification_eole/llama3_eole/llama3-8b-instruct-1gpu-eole-issue"
    keep_checkpoint: 30
    save_checkpoint_steps: 200
    
    quant_layers: ['gate_up_proj', 'down_proj', 'up_proj'] 
    quant_type: "bnb_NF4"

    lora_layers: ['linear_values', 'linear_query', 'linear_keys', 'final_linear']
    lora_rank: 8 #5 #2
    lora_dropout: 0.05
    lora_alpha: 32
    lora_embedding: false

Configs for run with parallel_mode

seed: 1234
share_vocab: true
save_data: "/nas-labs/LM/randy_LLM_exp/domain_classification_eole/llama3_eole/llama3-8b-instruct-parallel-eole-issue"
src_vocab: "${EOLE_MODEL_DIR}/llama3-8b-instruct/vocab.txt" # size
src_vocab_size: 128256
tgt_vocab_size: 128256

overwrite: true

report_every: 10

n_sample: 0

tensorboard: true
tensorboard_log_dir: /nas-labs/LM/randy_LLM_exp/domain_classification_eole/llama3_eole/llama3-8b-instruct-parallel-eole-issue/logs/

transforms: [insert_mask_before_placeholder, onmt_tokenize, filtertoolong]

transforms_configs:
    insert_mask_before_placeholder:
        response_patterns: ["<|start_header_id|>assistant<|end_header_id|>⦅newline⦆⦅newline⦆"]
    onmt_tokenize:
        src_subword_type: bpe
        src_subword_model: "${EOLE_MODEL_DIR}/llama3-8b-instruct/bpe.model"
        tgt_subword_type: bpe
        tgt_subword_model: "${EOLE_MODEL_DIR}/llama3-8b-instruct/bpe.model"
        gpt2_pretok: true
    filtertoolong:
        src_seq_length: 2048
        tgt_seq_length: 2048

data:
    new_synth_dataset:
        path_src: "/nas-labs/LM/randy_LLM_exp/new_synthetic_dataset/domain_subdomain_dataset_llama_instruct_no_hierarchy/synthetic-dataset-with-roles_train.shuffle"
        weight: 1 
    valid:
        path_src: "/nas-labs/LM/randy_LLM_exp/new_synthetic_dataset/domain_subdomain_dataset_llama_instruct_no_hierarchy/synthetic-dataset-with-roles_dev.shuffle"


skip_empty_level: silent # silently ignore empty lines in the data


training:
    world_size: 2
    gpu_ranks: [0, 1]
    parallel_mode: "tensor_parallel"
    zero_out_prompt_loss: true

    train_steps: 1000
    valid_steps: 50

    dropout_steps: [0]
    dropout: [0.0]
    attention_dropout: [0.0]
    bucket_size: 10
    num_workers: 1
    batch_type: "sents"
    batch_size: 1
    valid_batch_size: 1
    batch_size_multiple: 1

    compute_dtype: fp16
    apex_opt_level: ""
    optim: "fusedadam"
    learning_rate: 2e-05
    warmup_steps: 100
    decay_method: "none"

    adam_beta2: 0.998
    accum_count: [8, 16]
    accum_steps: [0, 100]
    max_grad_norm: 0
    label_smoothing: 0.0
    param_init: 0
    param_init_glorot: true
    normalization: "tokens"

    train_from: "${EOLE_MODEL_DIR}/llama3-8b-instruct"
    model_path: "/nas-labs/LM/randy_LLM_exp/domain_classification_eole/llama3_eole/llama3-8b-instruct-parallel-eole-issue"
    keep_checkpoint: 30

    save_checkpoint_steps: 200
    
    quant_layers: ['gate_up_proj', 'down_proj', 'up_proj'] 
    quant_type: "bnb_NF4"

    lora_layers: ['linear_values', 'linear_query', 'linear_keys', 'final_linear']
    lora_rank: 8 #5 #2
    lora_dropout: 0.05
    lora_alpha: 32
    lora_embedding: false
francoishernandez (Contributor) commented:

Hey Randy,
I think some variation is expected, as there may be slight differences in some numerical operations, which can build up along the way.
It might be interesting to investigate at various steps at the beginning of training to see which operations are most impactful. It could also be worth checking whether significant differences in output values occur at inference as well.
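For what it's worth, one way to do that comparison offline is to diff the tensors the two runs save at the same step. Below is a minimal sketch, assuming the `step_N/model.00.safetensors` checkpoint layout shown later in this thread and the `model_path` values from the configs above; it is not an Eole API, just a direct use of the safetensors library.

```python
# Minimal sketch: compare tensors saved by the two runs at the same step.
# Paths and the step_N directory layout are assumptions, not Eole's API.
from safetensors.torch import load_file

def compare_checkpoints(path_a, path_b, atol=1e-4):
    """Print tensors whose values or shapes differ between two safetensors files."""
    a, b = load_file(path_a), load_file(path_b)
    for name in sorted(set(a) & set(b)):
        ta, tb = a[name].float(), b[name].float()
        if ta.shape != tb.shape:
            print(f"{name}: shape {tuple(ta.shape)} vs {tuple(tb.shape)}")
        else:
            diff = (ta - tb).abs().max().item()
            if diff > atol:
                print(f"{name}: max abs diff {diff:.6f}")

base_1gpu = "/nas-labs/LM/randy_LLM_exp/domain_classification_eole/llama3_eole/llama3-8b-instruct-1gpu-eole-issue"
base_tp = "/nas-labs/LM/randy_LLM_exp/domain_classification_eole/llama3_eole/llama3-8b-instruct-parallel-eole-issue"
compare_checkpoints(f"{base_1gpu}/step_200/model.00.safetensors",
                    f"{base_tp}/step_200/model.00.safetensors")
```

Running this on successive checkpoints should show whether the divergence grows gradually or appears at a specific step.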

francoishernandez added the question label on Aug 23, 2024
vince62s (Contributor) commented:

Probably unrelated to your issue, but bear in mind that Llama 3 uses RoPE scaling, which is not implemented in Eole yet.

randy-ac (Author) commented:

Hello both,

thanks for your replies. I will check your suggestions as soon as possible and will keep you posted.

l-k-11235 (Contributor) commented Sep 2, 2024

Hello,
Thanks for your answers.
@randy-ac will run and compare the single- and parallel-GPU modes in a longer setting, without quantization or dropout, to avoid “spurious” differences.

randy-ac (Author) commented Sep 5, 2024

Hello everyone,
I'm finally back to this topic with the results of some experiments we carried out.
We trained two models for 1000 steps: one on 2 GPUs in tensor_parallel mode and one on a single GPU. The task is domain classification. The two models had exactly the same configs; we removed quantization and dropout to avoid introducing other variables into the experiment. Please see the attached configs.
We still find that the two models diverge in validation accuracy, in output values for the same checkpoint, in the LM decoder forward pass, and in checkpoint sizes. In general, the model trained with tensor_parallel seems to achieve worse performance.

Tensorboard logs
[Tensorboard screenshot: validation curves for the two runs; the red curve is the tensor_parallel run]

Output values
We tested checkpoint 400 of both models. There is a ~18% difference between the two accuracy values (i.e. whether the predicted label matches the gold label).
Please find attached a tsv with the outputs of each model.
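For reference, this is roughly how the exact-match accuracy can be recomputed from the attached outputs; the file names and column names below are hypothetical, since the attachment's layout isn't shown in the thread.

```python
# Hypothetical file and column names -- adjust to the actual TSV layout.
import csv

def exact_match_accuracy(path, pred_col="prediction", gold_col="gold"):
    """Fraction of rows where the predicted label equals the gold label."""
    total = correct = 0
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            total += 1
            correct += row[pred_col].strip() == row[gold_col].strip()
    return correct / total if total else 0.0

print("1 GPU:          ", exact_match_accuracy("output_1gpu.tsv"))
print("tensor_parallel:", exact_match_accuracy("output_parallel.tsv"))
```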

Different values in decoder forward
For each layer, we printed out the norm of the layer input and of the attention output. Some differences seem to start building up at layer 3 in the first step (i.e. before the first backward pass). A sketch of how such per-layer statistics can be collected follows the numbers below:

1 GPU
Layer 3: Layer_in norm 389.75; norm_layer_in Euclidean distance to zero 691.5; attn_output Euclidean distance to zero 18.78125
Layer 4: Layer_in norm 391.25; norm_layer_in Euclidean distance to zero 620.0

Tensor parallel
Layer 3: Layer_in norm 389.75; norm_layer_in Euclidean distance to zero 691.5; attn_output Euclidean distance to zero 18.796875
Layer 4: Layer_in norm 391.5; norm_layer_in Euclidean distance to zero 620.0; attn_output Euclidean distance to zero 24.359375
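
For context, here is a minimal sketch of how such per-layer norms can be collected with PyTorch forward hooks; the `decoder_layers` iterable and the `self_attn` attribute name are assumptions about the model structure, not Eole's actual attribute names.

```python
# Assumed module names (decoder_layers, self_attn); not Eole's actual internals.
def attach_norm_hooks(decoder_layers):
    """Print the L2 norm of each layer's input and of its attention output."""
    handles = []
    for i, layer in enumerate(decoder_layers):
        def layer_hook(module, inputs, output, idx=i):
            print(f"Layer {idx} layer_in norm: {inputs[0].float().norm().item():.4f}")
        handles.append(layer.register_forward_hook(layer_hook))
        attn = getattr(layer, "self_attn", None)  # hypothetical attribute name
        if attn is not None:
            def attn_hook(module, inputs, output, idx=i):
                out = output[0] if isinstance(output, tuple) else output
                print(f"Layer {idx} attn_output norm: {out.float().norm().item():.4f}")
            handles.append(attn.register_forward_hook(attn_hook))
    return handles  # call h.remove() on each handle when done
```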

Checkpoint size
The sizes (KB) of the 400th checkpoint for the parallel_mode model are:
5 llama3-8b-instruct-parallel-eole-test-long/step_400/config.json
15700311 llama3-8b-instruct-parallel-eole-test-long/step_400/merged
15397 llama3-8b-instruct-parallel-eole-test-long/step_400/model.00.safetensors
64753 llama3-8b-instruct-parallel-eole-test-long/step_400/optimizer.pt
2069 llama3-8b-instruct-parallel-eole-test-long/step_400/vocab.json

The sizes (KB) of the 400th checkpoint for the 1 gpu model are:
5 llama3-8b-instruct-1gpu-eole-test-long/step_400/config.json
15700324 llama3-8b-instruct-1gpu-eole-test-long/step_400/merged
13349 llama3-8b-instruct-1gpu-eole-test-long/step_400/model.00.safetensors
80137 llama3-8b-instruct-1gpu-eole-test-long/step_400/optimizer.pt
2069 llama3-8b-instruct-1gpu-eole-test-long/step_400/vocab.json
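
One way to account for the size difference between the two model.00.safetensors files is to list the tensor names, shapes, and byte sizes stored in each. The sketch below uses the safetensors library directly and assumes the checkpoint paths from the listings above.

```python
# Sketch assuming the checkpoint paths listed above; uses the safetensors API.
from safetensors import safe_open

def summarize(path):
    """Map tensor name -> (shape, size in bytes) for a safetensors file."""
    out = {}
    with safe_open(path, framework="pt", device="cpu") as f:
        for name in f.keys():
            t = f.get_tensor(name)
            out[name] = (tuple(t.shape), t.numel() * t.element_size())
    return out

single = summarize("llama3-8b-instruct-1gpu-eole-test-long/step_400/model.00.safetensors")
parallel = summarize("llama3-8b-instruct-parallel-eole-test-long/step_400/model.00.safetensors")

for name in sorted(set(single) | set(parallel)):
    if single.get(name) != parallel.get(name):
        print(f"{name}: 1gpu={single.get(name)} tensor_parallel={parallel.get(name)}")
```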

Could you please advise? Thanks!

Attachments:
output.csv
tensor_parallel_model_configs.json
1gpu_model_configs.json
