
Different perplexity when fine-tuning with parallel_model vs 1 gpu #86

Open · randy-ac opened this issue Aug 20, 2024 · 5 comments
Labels: question (Further information is requested)

randy-ac commented Aug 20, 2024

Hello,

We have noticed some unexpected behavior when fine-tuning a Llama 3 model on 1 GPU versus fine-tuning the same model on the same dataset with 2 GPUs in parallel mode. See the attached tensorboard graphs (red = run with parallel mode). The minimum validation perplexity differs between the two runs.

As you can see from the configs pasted below, the only parameters that differ between the two runs are world_size, gpu_ranks, and parallel_mode.

Could you please advise?

[Tensorboard screenshot: validation perplexity curves for the two runs (red = parallel mode)]

Configs for run with 1 GPU

# General settings
seed: 1234
share_vocab: true
save_data: "/nas-labs/LM/randy_LLM_exp/domain_classification_eole/llama3_eole/llama3-8b-instruct-1gpu-eole-issue"
src_vocab: "${EOLE_MODEL_DIR}/llama3-8b-instruct/vocab.txt" # size
src_vocab_size: 128256
tgt_vocab_size: 128256

overwrite: true

report_every: 10

n_sample: 0

tensorboard: true
tensorboard_log_dir: /nas-labs/LM/randy_LLM_exp/domain_classification_eole/llama3_eole/llama3-8b-instruct-1gpu-eole-issue/logs/


transforms: [insert_mask_before_placeholder, onmt_tokenize, filtertoolong]

transforms_configs:
    insert_mask_before_placeholder:
        response_patterns: ["<|start_header_id|>assistant<|end_header_id|>⦅newline⦆⦅newline⦆"]
    onmt_tokenize:
        src_subword_type: bpe
        src_subword_model: "${EOLE_MODEL_DIR}/llama3-8b-instruct/bpe.model"
        tgt_subword_type: bpe
        tgt_subword_model: "${EOLE_MODEL_DIR}/llama3-8b-instruct/bpe.model"
        gpt2_pretok: true
    filtertoolong:
        src_seq_length: 2048
        tgt_seq_length: 2048


data:
    new_synth_dataset:
        path_src: "/nas-labs/LM/randy_LLM_exp/new_synthetic_dataset/domain_subdomain_dataset_llama_instruct_no_hierarchy/synthetic-dataset-with-roles_train.shuffle"
        weight: 1 
    valid:
        path_src: "/nas-labs/LM/randy_LLM_exp/new_synthetic_dataset/domain_subdomain_dataset_llama_instruct_no_hierarchy/synthetic-dataset-with-roles_dev.shuffle"


skip_empty_level: silent # silently ignore empty lines in the data


training:
    world_size: 1
    gpu_ranks: [0]

    zero_out_prompt_loss: true

    train_steps: 1000
    valid_steps: 50

    dropout_steps: [0]
    dropout: [0.0]
    attention_dropout: [0.0]

    bucket_size: 10
    num_workers: 1
    batch_type: "sents"
    batch_size: 1
    valid_batch_size: 1
    batch_size_multiple: 1


    compute_dtype: fp16
    apex_opt_level: ""
    optim: "fusedadam"
    learning_rate: 2e-05
    warmup_steps: 100
    decay_method: "none"
    adam_beta2: 0.998
    accum_count: [8, 16]
    accum_steps: [0, 100]
    max_grad_norm: 0
    label_smoothing: 0.0
    param_init: 0
    param_init_glorot: true
    normalization: "tokens"

    train_from: "${EOLE_MODEL_DIR}/llama3-8b-instruct"
    model_path: "/nas-labs/LM/randy_LLM_exp/domain_classification_eole/llama3_eole/llama3-8b-instruct-1gpu-eole-issue"
    keep_checkpoint: 30
    save_checkpoint_steps: 200
    
    quant_layers: ['gate_up_proj', 'down_proj', 'up_proj'] 
    quant_type: "bnb_NF4"

    lora_layers: ['linear_values', 'linear_query', 'linear_keys', 'final_linear']
    lora_rank: 8 #5 #2
    lora_dropout: 0.05
    lora_alpha: 32
    lora_embedding: false

Configs for run with parallel_mode

seed: 1234
share_vocab: true
save_data: "/nas-labs/LM/randy_LLM_exp/domain_classification_eole/llama3_eole/llama3-8b-instruct-parallel-eole-issue"
src_vocab: "${EOLE_MODEL_DIR}/llama3-8b-instruct/vocab.txt" # size
src_vocab_size: 128256
tgt_vocab_size: 128256

overwrite: true

report_every: 10

n_sample: 0

tensorboard: true
tensorboard_log_dir: /nas-labs/LM/randy_LLM_exp/domain_classification_eole/llama3_eole/llama3-8b-instruct-parallel-eole-issue/logs/

transforms: [insert_mask_before_placeholder, onmt_tokenize, filtertoolong]

transforms_configs:
    insert_mask_before_placeholder:
        response_patterns: ["<|start_header_id|>assistant<|end_header_id|>⦅newline⦆⦅newline⦆"]
    onmt_tokenize:
        src_subword_type: bpe
        src_subword_model: "${EOLE_MODEL_DIR}/llama3-8b-instruct/bpe.model"
        tgt_subword_type: bpe
        tgt_subword_model: "${EOLE_MODEL_DIR}/llama3-8b-instruct/bpe.model"
        gpt2_pretok: true
    filtertoolong:
        src_seq_length: 2048
        tgt_seq_length: 2048

data:
    new_synth_dataset:
        path_src: "/nas-labs/LM/randy_LLM_exp/new_synthetic_dataset/domain_subdomain_dataset_llama_instruct_no_hierarchy/synthetic-dataset-with-roles_train.shuffle"
        weight: 1 
    valid:
        path_src: "/nas-labs/LM/randy_LLM_exp/new_synthetic_dataset/domain_subdomain_dataset_llama_instruct_no_hierarchy/synthetic-dataset-with-roles_dev.shuffle"


skip_empty_level: silent # silently ignore empty lines in the data


training:
    world_size: 2
    gpu_ranks: [0, 1]
    parallel_mode: "tensor_parallel"
    zero_out_prompt_loss: true

    train_steps: 1000
    valid_steps: 50

    dropout_steps: [0]
    dropout: [0.0]
    attention_dropout: [0.0]
    bucket_size: 10
    num_workers: 1
    batch_type: "sents"
    batch_size: 1
    valid_batch_size: 1
    batch_size_multiple: 1

    compute_dtype: fp16
    apex_opt_level: ""
    optim: "fusedadam"
    learning_rate: 2e-05
    warmup_steps: 100
    decay_method: "none"

    adam_beta2: 0.998
    accum_count: [8, 16]
    accum_steps: [0, 100]
    max_grad_norm: 0
    label_smoothing: 0.0
    param_init: 0
    param_init_glorot: true
    normalization: "tokens"

    train_from: "${EOLE_MODEL_DIR}/llama3-8b-instruct"
    model_path: "/nas-labs/LM/randy_LLM_exp/domain_classification_eole/llama3_eole/llama3-8b-instruct-parallel-eole-issue"
    keep_checkpoint: 30

    save_checkpoint_steps: 200
    
    quant_layers: ['gate_up_proj', 'down_proj', 'up_proj'] 
    quant_type: "bnb_NF4"

    lora_layers: ['linear_values', 'linear_query', 'linear_keys', 'final_linear']
    lora_rank: 8 #5 #2
    lora_dropout: 0.05
    lora_alpha: 32
    lora_embedding: false
francoishernandez (Contributor) commented:

Hey Randy,
I think some variation is expected, as there may be slight differences in some numerical operations, which can build up along the way.
It might be interesting to investigate at various steps at the beginning of training to see which operations are most impactful. It could also be worth checking whether significant differences in output values occur at inference as well.
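For what it's worth, one way to do that comparison offline is to diff the tensors the two runs save at the same step. Below is a minimal sketch, assuming the `step_N/model.00.safetensors` checkpoint layout shown later in this thread and the `model_path` values from the configs above; it is not an Eole API, just a direct use of the safetensors library.

```python
# Minimal sketch: compare tensors saved by the two runs at the same step.
# Paths and the step_N directory layout are assumptions, not Eole's API.
from safetensors.torch import load_file

def compare_checkpoints(path_a, path_b, atol=1e-4):
    """Print tensors whose values or shapes differ between two safetensors files."""
    a, b = load_file(path_a), load_file(path_b)
    for name in sorted(set(a) & set(b)):
        ta, tb = a[name].float(), b[name].float()
        if ta.shape != tb.shape:
            print(f"{name}: shape {tuple(ta.shape)} vs {tuple(tb.shape)}")
        else:
            diff = (ta - tb).abs().max().item()
            if diff > atol:
                print(f"{name}: max abs diff {diff:.6f}")

base_1gpu = "/nas-labs/LM/randy_LLM_exp/domain_classification_eole/llama3_eole/llama3-8b-instruct-1gpu-eole-issue"
base_tp = "/nas-labs/LM/randy_LLM_exp/domain_classification_eole/llama3_eole/llama3-8b-instruct-parallel-eole-issue"
compare_checkpoints(f"{base_1gpu}/step_200/model.00.safetensors",
                    f"{base_tp}/step_200/model.00.safetensors")
```

Running this on successive checkpoints should show whether the divergence grows gradually or appears at a specific step.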

francoishernandez added the question label on Aug 23, 2024
vince62s (Contributor) commented:

Probably unrelated to your issue, but bear in mind that Llama 3 uses RoPE scaling, which is not implemented in Eole yet.

randy-ac (Author) commented:

Hello both,

thanks for your replies. I will check your suggestions as soon as possible and will keep you posted.

l-k-11235 (Contributor) commented Sep 2, 2024

Hello,
Thanks for your answers.
@randy-ac will run and compare the single- and parallel-GPU modes in a longer setting, without quantization or dropout, to avoid “spurious” differences.

randy-ac (Author) commented Sep 5, 2024

Hello everyone,
I'm finally back to this topic with the results of some experiments we carried out.
We trained two models for 1000 steps: one on 2 GPUs in tensor_parallel mode and one on a single GPU. The task is domain classification. The two models had exactly the same configs; we removed quantization and dropout to avoid introducing other variables into the experiment. Please see the attached configs.
We still find that the two models diverge in validation accuracy, in output values for the same checkpoint, in the LM decoder forward pass, and in checkpoint sizes. In general, the model trained with tensor_parallel seems to achieve worse performance.

Tensorboard logs
[Tensorboard screenshot: validation curves for the two runs; the red curve is the tensor_parallel run]

Output values
We tested checkpoint 400 of both models. There is a ~18% difference between the two accuracy values (i.e. whether the predicted label matches the gold label).
Please find attached a tsv with the outputs of each model.
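For reference, this is roughly how the exact-match accuracy can be recomputed from the attached outputs; the file names and column names below are hypothetical, since the attachment's layout isn't shown in the thread.

```python
# Hypothetical file and column names -- adjust to the actual TSV layout.
import csv

def exact_match_accuracy(path, pred_col="prediction", gold_col="gold"):
    """Fraction of rows where the predicted label equals the gold label."""
    total = correct = 0
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            total += 1
            correct += row[pred_col].strip() == row[gold_col].strip()
    return correct / total if total else 0.0

print("1 GPU:          ", exact_match_accuracy("output_1gpu.tsv"))
print("tensor_parallel:", exact_match_accuracy("output_parallel.tsv"))
```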

Different values in decoder forward
For each layer, we printed out the norm of the layer input and of the attention output. Some differences seem to start building up at layer 3 in the first step (i.e. before the first backward pass). A sketch of how such per-layer statistics can be collected follows the numbers below:

1 GPU
Layer 3: Layer_in norm 389.75; norm_layer_in Euclidean distance to zero 691.5; attn_output Euclidean distance to zero 18.78125
Layer 4: Layer_in norm 391.25; norm_layer_in Euclidean distance to zero 620.0

Tensor parallel
Layer 3: Layer_in norm 389.75; norm_layer_in Euclidean distance to zero 691.5; attn_output Euclidean distance to zero 18.796875
Layer 4: Layer_in norm 391.5; norm_layer_in Euclidean distance to zero 620.0; attn_output Euclidean distance to zero 24.359375
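
For context, here is a minimal sketch of how such per-layer norms can be collected with PyTorch forward hooks; the `decoder_layers` iterable and the `self_attn` attribute name are assumptions about the model structure, not Eole's actual attribute names.

```python
# Assumed module names (decoder_layers, self_attn); not Eole's actual internals.
def attach_norm_hooks(decoder_layers):
    """Print the L2 norm of each layer's input and of its attention output."""
    handles = []
    for i, layer in enumerate(decoder_layers):
        def layer_hook(module, inputs, output, idx=i):
            print(f"Layer {idx} layer_in norm: {inputs[0].float().norm().item():.4f}")
        handles.append(layer.register_forward_hook(layer_hook))
        attn = getattr(layer, "self_attn", None)  # hypothetical attribute name
        if attn is not None:
            def attn_hook(module, inputs, output, idx=i):
                out = output[0] if isinstance(output, tuple) else output
                print(f"Layer {idx} attn_output norm: {out.float().norm().item():.4f}")
            handles.append(attn.register_forward_hook(attn_hook))
    return handles  # call h.remove() on each handle when done
```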

Checkpoint size
The sizes (KB) of the 400th checkpoint for the parallel_mode model are:
5 llama3-8b-instruct-parallel-eole-test-long/step_400/config.json
15700311 llama3-8b-instruct-parallel-eole-test-long/step_400/merged
15397 llama3-8b-instruct-parallel-eole-test-long/step_400/model.00.safetensors
64753 llama3-8b-instruct-parallel-eole-test-long/step_400/optimizer.pt
2069 llama3-8b-instruct-parallel-eole-test-long/step_400/vocab.json

The sizes (KB) of the 400th checkpoint for the 1 gpu model are:
5 llama3-8b-instruct-1gpu-eole-test-long/step_400/config.json
15700324 llama3-8b-instruct-1gpu-eole-test-long/step_400/merged
13349 llama3-8b-instruct-1gpu-eole-test-long/step_400/model.00.safetensors
80137 llama3-8b-instruct-1gpu-eole-test-long/step_400/optimizer.pt
2069 llama3-8b-instruct-1gpu-eole-test-long/step_400/vocab.json
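
One way to account for the size difference between the two model.00.safetensors files is to list the tensor names, shapes, and byte sizes stored in each. The sketch below uses the safetensors library directly and assumes the checkpoint paths from the listings above.

```python
# Sketch assuming the checkpoint paths listed above; uses the safetensors API.
from safetensors import safe_open

def summarize(path):
    """Map tensor name -> (shape, size in bytes) for a safetensors file."""
    out = {}
    with safe_open(path, framework="pt", device="cpu") as f:
        for name in f.keys():
            t = f.get_tensor(name)
            out[name] = (tuple(t.shape), t.numel() * t.element_size())
    return out

single = summarize("llama3-8b-instruct-1gpu-eole-test-long/step_400/model.00.safetensors")
parallel = summarize("llama3-8b-instruct-parallel-eole-test-long/step_400/model.00.safetensors")

for name in sorted(set(single) | set(parallel)):
    if single.get(name) != parallel.get(name):
        print(f"{name}: 1gpu={single.get(name)} tensor_parallel={parallel.get(name)}")
```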

Could you please advise? Thanks!

Attachments:
output.csv
tensor_parallel_model_configs.json
1gpu_model_configs.json
