
Out of memory when fine-tuning IXC2.5 #399

Open
cool-xuan opened this issue Jul 24, 2024 · 4 comments

@cool-xuan

Following your fine-tuning instructions, my finetune.sh is as follows:

#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
DIR=`pwd`

export MODEL="./ckpt/internlm-xcomposer2d5-7b"
# export DATA="path of data"
export DATA="data.txt"

GPUS_PER_NODE=4
NNODES=1
NODE_RANK=0
MASTER_ADDR=localhost
MASTER_PORT=6001

DISTRIBUTED_ARGS="
    --nproc_per_node $GPUS_PER_NODE \
    --nnodes $NNODES \
    --node_rank $NODE_RANK \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT
"

torchrun $DISTRIBUTED_ARGS ./finetune/finetune.py \
    --model_name_or_path $MODEL \
    --data_path $DATA \
    --given_num True \
    --bf16 True \
    --fix_vit True \
    --fix_sampler False \
    --use_lora False \
    --hd_num 18 \
    --output_dir output/0724/paired_data_ft_fixVIT_bz16 \
    --num_train_epochs 1 \
    --batch_size 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "epoch" \
    --save_total_limit 1 \
    --learning_rate 1e-5 \
    --weight_decay 0.1 \
    --adam_beta2 0.95 \
    --warmup_ratio 0.01 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --report_to "none" \
    --max_length 16384 \
    --deepspeed ./finetune/ds_config_zero2.json \
    --gradient_checkpointing True

All batch sizes are set to 1 and only two images are encoded per conversation.
I run this script on 4 A100 GPUs (80 GB each) and hit out of memory in the first iteration.

All packages are the same as in your environment (docs/install.md), except torch=2.10 and cuda=12.1.

Any advice for this weird OOM?

@yuhangzang
Collaborator

You may

  • try LoRA
  • decrease the value of --max_length, e.g., --max_length=4096
  • decrease the value of --hd_num
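
Applied to the launch command from the original post, those suggestions might look like the sketch below. Only --use_lora, --max_length, and --hd_num change; hd_num=9 is just an illustrative lower value, not one given in this thread, and the script assumes the same MODEL, DATA, and DISTRIBUTED_ARGS variables defined in finetune.sh above.

# Same invocation as the original finetune.sh, with the three suggested
# memory-saving changes: --use_lora True, --max_length 4096, and a smaller
# --hd_num (9 here is only an example).
torchrun $DISTRIBUTED_ARGS ./finetune/finetune.py \
    --model_name_or_path $MODEL \
    --data_path $DATA \
    --given_num True \
    --bf16 True \
    --fix_vit True \
    --fix_sampler False \
    --use_lora True \
    --hd_num 9 \
    --output_dir output/0724/paired_data_ft_fixVIT_bz16 \
    --num_train_epochs 1 \
    --batch_size 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "epoch" \
    --save_total_limit 1 \
    --learning_rate 1e-5 \
    --weight_decay 0.1 \
    --adam_beta2 0.95 \
    --warmup_ratio 0.01 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --report_to "none" \
    --max_length 4096 \
    --deepspeed ./finetune/ds_config_zero2.json \
    --gradient_checkpointing True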

@cool-xuan
Author

cool-xuan commented Jul 24, 2024

Thanks for your reply. Fine-tuning with LoRA solves the OOM error. However, even after reducing max_length to 4096 and hd_num to 4, full-parameter tuning still runs out of memory.
Do you have any other advice for full-parameter tuning?

@FUJIsyu0515

@yuhangzang @cool-xuan The methods you provided are very useful for avoiding OOM at startup, and I have tried them. However, OOM now suddenly appears after running dozens of steps. I have no idea which parameter configuration is actually effective.

@yuhangzang
Collaborator

Do not forget to install flash-attention 2.

You may need 8 A100 80G GPUs for full-parameter tuning.

If you use LoRA fine-tuning, you can also decrease the values of lora_r and lora_alpha to avoid the OOM problem.
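
As a rough sketch, the combined suggestions could look like the following. Whether finetune.py exposes the LoRA rank and alpha as --lora_r / --lora_alpha command-line flags is an assumption based on the parameter names mentioned above, the values 16/32 are only illustrative, and output/lora_r16 is a placeholder path.

# FlashAttention 2 (the build needs a CUDA toolkit matching the installed torch)
pip install flash-attn --no-build-isolation

# Abbreviated LoRA launch showing only the memory-relevant flags;
# keep the other training flags from the original finetune.sh.
# --lora_r / --lora_alpha are assumed flag names (not confirmed in this thread).
torchrun $DISTRIBUTED_ARGS ./finetune/finetune.py \
    --model_name_or_path $MODEL \
    --data_path $DATA \
    --use_lora True \
    --lora_r 16 \
    --lora_alpha 32 \
    --max_length 4096 \
    --hd_num 9 \
    --bf16 True \
    --output_dir output/lora_r16 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --deepspeed ./finetune/ds_config_zero2.json \
    --gradient_checkpointing True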
