Commit
Merge branch 'main' into release/2.3
Jintao-Huang committed Aug 24, 2024
2 parents 42e476a + e73c5e2 commit e2bba6e
Showing 14 changed files with 87 additions and 25 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -56,7 +56,7 @@ You can contact us and communicate with us by adding our group:

## 🎉 News
- 🔥2024.08.22: Support the `reft` tuner from [ReFT](https://github.com/stanfordnlp/pyreft), which can match or outperform LoRA with 15×–65× fewer parameters. Use `--sft_type reft` to get started!
- 2024.08.21: Support for phi3_5-mini-instruct, phi3_5-moe-instruct, and phi3_5-vision-instruct.
- 🔥2024.08.21: Support for phi3_5-mini-instruct, phi3_5-moe-instruct, and phi3_5-vision-instruct. Best practices for fine-tuning LaTeX OCR with phi3_5-vision-instruct can be found [here](https://github.com/modelscope/ms-swift/issues/1809).
- 2024.08.21: Support for idefics3-8b-llama3, llava-onevision-qwen2-0_5b-ov, llava-onevision-qwen2-7b-ov, and llava-onevision-qwen2-72b-ov.
- 🔥2024.08.20: Support fine-tuning of multimodal large models using DeepSpeed-Zero3.
- 2024.08.20: Supported models: longwriter-glm4-9b, longwriter-llama3_1-8b. Supported dataset: longwriter-6k.
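For readers who prefer SWIFT's Python API over the CLI flag, here is a minimal sketch of what the ReFT entry above maps to; the model type and dataset below are placeholders rather than values from this commit, and only `sft_type='reft'` comes from the news entry:

```python
# Hedged sketch: model_type and dataset are illustrative placeholders.
from swift.llm import SftArguments, sft_main

output = sft_main(
    SftArguments(
        model_type='phi3_5-vision-instruct',  # placeholder choice
        dataset='latex-ocr-print#1000',       # placeholder dataset subset
        sft_type='reft',                      # the new ReFT tuner
    ))
print(output['last_model_checkpoint'])
```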
2 changes: 1 addition & 1 deletion README_CN.md
@@ -57,7 +57,7 @@ SWIFT has rich and comprehensive documentation; please check our documentation website:

## 🎉 News
- 🔥2024.08.22: Support [ReFT](https://github.com/stanfordnlp/pyreft); this tuner can match or outperform LoRA with only 1/15 to 1/65 of LoRA's parameters. Use `--sft_type reft` to start training!
- 2024.08.21: Support for phi3_5-mini-instruct, phi3_5-moe-instruct, and phi3_5-vision-instruct.
- 🔥2024.08.21: Support for phi3_5-mini-instruct, phi3_5-moe-instruct, and phi3_5-vision-instruct. Best practices for fine-tuning LaTeX OCR with phi3_5-vision-instruct can be found [here](https://github.com/modelscope/ms-swift/issues/1809).
- 2024.08.21: Support for idefics3-8b-llama3, llava-onevision-qwen2-0_5b-ov, llava-onevision-qwen2-7b-ov, and llava-onevision-qwen2-72b-ov.
- 🔥2024.08.20: Support fine-tuning multimodal large models with deepspeed-zero3.
- 2024.08.20: Supported models: longwriter-glm4-9b, longwriter-llama3_1-8b. Supported dataset: longwriter-6k.
2 changes: 1 addition & 1 deletion docs/source/LLM/index.md
@@ -1,6 +1,6 @@
## LLM Documentation

[English Documentation](https://swift.readthedocs.io/en/latest/)
[English Documentation](https://swift.readthedocs.io/en/latest/LLM/index.html)

### 📚 Tutorials

4 changes: 3 additions & 1 deletion docs/source/LLM/支持的模型和数据集.md
@@ -510,9 +510,11 @@
|coco-en-2|[modelscope/coco_2014_caption](https://modelscope.cn/datasets/modelscope/coco_2014_caption/summary)|coco_2014_caption|454617|36.8±2.8, min=32, max=89|chat, multi-modal, vision|-|
|🔥coco-en-2-mini|[modelscope/coco_2014_caption](https://modelscope.cn/datasets/modelscope/coco_2014_caption/summary)|coco_2014_caption|40504|36.8±2.6, min=32, max=75|chat, multi-modal, vision|-|
|capcha-images|[AI-ModelScope/captcha-images](https://modelscope.cn/datasets/AI-ModelScope/captcha-images/summary)||8000|31.0±0.0, min=31, max=31|chat, multi-modal, vision|-|
|latex-ocr-print|[AI-ModelScope/LaTeX_OCR](https://modelscope.cn/datasets/AI-ModelScope/LaTeX_OCR/summary)|full|17918|362.7±34.8, min=294, max=528|chat, ocr, multi-modal, vision|[linxy/LaTeX_OCR](https://huggingface.co/datasets/linxy/LaTeX_OCR)|
|latex-ocr-handwrite|[AI-ModelScope/LaTeX_OCR](https://modelscope.cn/datasets/AI-ModelScope/LaTeX_OCR/summary)|synthetic_handwrite|95424|375.1±59.4, min=292, max=2115|chat, ocr, multi-modal, vision|[linxy/LaTeX_OCR](https://huggingface.co/datasets/linxy/LaTeX_OCR)|
|aishell1-zh|[speech_asr/speech_asr_aishell1_trainsets](https://modelscope.cn/datasets/speech_asr/speech_asr_aishell1_trainsets/summary)||141600|152.2±36.8, min=63, max=419|chat, multi-modal, audio|-|
|🔥aishell1-zh-mini|[speech_asr/speech_asr_aishell1_trainsets](https://modelscope.cn/datasets/speech_asr/speech_asr_aishell1_trainsets/summary)||14526|152.2±35.6, min=74, max=359|chat, multi-modal, audio|-|
|🔥video-chatgpt|[swift/VideoChatGPT](https://modelscope.cn/datasets/swift/VideoChatGPT/summary)|Generic<br>Temporal<br>Consistency|3206|88.4±48.3, min=32, max=399|chat, multi-modal, video|-|
|🔥video-chatgpt|[swift/VideoChatGPT](https://modelscope.cn/datasets/swift/VideoChatGPT/summary)|Generic<br>Temporal<br>Consistency|3206|88.4±48.3, min=32, max=399|chat, multi-modal, video|[lmms-lab/VideoChatGPT](https://huggingface.co/datasets/lmms-lab/VideoChatGPT)|
|hh-rlhf|[AI-ModelScope/hh-rlhf](https://modelscope.cn/datasets/AI-ModelScope/hh-rlhf/summary)|harmless-base<br>helpful-base<br>helpful-online<br>helpful-rejection-sampled|127459|245.4±190.7, min=22, max=1999|rlhf, dpo, pairwise|-|
|🔥hh-rlhf-cn|[AI-ModelScope/hh_rlhf_cn](https://modelscope.cn/datasets/AI-ModelScope/hh_rlhf_cn/summary)|hh_rlhf<br>harmless_base_cn<br>harmless_base_en<br>helpful_base_cn<br>helpful_base_en|355920|171.2±122.7, min=22, max=3078|rlhf, dpo, pairwise|-|
|orpo-dpo-mix-40k|[AI-ModelScope/orpo-dpo-mix-40k](https://modelscope.cn/datasets/AI-ModelScope/orpo-dpo-mix-40k/summary)|default|43666|548.3±397.4, min=28, max=8483|dpo, orpo, en, quality|[mlabonne/orpo-dpo-mix-40k](https://huggingface.co/datasets/mlabonne/orpo-dpo-mix-40k)|
2 changes: 1 addition & 1 deletion docs/source/Multi-Modal/index.md
@@ -16,7 +16,7 @@
4. [InternVL Series Best Practice](internvl最佳实践.md)
5. [Deepseek-VL Best Practice](deepseek-vl最佳实践.md)
6. [Internlm2-Xcomposers Best Practice](internlm-xcomposer2最佳实践.md)
7. [Phi3-Vision Best Practice](phi3-vision最佳实践.md)
7. [Phi3-Vision Best Practice](phi3-vision最佳实践.md), [Phi3.5-Vision Best Practice](https://github.com/modelscope/ms-swift/issues/1809).


A single round of dialogue can contain only one image (it may also contain no image):
4 changes: 3 additions & 1 deletion docs/source_en/LLM/Supported-models-datasets.md
@@ -510,9 +510,11 @@ The table below introduces the datasets supported by SWIFT:
|coco-en-2|[modelscope/coco_2014_caption](https://modelscope.cn/datasets/modelscope/coco_2014_caption/summary)|coco_2014_caption|454617|36.8±2.8, min=32, max=89|chat, multi-modal, vision|-|
|🔥coco-en-2-mini|[modelscope/coco_2014_caption](https://modelscope.cn/datasets/modelscope/coco_2014_caption/summary)|coco_2014_caption|40504|36.8±2.6, min=32, max=75|chat, multi-modal, vision|-|
|capcha-images|[AI-ModelScope/captcha-images](https://modelscope.cn/datasets/AI-ModelScope/captcha-images/summary)||8000|31.0±0.0, min=31, max=31|chat, multi-modal, vision|-|
|latex-ocr-print|[AI-ModelScope/LaTeX_OCR](https://modelscope.cn/datasets/AI-ModelScope/LaTeX_OCR/summary)|full|17918|362.7±34.8, min=294, max=528|chat, ocr, multi-modal, vision|[linxy/LaTeX_OCR](https://huggingface.co/datasets/linxy/LaTeX_OCR)|
|latex-ocr-handwrite|[AI-ModelScope/LaTeX_OCR](https://modelscope.cn/datasets/AI-ModelScope/LaTeX_OCR/summary)|synthetic_handwrite|95424|375.1±59.4, min=292, max=2115|chat, ocr, multi-modal, vision|[linxy/LaTeX_OCR](https://huggingface.co/datasets/linxy/LaTeX_OCR)|
|aishell1-zh|[speech_asr/speech_asr_aishell1_trainsets](https://modelscope.cn/datasets/speech_asr/speech_asr_aishell1_trainsets/summary)||141600|152.2±36.8, min=63, max=419|chat, multi-modal, audio|-|
|🔥aishell1-zh-mini|[speech_asr/speech_asr_aishell1_trainsets](https://modelscope.cn/datasets/speech_asr/speech_asr_aishell1_trainsets/summary)||14526|152.2±35.6, min=74, max=359|chat, multi-modal, audio|-|
|🔥video-chatgpt|[swift/VideoChatGPT](https://modelscope.cn/datasets/swift/VideoChatGPT/summary)|Generic<br>Temporal<br>Consistency|3206|88.4±48.3, min=32, max=399|chat, multi-modal, video|-|
|🔥video-chatgpt|[swift/VideoChatGPT](https://modelscope.cn/datasets/swift/VideoChatGPT/summary)|Generic<br>Temporal<br>Consistency|3206|88.4±48.3, min=32, max=399|chat, multi-modal, video|[lmms-lab/VideoChatGPT](https://huggingface.co/datasets/lmms-lab/VideoChatGPT)|
|hh-rlhf|[AI-ModelScope/hh-rlhf](https://modelscope.cn/datasets/AI-ModelScope/hh-rlhf/summary)|harmless-base<br>helpful-base<br>helpful-online<br>helpful-rejection-sampled|127459|245.4±190.7, min=22, max=1999|rlhf, dpo, pairwise|-|
|🔥hh-rlhf-cn|[AI-ModelScope/hh_rlhf_cn](https://modelscope.cn/datasets/AI-ModelScope/hh_rlhf_cn/summary)|hh_rlhf<br>harmless_base_cn<br>harmless_base_en<br>helpful_base_cn<br>helpful_base_en|355920|171.2±122.7, min=22, max=3078|rlhf, dpo, pairwise|-|
|orpo-dpo-mix-40k|[AI-ModelScope/orpo-dpo-mix-40k](https://modelscope.cn/datasets/AI-ModelScope/orpo-dpo-mix-40k/summary)|default|43666|548.3±397.4, min=28, max=8483|dpo, orpo, en, quality|[mlabonne/orpo-dpo-mix-40k](https://huggingface.co/datasets/mlabonne/orpo-dpo-mix-40k)|
2 changes: 1 addition & 1 deletion docs/source_en/Multi-Modal/index.md
@@ -16,7 +16,7 @@ A single round of dialogue can contain multiple images (or no images):
4. [InternVL Series Best Practice](internvl-best-practice.md)
5. [Deepseek-VL Best Practice](deepseek-vl-best-practice.md)
6. [Internlm2-Xcomposers Best Practice](internlm-xcomposer2-best-practice.md)
7. [Phi3-Vision Best Practice](phi3-vision-best-practice.md)
7. [Phi3-Vision Best Practice](phi3-vision-best-practice.md), [Phi3.5-Vision Best Practice](https://github.com/modelscope/ms-swift/issues/1809).


A single round of dialogue can only contain one image:
6 changes: 4 additions & 2 deletions swift/llm/export.py
@@ -287,7 +287,8 @@ def llm_export(args: ExportArguments) -> None:
'Skipping the conversion process.')
else:
from swift.llm.megatron import MegatronArguments, convert_hf_to_megatron, patch_megatron
model, tokenizer = get_model_tokenizer(args.model_type, torch.float32, {'device_map': 'auto'})
model, tokenizer = get_model_tokenizer(
args.model_type, torch.float32, {'device_map': 'auto'}, model_id_or_path=args.model_id_or_path)
res = MegatronArguments.load_megatron_config(tokenizer.model_dir)
res['model_type'] = args.model_type
res['target_tensor_model_parallel_size'] = args.tp
@@ -311,7 +312,8 @@ def llm_export(args: ExportArguments) -> None:
'Skipping the conversion process.')
else:
from swift.llm.megatron import MegatronArguments, convert_megatron_to_hf, patch_megatron
hf_model, tokenizer = get_model_tokenizer(args.model_type, torch.float32, {'device_map': 'auto'})
hf_model, tokenizer = get_model_tokenizer(
args.model_type, torch.float32, {'device_map': 'auto'}, model_id_or_path=args.model_id_or_path)
res = MegatronArguments.load_megatron_config(tokenizer.model_dir)
res['model_type'] = args.model_type
res['target_tensor_model_parallel_size'] = args.tp
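For context, the corrected call can be exercised on its own; passing `model_id_or_path` explicitly is what lets the HF/Megatron conversion pick up a user-specified checkpoint instead of the default weights for the `model_type` (the model type and path below are illustrative):

```python
# Illustrative sketch of the fixed call; model_type and path are placeholders.
import torch

from swift.llm import get_model_tokenizer

model, tokenizer = get_model_tokenizer(
    'qwen2-7b-instruct',                           # placeholder model_type
    torch.float32,
    {'device_map': 'auto'},
    model_id_or_path='/path/to/local/checkpoint',  # now forwarded by export.py
)
```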
12 changes: 8 additions & 4 deletions swift/llm/utils/argument.py
@@ -80,16 +80,14 @@ def _check_path(cls,
value = res
return value

@staticmethod
def _is_multimodal(model_type: Optional[str] = None) -> bool:
def _is_multimodal(self, model_type: Optional[str] = None) -> bool:
if model_type is None:
return False
model_info = MODEL_MAPPING[model_type]
tags = model_info.get('tags') or []
return 'multi-modal' in tags

@staticmethod
def _is_vision(model_type: Optional[str] = None) -> bool:
def _is_vision(self, model_type: Optional[str] = None) -> bool:
if model_type is None:
return False
model_info = MODEL_MAPPING[model_type]
@@ -1590,6 +1588,12 @@ def handle_infer_backend(self) -> None:
if self.eval_url is None:
super().handle_infer_backend()

def _is_multimodal(self, model_type: Optional[str] = None) -> bool:
return False

def _is_vision(self, model_type: Optional[str] = None) -> bool:
return False


@dataclass
class ExportArguments(InferArguments):
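The switch from `@staticmethod` to instance methods above is what allows the later hunk to override the checks per argument class; a self-contained sketch of the pattern (the registry stub and class names are illustrative, not SWIFT's actual definitions):

```python
from typing import Dict, Optional

# Stub standing in for SWIFT's MODEL_MAPPING registry (illustrative only).
MODEL_MAPPING: Dict[str, dict] = {
    'phi3_5-vision-instruct': {'tags': ['multi-modal', 'vision']},
    'llama3_1-8b-instruct': {'tags': []},
}


class ArgumentsBase:
    # Instance method (previously a staticmethod), so subclasses can override it.
    def _is_multimodal(self, model_type: Optional[str] = None) -> bool:
        if model_type is None:
            return False
        tags = MODEL_MAPPING[model_type].get('tags') or []
        return 'multi-modal' in tags


class EvalArgumentsSketch(ArgumentsBase):
    # Evaluation against a remote eval_url never loads the model locally,
    # so the multimodal special-casing is switched off unconditionally.
    def _is_multimodal(self, model_type: Optional[str] = None) -> bool:
        return False


assert ArgumentsBase()._is_multimodal('phi3_5-vision-instruct') is True
assert EvalArgumentsSketch()._is_multimodal('phi3_5-vision-instruct') is False
```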
3 changes: 2 additions & 1 deletion swift/llm/utils/client_utils.py
@@ -100,7 +100,8 @@ def _from_base64(img_base64: Union[str, 'PIL.Image.Image'], tmp_dir: str = 'tmp'
sha256_hash = hashlib.sha256(img_base64.encode('utf-8')).hexdigest()
img_path = os.path.join(tmp_dir, f'{sha256_hash}.png')
image = Image.open(BytesIO(base64.b64decode(img_base64)))
image.save(img_path)
if not os.path.exists(img_path):
image.save(img_path)
return img_path


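A self-contained sketch of the content-addressed caching that the new guard completes: the file name is derived from a hash of the base64 payload, so an image decoded once is never written to disk again (directory handling is simplified relative to the real helper):

```python
import base64
import hashlib
import os
from io import BytesIO

from PIL import Image


def cache_base64_image(img_base64: str, tmp_dir: str = 'tmp') -> str:
    """Decode a base64 image into tmp_dir, skipping the save if it already exists."""
    os.makedirs(tmp_dir, exist_ok=True)
    sha256_hash = hashlib.sha256(img_base64.encode('utf-8')).hexdigest()
    img_path = os.path.join(tmp_dir, f'{sha256_hash}.png')
    if not os.path.exists(img_path):  # the guard added in this commit
        Image.open(BytesIO(base64.b64decode(img_base64))).save(img_path)
    return img_path
```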
48 changes: 43 additions & 5 deletions swift/llm/utils/dataset.py
@@ -156,6 +156,8 @@ class DatasetName:
coco_en_2 = 'coco-en-2'
coco_en_2_mini = 'coco-en-2-mini'
capcha_images = 'capcha-images'
latex_ocr_print = 'latex-ocr-print'
latex_ocr_handwrite = 'latex-ocr-handwrite'
# for qwen-audio
aishell1_zh = 'aishell1-zh'
aishell1_zh_mini = 'aishell1-zh-mini'
@@ -747,7 +749,10 @@ def _process(d):
response = d[response_key]
return {'query': query * len(response), 'response': response, 'images': images}

return dataset.map(_process)
kwargs = {}
if not isinstance(dataset, HfIterableDataset):
kwargs['load_from_cache_file'] = dataset_enable_cache
return dataset.map(_process, **kwargs)


register_dataset(
@@ -861,6 +866,7 @@ def _process(d):
_preprocess_video_chatgpt,
get_dataset_from_repo,
split=['test'],
hf_dataset_id='lmms-lab/VideoChatGPT',
tags=['chat', 'multi-modal', 'video', '🔥'])


@@ -1784,7 +1790,7 @@ def preprocess_row(row):
query = row['question']
response = row['choices'][row['answer']]
solution = row['solution']
return {'query': query, 'response': f'{solution}\nSo the final answer is:{response}'}
return {'query': query, 'response': f'{solution}\nSo the final answer is: {response}'}

kwargs = {}
if not isinstance(dataset, HfIterableDataset):
@@ -2028,16 +2034,48 @@ def preprocess(row):
tags=['chat', 'general', 'multi-round'])


def _preprocess_latex_ocr_dataset(dataset: DATASET_TYPE) -> DATASET_TYPE:
from datasets import Image
prompt = 'Using LaTeX to perform OCR on the image.'

def _process(d):
return {'query': prompt, 'response': d['text']}

kwargs = {}
if not isinstance(dataset, HfIterableDataset):
kwargs['load_from_cache_file'] = dataset_enable_cache
return dataset.map(_process, **kwargs).rename_column('image', 'images')


register_dataset(
DatasetName.latex_ocr_print,
'AI-ModelScope/LaTeX_OCR',
['full'],
_preprocess_latex_ocr_dataset,
get_dataset_from_repo,
split=['validation', 'test'], # There are some problems in the training dataset.
hf_dataset_id='linxy/LaTeX_OCR',
tags=['chat', 'ocr', 'multi-modal', 'vision'])

register_dataset(
DatasetName.latex_ocr_handwrite,
'AI-ModelScope/LaTeX_OCR', ['synthetic_handwrite'],
_preprocess_latex_ocr_dataset,
get_dataset_from_repo,
split=['train', 'validation', 'test'],
hf_dataset_id='linxy/LaTeX_OCR',
tags=['chat', 'ocr', 'multi-modal', 'vision'])


def _preprocess_capcha_images(dataset: DATASET_TYPE) -> DATASET_TYPE:
from datasets import Image
query = 'recognize the content.'
image_key = 'image'
response_key = 'solution'

def _process(d):
return {'query': query * len(d[response_key]), 'response': d[response_key], 'images': [d[image_key]]}
return {'query': query * len(d[response_key]), 'response': d[response_key]}

return dataset.map(_process).cast_column('image', Image(decode=True))
return dataset.map(_process).rename_column('image', 'images')


register_dataset(
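Once registered, the two LaTeX-OCR datasets can be pulled through SWIFT's dataset API like any other entry. A hedged sketch (the `(train, val)` return shape of `get_dataset` is assumed from common usage and is not part of this diff):

```python
# Hedged sketch: assumes swift.llm.get_dataset returns (train, val) datasets.
from swift.llm import get_dataset

train_dataset, val_dataset = get_dataset(['latex-ocr-print'])
sample = train_dataset[0]
# Per _preprocess_latex_ocr_dataset above, each row carries the fixed prompt,
# the LaTeX transcription, and the column renamed from 'image' to 'images'.
print(sample['query'])    # 'Using LaTeX to perform OCR on the image.'
print(sample['response'])
```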
7 changes: 2 additions & 5 deletions swift/llm/utils/model.py
@@ -4287,7 +4287,7 @@ def _get_new_func(func_name: str):

@wraps(_old_func)
def _new_func(self, *args, **kwargs):
res = _old_func(self, *args, **kwargs)
res = _old_func(getattr(self, submodel_name), *args, **kwargs)
if func_name == 'forward':
device = find_device(args)
if device is None:
@@ -4298,12 +4298,9 @@ def _new_func(self, *args, **kwargs):
return _new_func

for key in func_list:
value = MethodType(_get_new_func(key), submodel)
setattr(model, key, value)
setattr(model, key, MethodType(_get_new_func(key), model))
if key == 'generate' and model.device != submodel.device:
submodel.__class__.device = model.device
if key == 'forward' and 'generate' in func_list:
setattr(submodel, key, value)


@register_model(
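A stripped-down sketch of the delegation pattern this hunk repairs: the wrapper now binds the patched method to itself and forwards each call to its submodel, instead of binding the method object to the submodel. Class and attribute names below are illustrative, not the real model classes:

```python
from functools import wraps
from types import MethodType


class SubModel:
    def forward(self, x):
        return x * 2


class Wrapper:
    def __init__(self):
        self.language_model = SubModel()


def use_submodel_func(model, submodel_name, func_list):
    submodel = getattr(model, submodel_name)

    def _get_new_func(func_name):
        _old_func = getattr(submodel.__class__, func_name)

        @wraps(_old_func)
        def _new_func(self, *args, **kwargs):
            # Dispatch on the wrapper, execute on the submodel (the fix above).
            return _old_func(getattr(self, submodel_name), *args, **kwargs)

        return _new_func

    for key in func_list:
        # Bind to the wrapper itself, mirroring MethodType(_get_new_func(key), model).
        setattr(model, key, MethodType(_get_new_func(key), model))


wrapper = Wrapper()
use_submodel_func(wrapper, 'language_model', ['forward'])
assert wrapper.forward(3) == 6
```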
2 changes: 2 additions & 0 deletions swift/llm/utils/utils.py
@@ -236,6 +236,8 @@ def __iter__(self):
sequences = []
for example in buffer:
input, _ = self.template.encode(example)
if not input:
continue
sequences.append((input, len(input['input_ids'])))

packed_sequences = self.calculate_matched_group(sequences)
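A tiny self-contained sketch of the guard added above: examples that the template fails to encode come back empty and are now skipped instead of breaking the packing pass (`toy_encode` stands in for `self.template.encode` and is not SWIFT's API):

```python
def pack_buffer(buffer, encode):
    """Collect (encoded, length) pairs, skipping rows that encode to nothing."""
    sequences = []
    for example in buffer:
        encoded = encode(example)
        if not encoded:  # the new guard: nothing usable came back
            continue
        sequences.append((encoded, len(encoded['input_ids'])))
    return sequences


def toy_encode(example):
    # Toy stand-in for template.encode: empty text yields an empty encoding.
    return {'input_ids': [1, 2, 3]} if example['text'] else {}


assert pack_buffer([{'text': 'hi'}, {'text': ''}], toy_encode) == [({'input_ids': [1, 2, 3]}, 3)]
```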
16 changes: 15 additions & 1 deletion tests/custom/test_main.py
@@ -20,6 +20,7 @@ def test_pt():


def test_vlm_sft():
# lora full
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1,2,3'
from swift.llm import sft_main, SftArguments, infer_main, InferArguments
model_type = 'phi3_5-vision-instruct'
@@ -45,9 +46,22 @@ def test_llm_sft():
InferArguments(ckpt_dir=last_model_checkpoint, load_dataset_config=True, merge_lora=True, infer_backend='pt'))


def test_vlm_dpo():
# lora, full, stream
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1,2,3'
from swift.llm import rlhf_main, RLHFArguments, infer_main, InferArguments
model_type = 'internvl2-2b'
dataset = 'rlaif-v#100'

output = rlhf_main(RLHFArguments(model_type=model_type, dataset=dataset, max_length=8192, sft_type='full'))
last_model_checkpoint = output['last_model_checkpoint']
infer_main(InferArguments(ckpt_dir=last_model_checkpoint, load_dataset_config=True))


if __name__ == '__main__':
# test_eval_llm()
# test_eval_vlm()
# test_pt()
test_vlm_sft()
# test_vlm_sft()
# test_llm_sft()
test_vlm_dpo()
