
LongVILA: Scaling Long-Context Visual Language Models for Long Videos


💡 Introduction

Long-context capability is critical for multi-modal foundation models. We introduce LongVILA, a full-stack solution for long-context vision-language models spanning system, model training, and dataset development. On the system side, we introduce the first long-context Multi-Modal Sequence Parallelism (MM-SP) system, which enables long-context training and inference and supports 2M context length training on 256 GPUs. MM-SP is also efficient: it is 2.1x - 5.7x faster than Ring-Style Sequence Parallelism and 1.1x - 1.4x faster than Megatron-LM in text-only settings, and it integrates seamlessly with Hugging Face Transformers. For model training, we propose a five-stage pipeline comprising alignment, pre-training, short supervised fine-tuning, context extension, and long supervised fine-tuning. For datasets, we meticulously construct large-scale visual-language pre-training datasets and long-video instruction-following datasets to support this multi-stage training process. The full-stack solution extends the feasible frame count of VILA by a factor of 128 (from 8 to 1024 frames), improves the long-video captioning score from 2.00 to 3.26 (1.6x), and achieves 99.5% accuracy on the needle-in-a-haystack test with 1400-frame videos (274k context length). LongVILA-8B also shows consistent accuracy improvements on long videos in the VideoMME benchmark as the number of frames increases.

Installation

./environment_setup.sh vila

Evaluations

Please refer to scripts/v1_5/eval/needle.sh, scripts/v1_5/eval/video_chatgpt/run_vila_benchmark.sh, and llava/eval/video_mme/eval.sh for the needle-in-a-haystack, LongVILA-Caption, and Video MME evaluations, respectively.
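
These scripts can be launched from the repository root as sketched below; the additional arguments each script expects (e.g., a model checkpoint path or number of frames) are not shown here and depend on your setup.

```bash
# Hedged sketch: each script may require additional arguments (e.g., a model
# checkpoint path); check the script contents before running.
bash scripts/v1_5/eval/needle.sh                            # needle-in-a-haystack
bash scripts/v1_5/eval/video_chatgpt/run_vila_benchmark.sh  # LongVILA-Caption
bash llava/eval/video_mme/eval.sh                           # Video MME
```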

Note

💡 Sequence Parallelism Configuration

To enable sequence parallelism, you can set the following parameters in the training script:

  • seq_parallel_size: The degree of sequence parallelism (SP). SP is disabled by default (value: -1).

  • seq_parallel_ring_size: The size of the communication process group that uses the optimized Ring Attention approach within SP. Ring Attention is disabled by default.

  • seq_parallel_ring_type: The Ring Attention implementation; 'ring_varlen' and 'zigzag_ring_varlen' are supported in 2D attention. This option only takes effect when seq_parallel_ring_size > 1.

Please note that when SP is enabled, each group of seq_parallel_size GPUs is treated as a single device, and the global batch size is calculated as the product of the per-device batch size and the data-parallelism size.
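
For illustration, a minimal sketch of how these flags might appear in a training launch is shown below. The launcher, training script path, and GPU count are assumptions for the example; only the three sequence-parallelism flags follow the description above.

```bash
# Illustrative sketch only: torchrun, the script path, and the GPU count are
# assumed for this example; adapt them to your actual training script.
torchrun --nproc_per_node=8 llava/train/train_mem.py \
    --per_device_train_batch_size 1 \
    --seq_parallel_size 4 \
    --seq_parallel_ring_size 2 \
    --seq_parallel_ring_type zigzag_ring_varlen
```

With 8 GPUs and seq_parallel_size set to 4, the data-parallelism size is 8 / 4 = 2, so the global batch size in this sketch is 1 (per-device batch size) × 2 (DP size) = 2.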

🔒 License

  • The code is released under the Apache 2.0 license as found in the LICENSE file.
  • The pretrained weights are released under the CC-BY-NC-SA-4.0 license.
  • The service is a research preview intended for non-commercial use only, and is subject to the following licenses and terms:

Citations

@article{longvila,
      title={LongVILA: Scaling Long-Context Visual Language Models for Long Videos},
      author={Fuzhao Xue and Yukang Chen and Dacheng Li and Qinghao Hu and Ligeng Zhu and Xiuyu Li and Yunhao Fang and Haotian Tang and Shang Yang and Zhijian Liu and Yihui He and Hongxu Yin and Pavlo Molchanov and Jan Kautz and Linxi Fan and Yuke Zhu and Yao Lu and Song Han},
      year={2024},
      eprint={2408.10188},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Acknowledgement