
Awesome-Efficient-LLM

A curated list for Efficient Large Language Models

Full List

Please browse the full list by selecting the sub-area you're interested in. On this main page, we show only papers released in the past 90 days.

🚀 Updates

  • May 29, 2024: We've had this awesome list for a year now 🥰!
  • Sep 6, 2023: Added a new subdirectory project/ to organize efficient LLM projects.
  • July 11, 2023: Created a new subdirectory efficient_plm/ to house papers on efficient pre-trained language models (PLMs).

💮 Contributing

If you'd like to add your paper, or need to update any details such as conference information or code URLs, please feel free to submit a pull request. To generate the required markdown for a paper, fill in its information in `generate_item.py` and run `python generate_item.py`. We warmly appreciate your contributions to this list. Alternatively, you can email me the links to your paper and code, and I will add your paper to the list at my earliest convenience.
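
For orientation, here is a minimal sketch of what such a helper might look like. This is a hypothetical illustration only: the field names (`title`, `authors`, `paper_url`, `github_url`) and the output layout are assumptions, not the actual interface of `generate_item.py`.

```python
# Hypothetical sketch of a generate_item-style helper; the real
# generate_item.py in this repository defines its own fields and layout.
def generate_item(title: str, authors: str, paper_url: str, github_url: str = "") -> str:
    """Render one paper entry as a markdown list item."""
    links = f"[Paper]({paper_url})"
    if github_url:
        links += f" [GitHub]({github_url})"
    return f"- **{title}** by {authors}. {links}"

if __name__ == "__main__":
    # Example: regenerate the SparseGPT entry from the pruning section.
    print(generate_item(
        title="SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot",
        authors="Elias Frantar, Dan Alistarh",
        paper_url="https://arxiv.org/abs/2301.00774",
    ))
```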

⭐ Recommended Paper

For each topic, we have curated a list of recommended papers that have garnered relatively high GitHub stars or citations.

Papers from June 21, 2024 to now (see the full list, going back to May 22, 2023, here)

Quick Links

- Network Pruning / Sparsity
- Knowledge Distillation
- Quantization
- Inference Acceleration
- Efficient MOE
- Efficient Architecture of LLM
- KV Cache Compression
- Text Compression
- Low-Rank Decomposition
- Hardware/System
- Tuning
- Survey

Network Pruning / Sparsity

- **SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot** by Elias Frantar, Dan Alistarh. [GitHub] [Paper]
- **LLM-Pruner: On the Structural Pruning of Large Language Models** by Xinyin Ma, Gongfan Fang, Xinchao Wang. [GitHub] [Paper]
- **A Simple and Effective Pruning Approach for Large Language Models** by Mingjie Sun, Zhuang Liu, Anna Bair, J. Zico Kolter. [GitHub] [Paper]
- **Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning** by Mengzhou Xia, Tianyu Gao, Zhiyuan Zeng, Danqi Chen. [GitHub] [Paper]
- **KVPruner: Structural Pruning for Faster and Memory-Efficient Large Language Models** by Bo Lv, Quan Zhou, Xuanang Ding, Yan Wang, Zeming Ma. [Paper]
- **Evaluating the Impact of Compression Techniques on Task-Specific Performance of Large Language Models** by Bishwash Khanal, Jeffery M. Capone. [Paper]
- **STUN: Structured-Then-Unstructured Pruning for Scalable MoE Pruning** by Jaeseong Lee, seung-won hwang, Aurick Qiao, Daniel F Campos, Zhewei Yao, Yuxiong He. [Paper]
- **PAT: Pruning-Aware Tuning for Large Language Models** by Yijiang Liu, Huanrui Yang, Youxin Chen, Rongyu Zhang, Miao Wang, Yuan Du, Li Du. [GitHub] [Paper]
- **LLM Pruning and Distillation in Practice: The Minitron Approach** by Sharath Turuvekere Sreenivas, Saurav Muralidharan, Raviraj Joshi, Marcin Chochowski, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Jan Kautz, Pavlo Molchanov. [Paper]
- **Language-specific Calibration for Pruning Multilingual Language Models** by Simon Kurz, Zhixue Zhao, Jian-Jia Chen, Lucie Flek. [Paper]
- **LLM-Barber: Block-Aware Rebuilder for Sparsity Mask in One-Shot for Large Language Models** by Yupeng Su, Ziyi Guan, Xiaoqun Liu, Tianlai Jin, Dongkuan Wu, Graziano Chesi, Ngai Wong, Hao Yu. [GitHub] [Paper]
- **Enhancing One-shot Pruned Pre-trained Language Models through Sparse-Dense-Sparse Mechanism** by Guanchen Li, Xiandong Zhao, Lian Liu, Zeping Li, Dong Li, Lu Tian, Jie He, Ashish Sirasao, Emad Barsoum. [Paper]
- **A Convex-optimization-based Layer-wise Post-training Pruner for Large Language Models** by Pengxiang Zhao, Hanyu Hu, Ping Li, Yi Zheng, Zhefeng Wang, Xiaoming Yuan. [Paper]
- **Pruning Large Language Models with Semi-Structural Adaptive Sparse Training** by Weiyu Huang, Guohao Jian, Yuezhou Hu, Jun Zhu, Jianfei Chen. [Paper]
- **Greedy Output Approximation: Towards Efficient Structured Pruning for LLMs Without Retraining** by Jianwei Li, Yijun Dong, Qi Lei. [Paper]
- **Compact Language Models via Pruning and Knowledge Distillation** by Saurav Muralidharan, Sharath Turuvekere Sreenivas, Raviraj Joshi, Marcin Chochowski, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Jan Kautz, Pavlo Molchanov. [GitHub] [Paper]
- **MINI-LLM: Memory-Efficient Structured Pruning for Large Language Models** by Hongrong Cheng, Miao Zhang, Javen Qinfeng Shi. [Paper]
- **Reconstruct the Pruned Model without Any Retraining** by Pingjie Wang, Ziqing Fan, Shengchao Hu, Zhe Chen, Yanfeng Wang, Yu Wang. [Paper]
- **Q-Sparse: All Large Language Models can be Fully Sparsely-Activated** by Hongyu Wang, Shuming Ma, Ruiping Wang, Furu Wei. [Paper]
- **Pruning Large Language Models to Intra-module Low-rank Architecture with Transitional Activations** by Bowen Shen, Zheng Lin, Daren Zha, Wei Liu, Jian Luan, Bin Wang, Weiping Wang. [GitHub] [Paper]
- **Beyond Perplexity: Multi-dimensional Safety Evaluation of LLM Compression** by Zhichao Xu, Ashim Gupta, Tao Li, Oliver Bentham, Vivek Srikumar. [GitHub] [Paper]
- **Flextron: Many-in-One Flexible Large Language Model** by Ruisi Cai, Saurav Muralidharan, Greg Heinrich, Hongxu Yin, Zhangyang Wang, Jan Kautz, Pavlo Molchanov. [Paper]
- **BlockPruner: Fine-grained Pruning for Large Language Models** by Longguang Zhong, Fanqi Wan, Ruijun Chen, Xiaojun Quan, Liangzhi Li. [GitHub] [Paper]
- **Structured Pruning for Large Language Models Using Coupled Components Elimination and Minor Fine-tuning** by Honghe Zhang, Xiaolong Shi, Jingwei Sun, Guangzhong Sun. [Paper]
- **FoldGPT: Simple and Effective Large Language Model Compression Scheme** by Songwei Liu, Chao Zeng, Lianqiang Li, Chenqian Yan, Lean Fu, Xing Mei, Fangmin Chen. [Paper]
- **Learning Neural Networks with Sparse Activations** by Pranjal Awasthi, Nishanth Dikkala, Pritish Kamath, Raghu Meka. [Paper]
- **Rethinking Pruning Large Language Models: Benefits and Pitfalls of Reconstruction Error Minimization** by Sungbin Shin, Wonpyo Park, Jaeho Lee, Namhoon Lee. [Paper]
- **ShadowLLM: Predictor-based Contextual Sparsity for Large Language Models** by Yash Akhauri, Ahmed F AbouElhamayed, Jordan Dotzel, Zhiru Zhang, Alexander M Rush, Safeen Huda, Mohamed S Abdelfattah. [GitHub] [Paper]
- **Optimization-based Structural Pruning for Large Language Models without Back-Propagation** by Yuan Gao, Zujing Liu, Weizhong Zhang, Bo Du, Gui-Song Xia. [Paper]

Knowledge Distillation

- **Knowledge Distillation of Large Language Models** by Yuxian Gu, Li Dong, Furu Wei, Minlie Huang. [GitHub] [Paper]
- **LLMR: Knowledge Distillation with a Large Language Model-Induced Reward** by Dongheng Li, Yongchang Hao, Lili Mou. [GitHub] [Paper]
- **Exploring and Enhancing the Transfer of Distribution in Knowledge Distillation for Autoregressive Language Models** by Jun Rao, Xuebo Liu, Zepeng Lin, Liang Ding, Jing Li, Dacheng Tao. [Paper]
- **Efficient Knowledge Distillation: Empowering Small Language Models with Teacher Model Insights** by Mohamad Ballout, Ulf Krumnack, Gunther Heidemann, Kai-Uwe Kühnberger. [Paper]
- **The Mamba in the Llama: Distilling and Accelerating Hybrid Models** by Junxiong Wang, Daniele Paliotta, Avner May, Alexander M. Rush, Tri Dao. [GitHub] [Paper]
- **FIRST: Teach A Reliable Large Language Model Through Efficient Trustworthy Distillation** by KaShun Shum, Minrui Xu, Jianshu Zhang, Zixin Chen, Shizhe Diao, Hanze Dong, Jipeng Zhang, Muhammad Omer Raza. [Paper]
- **Interactive DualChecker for Mitigating Hallucinations in Distilling Large Language Models** by Meiyun Wang, Masahiro Suzuki, Hiroki Sakaji, Kiyoshi Izumi. [Paper]
- **Transformers to SSMs: Distilling Quadratic Knowledge to Subquadratic Models** by Aviv Bick, Kevin Y. Li, Eric P. Xing, J. Zico Kolter, Albert Gu. [Paper]
- **Concept Distillation from Strong to Weak Models via Hypotheses-to-Theories Prompting** by Emmanuel Aboah Boateng, Cassiano O. Becker, Nabiha Asghar, Kabir Walia, Ashwin Srinivasan, Ehi Nosakhare, Victor Dibia, Soundar Srinivasan. [Paper]
- **LaDiMo: Layer-wise Distillation Inspired MoEfier** by Sungyoon Kim, Youngjun Kim, Kihyo Moon, Minsung Jang. [Paper]
- **BOND: Aligning LLMs with Best-of-N Distillation** by Pier Giuseppe Sessa, Robert Dadashi, Léonard Hussenot, Johan Ferret, Nino Vieillard et al. [Paper]
- **Enhancing Data-Limited Graph Neural Networks by Actively Distilling Knowledge from Large Language Models** by Quan Li, Tianxiang Zhao, Lingwei Chen, Junjie Xu, Suhang Wang. [Paper]
- **DDK: Distilling Domain Knowledge for Efficient Large Language Models** by Jiaheng Liu, Chenchen Zhang, Jinyang Guo, Yuanxing Zhang, Haoran Que, Ken Deng, Zhiqi Bai, Jie Liu, Ge Zhang, Jiakai Wang, Yanan Wu, Congnan Liu, Wenbo Su, Jiamang Wang, Lin Qu, Bo Zheng. [Paper]
- **Key-Point-Driven Mathematical Reasoning Distillation of Large Language Model** by Xunyu Zhu, Jian Li, Yong Liu, Can Ma, Weiping Wang. [Paper]
- **Don't Throw Away Data: Better Sequence Knowledge Distillation** by Jun Wang, Eleftheria Briakou, Hamid Dadkhahi, Rishabh Agarwal, Colin Cherry, Trevor Cohn. [Paper]
- **Multi-Granularity Semantic Revision for Large Language Model Distillation** by Xiaoyu Liu, Yun Zhang, Wei Li, Simiao Li, Xudong Huang, Hanting Chen, Yehui Tang, Jie Hu, Zhiwei Xiong, Yunhe Wang. [Paper]
- **BiLD: Bi-directional Logits Difference Loss for Large Language Model Distillation** by Minchong Li, Feng Zhou, Xiaohui Song. [Paper]

Quantization

- **GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers** by Elias Frantar, Saleh Ashkboos, Torsten Hoefler, Dan Alistarh. [GitHub] [Paper]
- **SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models** by Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, Song Han. [GitHub] [Paper]
- **AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration** by Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Xingyu Dang, Song Han. [GitHub] [Paper]
- **OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models** by Wenqi Shao, Mengzhao Chen, Zhaoyang Zhang, Peng Xu, Lirui Zhao, Zhiqian Li, Kaipeng Zhang, Peng Gao, Yu Qiao, Ping Luo. [GitHub] [Paper]
- **SqueezeLLM: Dense-and-Sparse Quantization** by Sehoon Kim, Coleman Hooper, Amir Gholami, Zhen Dong, Xiuyu Li, Sheng Shen, Michael W. Mahoney, Kurt Keutzer. [GitHub] [Paper]
- **Extreme Compression of Large Language Models via Additive Quantization** by Vage Egiazarian, Andrei Panferov, Denis Kuznedelev, Elias Frantar, Artem Babenko, Dan Alistarh. [GitHub] [Paper]
- **A Comprehensive Evaluation of Quantized Instruction-Tuned Large Language Models: An Experimental Analysis up to 405B** by Jemin Lee, Sihyeong Park, Jinse Kwon, Jihun Oh, Yongin Kwon. [Paper]
- **The Uniqueness of LLaMA3-70B with Per-Channel Quantization: An Empirical Study** by Minghai Qin. [Paper]
- **Matmul or No Matmul in the Era of 1-bit LLMs** by Jinendra Malekar, Mohammed E. Elbtity, Ramtin Zand. [Paper]
- **MobileQuant: Mobile-friendly Quantization for On-device Language Models** by Fuwen Tan, Royson Lee, Łukasz Dudziak, Shell Xu Hu, Sourav Bhattacharya, Timothy Hospedales, Georgios Tzimiropoulos, Brais Martinez. [GitHub] [Paper]
- **ABQ-LLM: Arbitrary-Bit Quantized Inference Acceleration for Large Language Models** by Chao Zeng, Songwei Liu, Yusheng Xie, Hong Liu, Xiaojian Wang, Miao Wei, Shu Yang, Fangmin Chen, Xing Mei. [GitHub] [Paper]
- **STBLLM: Breaking the 1-Bit Barrier with Structured Binary LLMs** by Peijie Dong, Lujun Li, Dayou Du, Yuhan Chen, Zhenheng Tang, Qiang Wang, Wei Xue, Wenhan Luo, Qifeng Liu, Yike Guo, Xiaowen Chu. [Paper]
- **Accurate and Efficient Fine-Tuning of Quantized Large Language Models Through Optimal Balance** by Ao Shen, Qiang Wang, Zhiquan Lai, Xionglve Li, Dongsheng Li. [GitHub] [Paper]
- **Scalify: scale propagation for efficient low-precision LLM training** by Paul Balança, Sam Hosegood, Carlo Luschi, Andrew Fitzgibbon. [GitHub] [Paper]
- **EfficientQAT: Efficient Quantization-Aware Training for Large Language Models** by Mengzhao Chen, Wenqi Shao, Peng Xu, Jiahao Wang, Peng Gao, Kaipeng Zhang, Yu Qiao, Ping Luo. [GitHub] [Paper]
- **LRQ: Optimizing Post-Training Quantization for Large Language Models by Learning Low-Rank Weight-Scaling Matrices** by Jung Hyun Lee, Jeonghoon Kim, June Yong Yang, Se Jung Kwon, Eunho Yang, Kang Min Yoo, Dongsoo Lee. [GitHub] [Paper]
- **Spectra: A Comprehensive Study of Ternary, Quantized, and FP16 Language Models** by Ayush Kaushal, Tejas Pandey, Tejas Vaidhya, Aaryan Bhagat, Irina Rish. [GitHub] [Paper]
- **Fast Matrix Multiplications for Lookup Table-Quantized LLMs** by Han Guo, William Brandon, Radostin Cholakov, Jonathan Ragan-Kelley, Eric P. Xing, Yoon Kim. [GitHub] [Paper]
- **LeanQuant: Accurate Large Language Model Quantization with Loss-Error-Aware Grid** by Tianyi Zhang, Anshumali Shrivastava. [Paper]
- **Prefixing Attention Sinks can Mitigate Activation Outliers for Large Language Model Quantization** by Seungwoo Son, Wonpyo Park, Woohyun Han, Kyuyeun Kim, Jaeho Lee. [Paper]
- **RoLoRA: Fine-tuning Rotated Outlier-free LLMs for Effective Weight-Activation Quantization** by Xijie Huang, Zechun Liu, Shih-Yang Liu, Kwang-Ting Cheng. [GitHub] [Paper]
- **FBI-LLM: Scaling Up Fully Binarized LLMs from Scratch via Autoregressive Distillation** by Liqun Ma, Mingjie Sun, Zhiqiang Shen. [GitHub] [Paper]
- **GPTQT: Quantize Large Language Models Twice to Push the Efficiency** by Yipin Guo, Yilin Lang, Qinyuan Ren. [Paper]
- **T-MAC: CPU Renaissance via Table Lookup for Low-Bit LLM Deployment on Edge** by Jianyu Wei, Shijie Cao, Ting Cao, Lingxiao Ma, Lei Wang, Yanyong Zhang, Mao Yang. [GitHub] [Paper]
- **Variable Layer-Wise Quantization: A Simple and Effective Approach to Quantize LLMs** by Razvan-Gabriel Dumitru, Vikas Yadav, Rishabh Maheshwary, Paul-Ioan Clotan, Sathwik Tejaswi Madhusudhan, Mihai Surdeanu. [GitHub] [Paper]
- **CDQuant: Accurate Post-training Weight Quantization of Large Pre-trained Models using Greedy Coordinate Descent** by Pranav Ajit Nair, Arun Sai Suggala. [Paper]
- **SDQ: Sparse Decomposed Quantization for LLM Inference** by Geonhwa Jeong, Po-An Tsai, Stephen W. Keckler, Tushar Krishna. [Paper]
- **Attention-aware Post-training Quantization without Backpropagation** by Junhan Kim, Ho-young Kim, Eulrang Cho, Chungman Lee, Joonyoung Kim, Yongkweon Jeon. [Paper]
- **Mixture of Scales: Memory-Efficient Token-Adaptive Binarization for Large Language Models** by Dongwon Jo, Taesu Kim, Yulhwa Kim, Jae-Joon Kim. [Paper]

Inference Acceleration

- **Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time** by Zichang Liu, Jue WANG, Tri Dao, Tianyi Zhou, Binhang Yuan, Zhao Song, Anshumali Shrivastava, Ce Zhang, Yuandong Tian, Christopher Re, Beidi Chen. [GitHub] [Paper]
- **SpecInfer: Accelerating Generative LLM Serving with Speculative Inference and Token Tree Verification** by Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Zeyu Wang, Rae Ying Yee Wong, Zhuoming Chen, Daiyaan Arfeen, Reyna Abhyankar, Zhihao Jia. [GitHub] [Paper]
- **Efficient Streaming Language Models with Attention Sinks** by Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, Mike Lewis. [GitHub] [Paper]
- **EAGLE: Lossless Acceleration of LLM Decoding by Feature Extrapolation** by Yuhui Li, Chao Zhang, Hongyang Zhang. [GitHub] [Blog]
- **Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads** by Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming Chen, Tri Dao. [GitHub] [Paper]
- **CritiPrefill: A Segment-wise Criticality-based Approach for Prefilling Acceleration in LLMs** by Junlin Lv, Yuan Feng, Xike Xie, Xin Jia, Qirong Peng, Guiming Xie. [GitHub] [Paper]
- **RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval** by Di Liu, Meng Chen, Baotong Lu, Huiqiang Jiang, Zhenhua Han, Qianxi Zhang, Qi Chen, Chengruidong Zhang, Bailu Ding, Kai Zhang, Chen Chen, Fan Yang, Yuqing Yang, Lili Qiu. [Paper]
- **Sirius: Contextual Sparsity with Correction for Efficient LLMs** by Yang Zhou, Zhuoming Chen, Zhaozhuo Xu, Victoria Lin, Beidi Chen. [GitHub] [Paper]
- **OneGen: Efficient One-Pass Unified Generation and Retrieval for LLMs** by Jintian Zhang, Cheng Peng, Mengshu Sun, Xiang Chen, Lei Liang, Zhiqiang Zhang, Jun Zhou, Huajun Chen, Ningyu Zhang. [GitHub] [Paper]
- **Path-Consistency: Prefix Enhancement for Efficient Inference in LLM** by Jiace Zhu, Yingtao Shen, Jie Zhao, An Zou. [Paper]
- **Boosting Lossless Speculative Decoding via Feature Sampling and Partial Alignment Distillation** by Lujun Gui, Bin Xiao, Lei Su, Weipeng Chen. [Paper]
- **Turning Trash into Treasure: Accelerating Inference of Large Language Models with Token Recycling** by Xianzhen Luo, Yixuan Wang, Qingfu Zhu, Zhiming Zhang, Xuanyu Zhang, Qing Yang, Dongliang Xu, Wanxiang Che. [Paper]
- **Speculative Diffusion Decoding: Accelerating Language Generation through Diffusion** by Jacob K Christopher, Brian R Bartoldson, Bhavya Kailkhura, Ferdinando Fioretto. [Paper]
- **Clover-2: Accurate Inference for Regressive Lightweight Speculative Decoding** by Bin Xiao, Lujun Gui, Lei Su, Weipeng Chen. [GitHub] [Paper]
- **Accelerating Large Language Model Inference with Self-Supervised Early Exits** by Florian Valade. [Paper]
- **An Efficient Inference Framework for Early-exit Large Language Models** by Ruijie Miao, Yihan Yan, Xinshuo Yao, Tong Yang. [Paper]
- **Inference acceleration for large language models using "stairs" assisted greedy generation** by Domas Grigaliūnas, Mantas Lukoševičius. [Paper]
- **LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference** by Qichen Fu, Minsik Cho, Thomas Merth, Sachin Mehta, Mohammad Rastegari, Mahyar Najibi. [Paper]
- **Adaptive Draft-Verification for Efficient Large Language Model Decoding** by Xukun Liu, Bowen Lei, Ruqi Zhang, Dongkuan Xu. [Paper]
- **Multi-Token Joint Speculative Decoding for Accelerating Large Language Model Inference** by Zongyue Qin, Ziniu Hu, Zifan He, Neha Prakriya, Jason Cong, Yizhou Sun. [Paper]
- **LiveMind: Low-latency Large Language Models with Simultaneous Inference** by Chuangtao Chen, Grace Li Zhang, Xunzhao Yin, Cheng Zhuo, Ulf Schlichtmann, Bing Li. [GitHub] [Paper]
- **S2D: Sorted Speculative Decoding For More Efficient Deployment of Nested Large Language Models** by Parsa Kavehzadeh, Mohammadreza Pourreza, Mojtaba Valipour, Tinashu Zhu, Haoli Bai, Ali Ghodsi, Boxing Chen, Mehdi Rezagholizadeh. [Paper]
- **Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers** by Chao Lou, Zixia Jia, Zilong Zheng, Kewei Tu. [Paper]
- **EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees** by Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang Zhang. [GitHub] [Paper]
- **Interpreting Attention Layer Outputs with Sparse Autoencoders** by Connor Kissane, Robert Krzyzanowski, Joseph Isaac Bloom, Arthur Conmy, Neel Nanda. [Paper]
- **Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention** by Qianchao Zhu, Jiangfei Duan, Chang Chen, Siran Liu, Xiuhong Li, Guanyu Feng, Xin Lv, Huanqi Cao, Xiao Chuanfu, Xingcheng Zhang, Dahua Lin, Chao Yang. [Paper]
- **MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression** by Tianyu Fu, Haofeng Huang, Xuefei Ning, Genghan Zhang, Boju Chen et al. [GitHub] [Paper]
- **Optimized Speculative Sampling for GPU Hardware Accelerators** by Dominik Wagner, Seanie Lee, Ilja Baumann, Philipp Seeberger, Korbinian Riedhammer, Tobias Bocklet. [Paper]

Efficient MOE

- **Fast Inference of Mixture-of-Experts Language Models with Offloading** by Artyom Eliseev, Denis Mazur. [GitHub] [Paper]
- **Diversifying the Expert Knowledge for Task-Agnostic Pruning in Sparse Mixture-of-Experts** by Zeliang Zhang, Xiaodong Liu, Hao Cheng, Chenliang Xu, Jianfeng Gao. [Paper]
- **Efficient Expert Pruning for Sparse Mixture-of-Experts Language Models: Enhancing Performance and Reducing Inference Costs** by Enshu Liu, Junyi Zhu, Zinan Lin, Xuefei Ning, Matthew B. Blaschko, Shengen Yan, Guohao Dai, Huazhong Yang, Yu Wang. [GitHub] [Paper]

Efficient Architecture of LLM

- **MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT** by Omkar Thawakar, Ashmal Vayani, Salman Khan, Hisham Cholakal, Rao M. Anwer, Michael Felsberg, Tim Baldwin, Eric P. Xing, Fahad Shahbaz Khan. [GitHub] [Paper] [Model]
- **Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length** by Xuezhe Ma, Xiaomeng Yang, Wenhan Xiong, Beidi Chen, Lili Yu, Hao Zhang, Jonathan May, Luke Zettlemoyer, Omer Levy, Chunting Zhou. [GitHub] [Paper]
- **SentenceVAE: Enable Next-sentence Prediction for Large Language Models with Faster Speed, Higher Accuracy and Longer Context** by Hongjun An, Yifan Chen, Zhe Sun, Xuelong Li. [Paper]
- **Efficient LLM Training and Serving with Heterogeneous Context Sharding among Attention Heads** by Xihui Lin, Yunan Zhang, Suyu Ge, Barun Patra, Vishrav Chaudhary, Xia Song. [GitHub] [Paper]
- **Beyond KV Caching: Shared Attention for Efficient LLMs** by Bingli Liao, Danilo Vasconcellos Vargas. [GitHub] [Paper]

KV Cache Compression

- **Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs** by Suyu Ge, Yunan Zhang, Liyuan Liu, Minjia Zhang, Jiawei Han, Jianfeng Gao. [Paper]
- **CSKV: Training-Efficient Channel Shrinking for KV Cache in Long-Context Scenarios** by Luning Wang, Shiyao Li, Xuefei Ning, Zhihang Yuan, Shengen Yan, Guohao Dai, Yu Wang. [Paper]
- **A First Look At Efficient And Secure On-Device LLM Inference Against KV Leakage** by Huan Yang, Deyu Zhang, Yudong Zhao, Yuanchun Li, Yunxin Liu. [Paper]
- **Post-Training Sparse Attention with Double Sparsity** by Shuo Yang, Ying Sheng, Joseph E. Gonzalez, Ion Stoica, Lianmin Zheng. [GitHub] [Paper]
- **Eigen Attention: Attention in Low-Rank Space for KV Cache Compression** by Utkarsh Saxena, Gobinda Saha, Sakshi Choudhary, Kaushik Roy. [GitHub] [Paper]
- **Zero-Delay QKV Compression for Mitigating KV Cache and Network Bottlenecks in LLM Inference** by Zeyu Zhang, Haiying Shen. [Paper]
- **Finch: Prompt-guided Key-Value Cache Compression** by Giulio Corallo, Paolo Papotti. [Paper]
- **Palu: Compressing KV-Cache with Low-Rank Projection** by Chi-Chih Chang, Wei-Cheng Lin, Chien-Yu Lin, Chong-Yan Chen, Yu-Fang Hu, Pei-Shuo Wang, Ning-Chi Huang, Luis Ceze, Kai-Chiang Wu. [GitHub] [Paper]
- **ThinK: Thinner Key Cache by Query-Driven Pruning** by Yuhui Xu, Zhanming Jie, Hanze Dong, Lei Wang, Xudong Lu, Aojun Zhou, Amrita Saha, Caiming Xiong, Doyen Sahoo. [Paper]
- **RazorAttention: Efficient KV Cache Compression Through Retrieval Heads** by Hanlin Tang, Yang Lin, Jing Lin, Qingsen Han, Shikuan Hong, Yiwu Yao, Gongyi Wang. [Paper]
- **PQCache: Product Quantization-based KVCache for Long Context LLM Inference** by Hailin Zhang, Xiaodong Ji, Yilin Chen, Fangcheng Fu, Xupeng Miao, Xiaonan Nie, Weipeng Chen, Bin Cui. [Paper]
- **GoldFinch: High Performance RWKV/Transformer Hybrid with Linear Pre-Fill and Extreme KV-Cache Compression** by Daniel Goldstein, Fares Obeid, Eric Alcaide, Guangyu Song, Eugene Cheah. [GitHub] [Paper]
- **Efficient Sparse Attention needs Adaptive Token Release** by Chaoran Zhang, Lixin Zou, Dan Luo, Min Tang, Xiangyang Luo, Zihao Li, Chenliang Li. [GitHub] [Paper]
- **KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark of Long Context Capable Approaches** by Jiayi Yuan, Hongyi Liu, Shaochen (Henry) Zhong, Yu-Neng Chuang, Songchen Li et al. [GitHub] [Paper]

Text Compression

- **LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models** by Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang, Lili Qiu. [GitHub] [Paper]
- **LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression** by Huiqiang Jiang, Qianhui Wu, Xufang Luo, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, Lili Qiu. [GitHub] [Paper]
- **Efficient LLM Context Distillation** by Rajesh Upadhayayaya, Zachary Smith, Christopher Kottmyer, Manish Raj Osti. [Paper]
- **Enhancing and Accelerating Large Language Models via Instruction-Aware Contextual Compression** by Haowen Hou, Fei Ma, Binwen Bai, Xinxin Zhu, Fei Yu. [GitHub] [Paper]
- **500xCompressor: Generalized Prompt Compression for Large Language Models** by Zongqian Li, Yixuan Su, Nigel Collier. [GitHub] [Paper]
- **QUITO: Accelerating Long-Context Reasoning through Query-Guided Context Compression** by Wenshan Wang, Yihang Wang, Yixing Fan, Huaming Liao, Jiafeng Guo. [GitHub] [Paper]
- **Characterizing Prompt Compression Methods for Long Context Inference** by Siddharth Jha, Lutfi Eren Erdogan, Sehoon Kim, Kurt Keutzer, Amir Gholami. [Paper]
- **Entropy Law: The Story Behind Data Compression and LLM Performance** by Mingjia Yin, Chuhan Wu, Yufei Wang, Hao Wang, Wei Guo, Yasheng Wang, Yong Liu, Ruiming Tang, Defu Lian, Enhong Chen. [Paper]
- **PromptIntern: Saving Inference Costs by Internalizing Recurrent Prompt during Large Language Model Fine-tuning** by Jiaru Zou, Mengyu Zhou, Tao Li, Shi Han, Dongmei Zhang. [Paper]
- **Brevity is the soul of wit: Pruning long files for code generation** by Aaditya K. Singh, Yu Yang, Kushal Tirumala, Mostafa Elhoushi, Ari S. Morcos. [Paper]

Low-Rank Decomposition

- **MoDeGPT: Modular Decomposition for Large Language Model Compression** by Chi-Heng Lin, Shangqian Gao, James Seale Smith, Abhishek Patel, Shikhar Tuli, Yilin Shen, Hongxia Jin, Yen-Chang Hsu. [Paper]
- **MCNC: Manifold Constrained Network Compression** by Chayne Thrash, Ali Abbasi, Parsa Nooralinejad, Soroush Abbasi Koohpayegani, Reed Andreas, Hamed Pirsiavash, Soheil Kolouri. [Paper]

Hardware/System

- **OPAL: Outlier-Preserved Microscaling Quantization Accelerator for Generative Large Language Models** by Jahyun Koo, Dahoon Park, Sangwoo Jung, Jaeha Kung. [Paper]
- **Accelerating Large Language Model Training with Hybrid GPU-based Compression** by Lang Xu, Quentin Anthony, Qinghua Zhou, Nawras Alnaasan, Radha R. Gulhane, Aamir Shafi, Hari Subramoni, Dhabaleswar K. Panda. [Paper]
- **LUT Tensor Core: Lookup Table Enables Efficient Low-Bit LLM Inference Acceleration** by Zhiwen Mo, Lei Wang, Jianyu Wei, Zhichen Zeng, Shijie Cao, Lingxiao Ma, Naifeng Jing, Ting Cao, Jilong Xue, Fan Yang, Mao Yang. [Paper]
- **Kraken: Inherently Parallel Transformers For Efficient Multi-Device Inference** by Rohan Baskar Prabhakar, Hengrui Zhang, David Wentzlaff. [Paper]
- **SLO-aware GPU Frequency Scaling for Energy Efficient LLM Inference Serving** by Andreas Kosmas Kakolyris, Dimosthenis Masouros, Petros Vavaroutsos, Sotirios Xydis, Dimitrios Soudris. [Paper]
- **Designing Efficient LLM Accelerators for Edge Devices** by Jude Haris, Rappy Saha, Wenhao Hu, José Cano. [Paper]
- **PipeInfer: Accelerating LLM Inference using Asynchronous Pipelined Speculation** by Branden Butler, Sixing Yu, Arya Mazaheri, Ali Jannesari. [Paper]
- **FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision** by Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, Tri Dao. [GitHub] [Paper] [Blog]
- **Preble: Efficient Distributed Prompt Scheduling for LLM Serving** by Vikranth Srivatsa, Zijian He, Reyna Abhyankar, Dongming Li, Yiying Zhang. [Paper]
- **EDGE-LLM: Enabling Efficient Large Language Model Adaptation on Edge Devices via Layerwise Unified Compression and Adaptive Layer Tuning and Voting** by Zhongzhi Yu, Zheng Wang, Yuhan Li, Haoran You, Ruijie Gao, Xiaoya Zhou, Sreenidhi Reedy Bommu, Yang Katie Zhao, Yingyan Celine Lin. [GitHub] [Paper]
- **Tender: Accelerating Large Language Models via Tensor Decomposition and Runtime Requantization** by Jungi Lee, Wonbeom Lee, Jaewoong Sim. [Paper]

Tuning

- **Tensor Train Low-rank Approximation (TT-LoRA): Democratizing AI with Accelerated LLMs** by Afia Anjum, Maksim E. Eren, Ismael Boureima, Boian Alexandrov, Manish Bhattarai. [Paper]
- **Code Less, Align More: Efficient LLM Fine-tuning for Code Generation with Data Pruning** by Yun-Da Tsai, Mingjie Liu, Haoxing Ren. [Paper]
- **PocketLLM: Enabling On-Device Fine-Tuning for Personalized LLMs** by Dan Peng, Zhihui Fu, Jun Wang. [Paper]
- **Increasing Model Capacity for Free: A Simple Strategy for Parameter Efficient Fine-tuning** by Haobo Song, Hao Zhao, Soumajit Majumder, Tao Lin. [GitHub] [Paper]
- **Compress then Serve: Serving Thousands of LoRA Adapters with Little Overhead** by Rickard Brüel-Gabrielsson, Jiacheng Zhu, Onkar Bhardwaj, Leshem Choshen et al. [Paper]
- **BlockLLM: Memory-Efficient Adaptation of LLMs by Selecting and Optimizing the Right Coordinate Blocks** by Amrutha Varshini Ramesh, Vignesh Ganapathiraman, Issam H. Laradji, Mark Schmidt. [Paper]

Survey

- **Hardware Acceleration of LLMs: A comprehensive survey and comparison** by Nikoletta Koilia, Christoforos Kachris. [Paper]
- **A Survey on Symbolic Knowledge Distillation of Large Language Models** by Kamal Acharya, Alvaro Velasquez, Houbing Herbert Song. [Paper]
- **Inference Optimization of Foundation Models on AI Accelerators** by Youngsuk Park, Kailash Budhathoki, Liangfu Chen, Jonas Kübler, Jiaji Huang, Matthäus Kleindessner, Jun Huan, Volkan Cevher, Yida Wang, George Karypis. [Paper]
- **Survey on Knowledge Distillation for Large Language Models: Methods, Evaluation, and Application** by Chuanpeng Yang, Wang Lu, Yao Zhu, Yidong Wang, Qian Chen, Chenlong Gao, Bingjie Yan, Yiqiang Chen. [Paper]
- **Art and Science of Quantizing Large-Scale Models: A Comprehensive Overview** by Yanshu Wang, Tong Yang, Xiyan Liang, Guoan Wang, Hanning Lu, Xu Zhe, Yaoming Li, Li Weitao. [Paper]