Performance Optimizations for TP-Aware GPTQ #67

Draft
wants to merge 5 commits into main
Conversation

@cyang49 (Contributor) commented Mar 22, 2024

This is a draft. Please do not merge.

Motivation

The current tgis-native stack provides GPTQ support for llama and starcoder models by utilizing the fast exllamav2 kernel (and also Marlin, once #66 is merged). This works well in single-GPU deployments. However, in multi-GPU TP deployments, performance is known to be poor for GPTQ checkpoints that require activation reordering (desc_act=True in the quantization config), which includes many publicly available GPTQ checkpoints.

The reason for the poor performance is that the fast exllamav2 (or Marlin) kernels cannot be used in row-parallel layers: their weight-matrix row shuffling would require an extra all-gather to globally reorder the input activations of row-parallel layers under TP, and that all-gather can be prohibitively expensive. As a result, TGIS falls back to the much slower Triton matmul_248 kernel, which does not require shuffling, for those layers. This mix of CUDA and Triton kernels across the QuantLinear layers works, but it is too slow to be a practical solution. vLLM uses a similar approach, except with an alternative GPTQ CUDA kernel instead of the Triton kernel, and it still suffers from suboptimal performance.
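To make the constraint concrete, the sketch below (PyTorch, with toy shapes and an arbitrary permutation standing in for the desc_act reordering; not code from this PR) shows why a reordered weight shard on one TP rank ends up needing input features that live on other ranks, which is what forces the all-gather:

```python
# Illustrative sketch only: toy shapes and a made-up permutation in place of the
# real desc_act activation reordering.
import torch

tp_size, in_features = 2, 8
perm = torch.tensor([0, 2, 4, 6, 1, 3, 5, 7])  # hypothetical activation-reordering permutation
shard = in_features // tp_size

for rank in range(tp_size):
    owned = list(range(rank * shard, (rank + 1) * shard))             # input features this rank holds locally
    needed = sorted(perm[rank * shard:(rank + 1) * shard].tolist())   # features its reordered weight shard consumes
    print(f"rank {rank}: owns {owned}, needs {needed}")
# rank 0: owns [0, 1, 2, 3], needs [0, 2, 4, 6]
# rank 1: owns [4, 5, 6, 7], needs [1, 3, 5, 7]
# Each rank needs features held by the other rank, hence the all-gather.
```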

In this PR, we implement TP-aware GPTQ inference optimizations. These include the MLP-layer technique introduced in the arXiv paper we published previously, combined with a newer technique, masked matmul, for optimizing the attention layers.
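The MLP part of the idea can be illustrated with plain dense matrices: because the reordering is a fixed permutation, it can be folded offline into the output columns of the preceding column-parallel weight and the rows of the row-parallel weight, so no runtime activation reorder (and no all-gather) is needed. The sketch below is a minimal numeric check with hypothetical shapes, ignoring quantization and the elementwise activation between the two layers (which commutes with a permutation):

```python
# Minimal sketch (PyTorch, hypothetical shapes) of the communication-avoiding idea for the MLP:
# fold the row-parallel layer's input permutation into the weights offline so that
# (x @ W_up[:, perm]) @ W_down[perm, :] == (x @ W_up) @ W_down with no runtime reorder.
import torch

torch.manual_seed(0)
x = torch.randn(1, 16)
W_up = torch.randn(16, 32)      # stands in for a column-parallel layer (e.g. up_proj)
W_down = torch.randn(32, 16)    # stands in for a row-parallel layer (e.g. down_proj)
perm = torch.randperm(32)       # stands in for the desc_act reordering of W_down's rows

ref = (x @ W_up) @ W_down
opt = (x @ W_up[:, perm]) @ W_down[perm, :]   # permutation folded into the weights offline
print(torch.allclose(ref, opt, atol=1e-5))    # True
```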

Preliminary results using exllamav2 show that our techniques enable deploying Llama-70b GPTQ on L40Sx2 at 24.67 tokens/s, a ~30% throughput improvement over deploying the FP16 model on A100-80GBx2 (19 tokens/s), providing a cost-saving alternative for deploying llama-70b. We expect even better results with Marlin.

Modifications

The code changes are primarily control-path adjustments that manipulate how weight tensors are loaded, plus environment-variable flags to toggle the different modes.
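As a rough illustration of what such a toggle could look like (the flag name below is hypothetical, not the actual variable used in this PR):

```python
# Hypothetical sketch of an environment-variable toggle for the kernel path;
# the flag name TP_AWARE_GPTQ is made up for illustration.
import os

TP_AWARE_GPTQ = os.getenv("TP_AWARE_GPTQ", "0") == "1"

def row_parallel_gptq_kernel() -> str:
    # With the TP-aware weight layout, the fast exllamav2 kernel can also serve
    # row-parallel layers; otherwise fall back to the Triton matmul_248 kernel.
    return "exllamav2" if TP_AWARE_GPTQ else "triton_matmul_248"
```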

Known issues:

  • The weight shuffling can slow down model loading significantly; the pack/unpack functions should be moved to the GPU (see the sketch after this list).
  • The code should be thoroughly tested for unsupported cases, as the control path is heavily modified.
  • I welcome suggestions to make the control path modifications cleaner.
  • Santacoder support is not implemented yet.
  • The desc_act=False path may not have been sufficiently tested.
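For the pack/unpack point above, a GPU-side unpack could look roughly like the following (a sketch assuming the standard GPTQ 4-bit packing of eight values per int32; the function name and shapes are illustrative, not this PR's code):

```python
# Sketch of unpacking 4-bit GPTQ weights on the GPU with tensor bit ops instead of a CPU loop.
import torch

def unpack_4bit(qweight: torch.Tensor) -> torch.Tensor:
    """qweight: (in_features // 8, out_features) int32 -> (in_features, out_features) uint8."""
    shifts = torch.arange(0, 32, 4, device=qweight.device, dtype=torch.int32)  # 8 nibbles per int32
    # Broadcast (rows, 1, cols) >> (1, 8, 1) -> (rows, 8, cols), mask each nibble, then flatten rows.
    unpacked = (qweight.unsqueeze(1) >> shifts.view(1, 8, 1)) & 0xF
    return unpacked.reshape(-1, qweight.shape[1]).to(torch.uint8)

# Usage: move qweight to the GPU first, e.g. unpack_4bit(qweight.cuda()), so the
# shuffling at load time runs entirely on the device.
```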

Result

| Configuration | Prefill | Token latency | Throughput |
| --- | --- | --- | --- |
| FP16: L40Sx4 | 1.96s | 62.33ms | 16.04 tokens/s |
| GPTQ, TP-aware: L40Sx2 | 2.11s | 40.55ms | 24.67 tokens/s |
| GPTQ, original: L40Sx2 | 3.48s | 84.21ms | 11.88 tokens/s |
  • GPTQ, TP-aware: uses the communication-avoiding techniques and exllamav2 for dequantization + GEMM
  • GPTQ, original: uses exllamav2 for column-parallel layers and the Triton kernel for row-parallel layers (also avoids the all-gather)

We plan to update the results when the Marlin PR is merged.

Related Issues

#66 needs to be merged to enable Marlin support.

Signed-off-by: Chih-Chieh-Yang <[email protected]>