File size: 2,477 Bytes

a5aabe5
38527bb
 
a5aabe5
38527bb
 
 
a5aabe5
38527bb
a5aabe5
38527bb
a5aabe5
38527bb
a5aabe5
38527bb
a5aabe5
38527bb
a5aabe5
38527bb
a5aabe5
38527bb
a5aabe5
38527bb
 
 
a5aabe5
38527bb
a5aabe5
38527bb
a5aabe5
38527bb
 
 
a5aabe5
38527bb
 
a5aabe5
38527bb
a5aabe5
38527bb
 
 
 
 
 
 
a5aabe5
38527bb
a5aabe5
 
 
38527bb
a5aabe5
38527bb
a5aabe5
38527bb
a5aabe5
38527bb
 
 
a5aabe5
38527bb
a5aabe5
38527bb
 
 
a5aabe5
38527bb

---
license: apache-2.0
pipeline_tag: text-generation
---
<div align="center">
  <img src="https://raw.githubusercontent.com/InternLM/lmdeploy/0be9e7ab6fe9a066cfb0a09d0e0c8d2e28435e58/resources/lmdeploy-logo.svg" width="450"/>
</div>

# INT4 Weight-only Quantization and Deployment (W4A16)

LMDeploy adopts [AWQ](https://arxiv.org/abs/2306.00978) algorithm for 4bit weight-only quantization. By developed the high-performance cuda kernel, the 4bit quantized model inference achieves up to 2.4x faster than FP16.

LMDeploy supports the following NVIDIA GPU for W4A16 inference:

- Turing(sm75): 20 series, T4

- Ampere(sm80,sm86): 30 series, A10, A16, A30, A100

- Ada Lovelace(sm90): 40 series

Before proceeding with the quantization and inference, please ensure that lmdeploy is installed.

```shell
pip install lmdeploy[all]
```

This article comprises the following sections:

<!-- toc -->

- [Inference](#inference)
- [Evaluation](#evaluation)
- [Service](#service)

<!-- tocstop -->
## Inference

Trying the following codes, you can perform the batched offline inference with the quantized model:

```python
from lmdeploy import pipeline, TurbomindEngineConfig
engine_config = TurbomindEngineConfig(model_format='awq')
pipe = pipeline("internlm/internlm2_5-7b-chat-4bit", backend_config=engine_config)
response = pipe(["Hi, pls intro yourself", "Shanghai is"])
print(response)
```

For more information about the pipeline parameters, please refer to [here](https://github.com/InternLM/lmdeploy/blob/main/docs/en/inference/pipeline.md).

## Evaluation

Please overview [this guide](https://opencompass.readthedocs.io/en/latest/advanced_guides/evaluation_turbomind.html) about model evaluation with LMDeploy.

## Service

LMDeploy's `api_server` enables models to be easily packed into services with a single command. The provided RESTful APIs are compatible with OpenAI's interfaces. Below are an example of service startup:

```shell
lmdeploy serve api_server internlm/internlm2_5-7b-chat-4bit --backend turbomind --model-format awq
```

The default port of `api_server` is `23333`. After the server is launched, you can communicate with server on terminal through `api_client`:

```shell
lmdeploy serve api_client http://0.0.0.0:23333
```

You can overview and try out `api_server` APIs online by swagger UI at `http://0.0.0.0:23333`, or you can also read the API specification from [here](https://github.com/InternLM/lmdeploy/blob/main/docs/en/serving/restful_api.md).