File size: 5,709 Bytes
569f484
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
## SWIFT install
You can quickly install SWIFT using bash commands.

``` bash
git clone https://github.com/modelscope/swift.git
cd swift
pip install -r requirements.txt
pip install -e '.[llm]'
```

## SWIFT Infer
Inference using SWIFT can be carried out in two ways: through a command line interface and via Python code.

### Quick start
Here are the steps to launch SWIFT from the Bash command line:

1. Run the bash code will download the model of MiniCPM-Llama3-V-2_5 and run the inference
``` shell
CUDA_VISIBLE_DEVICES=0 swift infer --model_type minicpm-v-v2_5-chat
```

2. You can also run the code with more arguments below to run the inference:
``` 
model_id_or_path  # Can be the model ID from Hugging Face or the local path to the model
infer_backend ['AUTO', 'vllm', 'pt']  # Backend for inference, default is auto
dtype ['bf16', 'fp16', 'fp32', 'AUTO']  # Computational precision
max_length  # Maximum length
max_new_tokens: int = 2048  # Maximum number of tokens to generate
do_sample: bool = True  # Whether to sample during generation
temperature: float = 0.3  # Temperature coefficient during generation
top_k: int = 20 
top_p: float = 0.7
repetition_penalty: float = 1.  # Penalty for repetition
num_beams: int = 1  # Number of beams for beam search
stop_words: List[str] = None  # List of stop words
quant_method ['bnb', 'hqq', 'eetq', 'awq', 'gptq', 'aqlm']  # Quantization method for the model
quantization_bit [0, 1, 2, 3, 4, 8]  # Default is 0, which means no quantization is used
```
3. Example:
``` shell
CUDA_VISIBLE_DEVICES=0,1 swift infer \
--model_type minicpm-v-v2_5-chat \
--model_id_or_path /root/ld/ld_model_pretrain/MiniCPM-Llama3-V-2_5 \
--dtype bf16 
```
### Python code with SWIFT infer
The following demonstrates using Python code to initiate inference with the MiniCPM-Llama3-V-2_5 model through SWIFT.

```python
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1'  # Set the number of GPUs to use

from swift.llm import (
    get_model_tokenizer, get_template, inference, ModelType,
    get_default_template_type, inference_stream
)  # Import necessary modules

from swift.utils import seed_everything  # Set random seed
import torch

model_type = ModelType.minicpm_v_v2_5_chat
template_type = get_default_template_type(model_type)  # Obtain the template type, primarily used for constructing special tokens and image processing workflow
print(f'template_type: {template_type}')

model, tokenizer = get_model_tokenizer(model_type, torch.bfloat16,
                                       model_id_or_path='/root/ld/ld_model_pretrain/MiniCPM-Llama3-V-2_5',
                                       model_kwargs={'device_map': 'auto'})  # Load the model, set model type, model path, model parameters, device allocation, etc., computation precision, etc.
model.generation_config.max_new_tokens = 256
template = get_template(template_type, tokenizer)  # Construct the template based on the template type
seed_everything(42)

images = ['http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/road.png']  # Image URL
query = '距离各城市多远?'  # Note: Query is still in Chinese, consider translating if needed
response, history = inference(model, template, query, images=images)  # Obtain results through inference
print(f'query: {query}')
print(f'response: {response}')

# Streaming output
query = '距离最远的城市是哪?'  # Note: Query is still in Chinese, consider translating if needed
gen = inference_stream(model, template, query, history, images=images)  # Call the streaming output interface
print_idx = 0
print(f'query: {query}\nresponse: ', end='')
for response, history in gen:
    delta = response[print_idx:]
    print(delta, end='', flush=True)
    print_idx = len(response)
print()
print(f'history: {history}')
```

## SWIFT train
SWIFT supports training on the local dataset,the training steps are as follows:
1. Make the train data like this:
```jsonl
{"query": "What does this picture describe?", "response": "This picture has a giant panda.", "images": ["local_image_path"]}
{"query": "What does this picture describe?", "response": "This picture has a giant panda.", "history": [], "images": ["image_path"]}
{"query": "Is bamboo tasty?", "response": "It seems pretty tasty judging by the panda's expression.", "history": [["What's in this picture?", "There's a giant panda in this picture."], ["What is the panda doing?", "Eating bamboo."]], "images": ["image_url"]}
```
2. LoRA Tuning:

The LoRA target model are k and v weight in LLM you should pay attention to the eval_steps,maybe you should set the eval_steps to a large value, like 200000,beacuase in the eval time , SWIFT will return a memory bug so you should set the eval_steps to a very large value.
```shell
# Experimental environment: A100
# 32GB GPU memory
CUDA_VISIBLE_DEVICES=0 swift sft \
--model_type minicpm-v-v2_5-chat \
--dataset coco-en-2-mini \
```
3. All parameters finetune:

When the argument of lora_target_modules is ALL, the model will finetune all the parameters.
```shell
CUDA_VISIBLE_DEVICES=0,1 swift sft \
--model_type minicpm-v-v2_5-chat \
--dataset coco-en-2-mini \
--lora_target_modules ALL \
--eval_steps 200000
```

## LoRA Merge and Infer
The LoRA weight can be merge to the base model and then load to infer.

1. Load the LoRA weight to infer run the follow code:
```shell
CUDA_VISIBLE_DEVICES=0 swift infer \
--ckpt_dir /your/lora/save/checkpoint
```
2. Merge the LoRA weight to the base model:

The code will load and merge the LoRA weight to the base model, save the merge model to the LoRA save path and load the merge model to infer
```shell
CUDA_VISIBLE_DEVICES=0 swift infer \
--ckpt_dir your/lora/save/checkpoint \
--merge_lora true
```