---
license: other
---


![Aquila_logo](./log.jpeg)


# Aquila

The Aquila language model is the first open-source language model that combines Chinese and English knowledge, a license that permits commercial use, and compliance with domestic (Chinese) data regulations.

- 🌟 **Supports open source commercial licenses**. The source code of the Aquila series models is released under the [Apache 2.0 license](https://www.apache.org/licenses/LICENSE-2.0), while the model weights are released under the [BAAI Aquila Model License Agreement](https://huggingface.co/BAAI/AquilaChat-7B/resolve/main/BAAI%20Aquila%20Model%20License%20Agreement.pdf). Commercial use is permitted as long as the license terms are met.

- ✍️ **Possesses Chinese and English knowledge**. The Aquila series models are trained from scratch on a high-quality Chinese and English corpus, with Chinese text accounting for about 40%, so that the models acquire native Chinese world knowledge during pre-training rather than translated knowledge.

- 👮‍♀️ **Complies with domestic data regulations**. The Chinese corpora of the Aquila series models come from the Chinese datasets BAAI has accumulated over the years, including Chinese internet data from over 10,000 sources (more than 99% of which are domestic), as well as high-quality Chinese literature and book data supported by authoritative domestic organizations. We will continue to accumulate high-quality, diverse datasets and incorporate them into subsequent training of the Aquila base models.

- 🎯 **Continuous improvements and open sourcing**. We will continue to improve the training data, optimize the training methods, and enhance model performance, so that a flourishing "model tree" can grow from a stronger base model, and we will keep updating the open-source versions.

Additional details of the Aquila models will be presented in the official technical report. Please stay tuned for updates on the official channels, including the [FlagAI GitHub repository](https://github.com/FlagAI-Open/FlagAI/), [FlagAI's Zhihu account](https://www.zhihu.com/people/95-22-20-18) and [FlagAI's official technical communication group](https://github.com/FlagAI-Open/FlagAI/blob/master/wechat-qrcode.jpg).


| Model              | Model Type               | Description                                                                                                                                                                                                                                                                                                                                                                                                     | Status         | GPUs Used    |
| :----------------- | :----------------------- | :------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | :--------------| :----------- | 
| Aquila-7B          | Base model, 7 billion parameters   | **Aquila Base Model** inherits the architectural design advantages of GPT-3 and LLaMA. It swaps in a set of more efficient low-level operator implementations, redesigns the bilingual tokenizer, upgrades the BMTrain parallel training method, and achieves nearly 8 times the training efficiency of Megatron+DeepSpeed ZeRO-2. | Released       | Nvidia-A100   |
| Aquila-33B         | Base model, 33 billion parameters   | Same as above                                                                                                                                                                                                                                                                                                                                                                        | Coming soon                                               | Nvidia-A100   |
| AquilaChat-7B      | SFT model, obtained by fine-tuning and reinforcement learning on Aquila-7B  | **AquilaChat Dialog Model** supports fluent text dialogue and a range of language generation tasks. It can call other models and tools through an extensible special-instruction specification. For example, calling BAAI's open-source **[AltDiffusion](https://github.com/FlagAI-Open/FlagAI/tree/master/examples/AltDiffusion-m18) multilingual text-to-image generation model** gives it smooth image generation capability, and combining it with BAAI's **InstructFace multi-step controllable text-to-image model** enables multi-step controllable editing of face images. | Released    | Nvidia-A100   |
| AquilaChat-33B     | SFT model, obtained by fine-tuning and reinforcement learning on Aquila-33B  | Same as above | Coming soon | Nvidia-A100   |
| AquilaCode-7B-NV   | Base model, "text-code" generation model, further pre-trained based on Aquila-7B, trained on Nvidia chips | AquilaCode-7B achieves high performance with a small dataset and parameter count, and is currently the best open-source code model that supports both Chinese and English. It is trained on code data with compliant open-source licenses after high-quality filtering. AquilaCode-7B has been trained on both Nvidia and domestic chips. | Released | Nvidia-A100  |
| AquilaCode-7B-TS   | Base model, "text-code" generation model, further pre-trained based on Aquila-7B, trained on Tianshu chips | Same as above | Released        | Tianshu-BI-V100 |


We will continue to release improved versions of the Aquila models as open source. For more details, please refer to the **[Change Log](https://huggingface.co/BAAI/AquilaChat-7B/blob/main/change_log.log)**.






## Quick Start: AquilaChat-7B (Chat model)

### 1. Inference

```python
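import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# cyg_conversation.py must be importable (for example, placed next to this script).
from cyg_conversation import covert_prompt_to_input_ids_with_history

device = "cuda:0"

tokenizer = AutoTokenizer.from_pretrained("BAAI/AquilaChat-7B")
model = AutoModelForCausalLM.from_pretrained("BAAI/AquilaChat-7B")
model.eval()
model.to(device)

# Sanity check: print the vocabulary size.
vocab = tokenizer.vocab
print(len(vocab))

# Prompt: "Give 10 reasons to visit Beijing."
text = "请给出10个要到北京旅游的理由。"

# Build the model input ids from the prompt and an (empty) chat history.
tokens = covert_prompt_to_input_ids_with_history(text, history=[], tokenizer=tokenizer, max_token=512)

# Add a batch dimension and move the ids to the GPU.
tokens = torch.tensor(tokens)[None,].to(device)

with torch.no_grad():
    # 100007 is the end-of-sequence token id used in this example.
    out = model.generate(tokens, do_sample=True, max_length=512, eos_token_id=100007)[0]
    out = tokenizer.decode(out.cpu().numpy().tolist())
    print(out)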
```
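
For decoder-only models, `generate` returns the prompt ids followed by the newly generated ids, so the decoded text above starts with the prompt itself. The following is a minimal sketch of how the reply alone could be extracted; `split_generation` is a hypothetical helper that is not part of the Aquila code, and the default end-of-sequence id is simply the `100007` used in the example above.

```python
def split_generation(output_ids, prompt_len, eos_token_id=100007):
    """Return only the newly generated token ids, cut at the first EOS.

    Hypothetical helper for illustration; not part of the Aquila repository.
    """
    # Everything before prompt_len is the prompt echoed back by generate().
    new_ids = list(output_ids[prompt_len:])
    # Drop the EOS token and anything after it, if EOS was produced.
    if eos_token_id in new_ids:
        new_ids = new_ids[: new_ids.index(eos_token_id)]
    return new_ids


# Toy ids standing in for real tokenizer output.
prompt_ids = [11, 12, 13]
output_ids = prompt_ids + [21, 22, 100007]
reply_ids = split_generation(output_ids, len(prompt_ids))
print(reply_ids)  # [21, 22]
```

With the Quick Start variables, this would be applied to the id list returned by `model.generate` before it is decoded, with `prompt_len` set to `tokens.shape[1]`.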

Inference using [NBCE](https://github.com/bojone/NBCE/tree/main) (Naive Bayes-based Context Extension)

```python
import json
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import LogitsProcessorList, TopPLogitsWarper

# Model to load; assumed here to be the same checkpoint as in the example above.
model_path = "BAAI/AquilaChat-7B"

# load tokenizer (left padding keeps the last position of every row aligned)
tokenizer = AutoTokenizer.from_pretrained(model_path)
tokenizer.padding_side = 'left'
tokenizer.pad_token = tokenizer.unk_token

# load Aquila model
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16)
device = torch.device('cuda')
model.to(device)
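# load example contexts
from cyg_conversation import default_conversation

conv = default_conversation.copy()
# code_text_2.json is expected to hold a JSON list of context strings
with open('code_text_2.json', encoding='utf-8') as f:
    contexts = json.load(f)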

# Question: "Please explain what this program does:"
question = "请解释这段程序的功能:"
batch = []
# The first batch entry is the question alone (no context)
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
batch.append(conv.get_prompt())
# The remaining entries each pair one context with the question
for context in contexts:
    conv1 = default_conversation.copy()
    conv1.append_message(conv1.roles[0], context + question)
    conv1.append_message(conv1.roles[1], None)
    batch.append(conv1.get_prompt())
print('Context length distribution:', [len(text) for text in batch])
print('Total context length:', sum(len(text) for text in batch))

# Top-P
processors = LogitsProcessorList()
processors.append(TopPLogitsWarper(0.95))

# Copied from https://github.com/bojone/NBCE/blob/main/test.py#L51-L106
@torch.inference_mode()
def generate(max_tokens):
    """Naive Bayes-based Context Extension example code
    """
    inputs = tokenizer(batch, padding='longest', return_tensors='pt').to(device)
    input_ids = inputs.input_ids
    attention_mask = inputs.attention_mask
    
    print('input_ids', input_ids.shape)
    past_key_values = None
    n = input_ids.shape[0]
    
    for i in range(max_tokens):
        # model output
        outputs = model(input_ids=input_ids,
                        attention_mask=attention_mask,
                        return_dict=True,
                        use_cache=True,
                        past_key_values=past_key_values
                       )
        past_key_values = outputs.past_key_values
        
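        # ===== NBCE core code starts =====
        beta, eta = 0.25, 0.1
        # Log-probabilities of the next token for every batch row
        logits = outputs.logits[:, -1]
        logits = logits - logits.logsumexp(dim=-1, keepdim=True)
        logits = processors(input_ids, logits)
        # Entropy of each row's (top-p filtered) prediction
        entropy = -(logits.exp() * logits.clip(-100, 0)).sum(dim=-1)
        if i > 0:
            # Favor the context chosen at the previous step
            entropy[k] -= eta
        # Pick the context row (index >= 1) with the lowest entropy
        k = entropy[1:].argmin() + 1
        logits_max = logits[k]
        logits_uncond = logits[0]  # row 0: the question without any context
        logits_merged = (1 + beta) * logits_max - beta * logits_uncond
        # Where top-p masked the context-free row, fall back to the chosen context
        logits = torch.where(logits_uncond > -100, logits_merged, logits_max)
        # ===== NBCE core code ends =====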
        
        # Build a distribution and sample from it
        # tau = 1 is standard random sampling; tau -> 0 approaches greedy search
        # For simplicity, no further top-k or top-p truncation is applied here.
        tau = 0.01
        probas = torch.nn.functional.softmax(logits[None] / tau , dim=-1)
        next_tokens = torch.multinomial(probas, num_samples=1).squeeze(1)        
        if next_tokens[0] == tokenizer.eos_token_id:
            break
            
        ret = tokenizer.batch_decode(next_tokens)
        print(ret[0], flush=True, end='')
        
        # prepare for next iteration
        input_ids = next_tokens.unsqueeze(-1).tile(n, 1)
        attention_mask = torch.cat([attention_mask, torch.ones(n, 1, dtype=torch.long, device=device)], dim=-1)        


if __name__ == '__main__':
    generate(1000)

```
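
The core step in the script above picks, among the context-conditioned predictions, the one with the lowest entropy and combines it with the context-free prediction as `(1 + beta) * logits_max - beta * logits_uncond`. The following is a minimal standard-library sketch of that selection and merge on toy scores. It is an illustration under simplifying assumptions, not the NBCE reference implementation: the function names are hypothetical, and the top-p filtering and the `eta` adjustment from the script are left out.

```python
import math


def log_softmax(scores):
    """Normalize raw scores into log-probabilities."""
    m = max(scores)
    lse = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - lse for s in scores]


def entropy(log_probs):
    """Entropy (in nats) of a distribution given as log-probabilities."""
    # Clip very small log-probabilities, as the script above does with clip(-100, 0).
    return -sum(math.exp(lp) * max(lp, -100.0) for lp in log_probs)


def nbce_merge(batch_scores, beta=0.25):
    """Merge next-token scores; row 0 is the prompt without any context."""
    log_probs = [log_softmax(row) for row in batch_scores]
    entropies = [entropy(lp) for lp in log_probs]
    # Choose the context row (index >= 1) whose prediction is most confident.
    k = min(range(1, len(log_probs)), key=lambda i: entropies[i])
    # Boost the chosen context and subtract the context-free prediction.
    merged = [(1 + beta) * c - beta * u
              for c, u in zip(log_probs[k], log_probs[0])]
    return k, merged


scores = [
    [0.0, 0.0, 0.0],  # no context: uniform over three tokens
    [0.2, 0.1, 0.0],  # context 1: almost uniform
    [4.0, 0.0, 0.0],  # context 2: confident about token 0
]
k, merged = nbce_merge(scores)
best_token = max(range(len(merged)), key=merged.__getitem__)
print(k, best_token)  # 2 0
```

Here the second context is selected because its prediction has the lowest entropy, and the merged scores still favor token 0.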

## License

The AquilaChat-7B and AquilaChat-33B open-source models are licensed under the [BAAI Aquila Model License Agreement](https://huggingface.co/BAAI/AquilaChat-7B/resolve/main/BAAI%20Aquila%20Model%20License%20Agreement.pdf).