---
base_model: https://huggingface.co/beomi/llama-2-ko-70b
inference: false
language:
- en
- ko
model_name: Llama-2-Ko 70B
model_type: llama
pipeline_tag: text-generation
quantized_by: kuotient
tags:
- facebook
- meta
- pytorch
- llama
- llama-2
- kollama
- llama-2-ko
- gptq
license: cc-by-nc-sa-4.0
---
# WIP

## Llama-2-Ko-GPTQ

<!-- README_GPTQ.md-provided-files start -->
## Provided files and GPTQ parameters

Multiple quantisation parameters are provided to allow you to choose the best one for your hardware and requirements.

Each separate quant is in a different branch. See below for instructions on fetching from different branches.

All recent GPTQ files are made with AutoGPTQ, and all files in non-main branches are made with AutoGPTQ. Files in the `main` branch which were uploaded before August 2023 were made with GPTQ-for-LLaMa.

<details>
<summary>Explanation of GPTQ parameters</summary>

- Bits: The bit size of the quantised model.
- GS: GPTQ group size. Higher numbers use less VRAM, but have lower quantisation accuracy. "None" is the lowest possible value.
- Act Order: True or False. Also known as `desc_act`. True results in better quantisation accuracy. Some GPTQ clients have had issues with models that use Act Order plus Group Size, but this is generally resolved now.
- Damp %: A GPTQ parameter that affects how samples are processed for quantisation. 0.01 is the default, but 0.1 results in slightly better accuracy.
- GPTQ dataset: The dataset used for quantisation. Using a dataset more appropriate to the model's training can improve quantisation accuracy. Note that the GPTQ dataset is not the same as the dataset used to train the model - please refer to the original model repo for details of the training dataset(s).
- Sequence Length: The length of the dataset sequences used for quantisation. Ideally this is the same as the model sequence length. For some very long sequence models (16K+), a lower sequence length may have to be used. Note that a lower sequence length does not limit the sequence length of the quantised model. It only impacts the quantisation accuracy on longer inference sequences.
- ExLlama Compatibility: Whether this file can be loaded with ExLlama, which currently only supports Llama models in 4-bit.

</details>
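
Once the quants are uploaded, a specific branch can be fetched by passing `revision=` to `from_pretrained` (loading GPTQ weights through `transformers` also requires the `optimum` and `auto-gptq` packages). This is only a minimal sketch: the repository id and branch name below are placeholders, since the actual quant branches are not yet listed in this WIP repo.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder ids - replace with this repo's id and a real quant branch once published.
repo_id = "kuotient/llama-2-ko-70b-GPTQ"
branch = "gptq-4bit-128g-actorder_True"

tokenizer = AutoTokenizer.from_pretrained(repo_id, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    revision=branch,      # each GPTQ variant lives in its own branch
    device_map="auto",    # spread the quantised weights across available GPUs
)
```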

# Original model card: Llama-2-Ko 70B

> 🚧 Note: this repo is under construction 🚧

# **Llama-2-Ko** 🦙🇰🇷

Llama-2-Ko serves as an advanced iteration of Llama 2, benefiting from an expanded vocabulary and the inclusion of a Korean corpus in its further pretraining. Like its predecessor, Llama-2-Ko is a family of generative text models ranging from 7 billion to 70 billion parameters. This repository focuses on the **70B** pretrained version, which is tailored to fit the Hugging Face Transformers format. For access to the other models, feel free to consult the index provided below.

## Model Details

**Model Developers** Junbum Lee (Beomi)

**Variations** Llama-2-Ko will come in a range of parameter sizes (7B, 13B, and 70B) as well as pretrained and fine-tuned variations.

**Input** Models input text only.

**Output** Models generate text only.

## Usage

**Use with 8bit inference**

- Requires > 74GB VRAM (compatible with 4x RTX 3090/4090, 1x A100/H100 80GB, or 2x RTX 6000 Ada/A6000 48GB)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_8bit = AutoModelForCausalLM.from_pretrained(
    "beomi/llama-2-ko-70b",
    load_in_8bit=True,
    device_map="auto",
)
tk = AutoTokenizer.from_pretrained("beomi/llama-2-ko-70b")
pipe = pipeline("text-generation", model=model_8bit, tokenizer=tk)

def gen(x):
    # NOTE: this model is NOT fine-tuned on an instruction dataset,
    # so this prompt format is not necessarily optimal.
    generated = pipe(
        f"### Title: {x}\n\n### Contents:",
        max_new_tokens=300,
        top_p=0.95,
        do_sample=True,
    )[0]["generated_text"]
    print(len(generated))
    print(generated)
```
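
For example, the helper can be called with any short title-like phrase (the title below is only an illustration, not a prompt from the original card):

```python
gen("서울의 가을 날씨")  # prints the length and text of a sampled continuation
```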

**Use with bf16 inference**

- Requires > 150GB VRAM (compatible with 8x RTX 3090/4090, 2x A100/H100 80GB, or 4x RTX 6000 Ada/A6000 48GB)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model = AutoModelForCausalLM.from_pretrained(
    "beomi/llama-2-ko-70b",
    torch_dtype=torch.bfloat16,  # load weights in bf16 rather than the fp32 default
    device_map="auto",
)
tk = AutoTokenizer.from_pretrained("beomi/llama-2-ko-70b")
pipe = pipeline("text-generation", model=model, tokenizer=tk)

def gen(x):
    # NOTE: this model is NOT fine-tuned on an instruction dataset,
    # so this prompt format is not necessarily optimal.
    generated = pipe(
        f"### Title: {x}\n\n### Contents:",
        max_new_tokens=300,
        top_p=0.95,
        do_sample=True,
    )[0]["generated_text"]
    print(len(generated))
    print(generated)
```

**Model Architecture**

Llama-2-Ko is an auto-regressive language model that uses an optimized transformer architecture based on Llama-2.

||Training Data|Params|Content Length|GQA|Tokens|LR|
|---|---|---|---|---|---|---|
|Llama-2-Ko 70B|*A new mix of Korean online data*|70B|4k|✅|>20B|1e<sup>-5</sup>|

*Plan to train up to 300B tokens

**Vocab Expansion**

| Model Name | Vocabulary Size | Description |
| --- | --- | --- |
| Original Llama-2 | 32000 | SentencePiece BPE |
| **Expanded Llama-2-Ko** | 46592 | SentencePiece BPE; added Korean vocab and merges |

*Note: Llama-2-Ko 70B uses a vocabulary size of `46592`, not `46336` (as in 7B); an updated 7B model will be released soon.

**Tokenizing "안녕하세요, 오늘은 날씨가 좋네요. ㅎㅎ"**

| Model | Tokens |
| --- | --- |
| Llama-2 | `['▁', '안', '<0xEB>', '<0x85>', '<0x95>', '하', '세', '요', ',', '▁', '오', '<0xEB>', '<0x8A>', '<0x98>', '은', '▁', '<0xEB>', '<0x82>', '<0xA0>', '씨', '가', '▁', '<0xEC>', '<0xA2>', '<0x8B>', '<0xEB>', '<0x84>', '<0xA4>', '요', '.', '▁', '<0xE3>', '<0x85>', '<0x8E>', '<0xE3>', '<0x85>', '<0x8E>']` |
| Llama-2-Ko 70B | `['▁안녕', '하세요', ',', '▁오늘은', '▁날', '씨가', '▁좋네요', '.', '▁', 'ㅎ', 'ㅎ']` |

**Tokenizing "Llama 2: Open Foundation and Fine-Tuned Chat Models"**

| Model | Tokens |
| --- | --- |
| Llama-2 | `['▁L', 'l', 'ama', '▁', '2', ':', '▁Open', '▁Foundation', '▁and', '▁Fine', '-', 'T', 'un', 'ed', '▁Ch', 'at', '▁Mod', 'els']` |
| Llama-2-Ko 70B | `['▁L', 'l', 'ama', '▁', '2', ':', '▁Open', '▁Foundation', '▁and', '▁Fine', '-', 'T', 'un', 'ed', '▁Ch', 'at', '▁Mod', 'els']` |
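
The comparison above can be reproduced by loading both tokenizers side by side. This is only a sketch: it assumes you have access to an original Llama-2 checkpoint (the `meta-llama` repos are gated behind Meta's license agreement).

```python
from transformers import AutoTokenizer

text = "안녕하세요, 오늘은 날씨가 좋네요. ㅎㅎ"

# Any original Llama-2 checkpoint's tokenizer works here.
llama2_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-70b-hf")
llama2_ko_tok = AutoTokenizer.from_pretrained("beomi/llama-2-ko-70b", use_fast=True)

print(llama2_tok.tokenize(text))     # byte-fallback pieces for most Korean characters
print(llama2_ko_tok.tokenize(text))  # word-level Korean pieces from the expanded vocab
print(len(llama2_tok), len(llama2_ko_tok))  # 32000 vs 46592 vocabulary entries
```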

# **Model Benchmark**

## LM Eval Harness - Korean (polyglot branch)

- Used EleutherAI's lm-evaluation-harness: https://github.com/EleutherAI/lm-evaluation-harness/tree/polyglot

### TBD

## Note for oobabooga/text-generation-webui

Remove the `ValueError` from the `except` clause in the `load_tokenizer` function (line 109 or nearby) of `modules/models.py`, so the fallback also catches other tokenizer-loading errors:

```diff
diff --git a/modules/models.py b/modules/models.py
index 232d5fa..de5b7a0 100644
--- a/modules/models.py
+++ b/modules/models.py
@@ -106,7 +106,7 @@ def load_tokenizer(model_name, model):
             trust_remote_code=shared.args.trust_remote_code,
             use_fast=False
         )
-    except ValueError:
+    except:
         tokenizer = AutoTokenizer.from_pretrained(
             path_to_model,
             trust_remote_code=shared.args.trust_remote_code,
```

Since Llama-2-Ko uses a fast tokenizer from the HF `tokenizers` library rather than the `sentencepiece` package, the `use_fast=True` option is required when initializing the tokenizer.
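
For example (a minimal sketch; `AutoTokenizer` defaults to the fast implementation, but passing the flag makes the requirement explicit):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("beomi/llama-2-ko-70b", use_fast=True)
```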

Apple Silicon does not support BF16 computing; use the CPU instead. (BF16 is supported when using an NVIDIA GPU.)

## LICENSE

- Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License, under the LLAMA 2 COMMUNITY LICENSE AGREEMENT
- Full license available at: [https://huggingface.co/beomi/llama-2-ko-70b/blob/main/LICENSE](https://huggingface.co/beomi/llama-2-ko-70b/blob/main/LICENSE)
- For commercial usage, contact the author.

## Citation

```
@misc {l._junbum_2023,
    author    = { {L. Junbum} },
    title     = { llama-2-ko-70b },
    year      = 2023,
    url       = { https://huggingface.co/beomi/llama-2-ko-70b },
    doi       = { 10.57967/hf/1130 },
    publisher = { Hugging Face }
}
```