---
base_model: https://huggingface.co/beomi/llama-2-ko-70b
inference: false
language:
- en
- ko
model_name: Llama 2 Ko 70B
model_type: llama
pipeline_tag: text-generation
quantized_by: kuotient
tags:
- facebook
- meta
- pytorch
- llama
- llama-2
- kollama
- llama-2-ko
- gptq
license: cc-by-nc-sa-4.0
---

# WIP

## Llama-2-Ko-GPTQ

<!-- README_GPTQ.md-provided-files start -->
## Provided files and GPTQ parameters

Multiple quantisation parameters are provided to allow you to choose the best one for your hardware and requirements.

Each separate quant is in a different branch. See below for instructions on fetching from different branches.

All recent GPTQ files, and all files in non-main branches, are made with AutoGPTQ. Files in the `main` branch which were uploaded before August 2023 were made with GPTQ-for-LLaMa.

<details>
<summary>Explanation of GPTQ parameters</summary>

- Bits: The bit size of the quantised model.
- GS: GPTQ group size. Higher numbers use less VRAM, but have lower quantisation accuracy. "None" is the lowest possible value.
- Act Order: True or False. Also known as `desc_act`. True results in better quantisation accuracy. Some GPTQ clients have had issues with models that use Act Order plus Group Size, but this is generally resolved now.
- Damp %: A GPTQ parameter that affects how samples are processed for quantisation. 0.01 is default, but 0.1 results in slightly better accuracy.
- GPTQ dataset: The dataset used for quantisation. Using a dataset more appropriate to the model's training can improve quantisation accuracy. Note that the GPTQ dataset is not the same as the dataset used to train the model - please refer to the original model repo for details of the training dataset(s).
- Sequence Length: The length of the dataset sequences used for quantisation. Ideally this is the same as the model sequence length. For some very long sequence models (16+K), a lower sequence length may have to be used. Note that a lower sequence length does not limit the sequence length of the quantised model. It only impacts the quantisation accuracy on longer inference sequences.
- ExLlama Compatibility: Whether this file can be loaded with ExLlama, which currently only supports Llama models in 4-bit.

</details>
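
As a minimal sketch of fetching a specific quant and loading it, assuming a recent `transformers` with `optimum` and `auto-gptq` installed (the repository id and branch name below are placeholders for this repo's actual id and the branch of the quant you want):

```python
# Minimal sketch (not part of the original card): load a GPTQ quant from a chosen branch.
# Requires: pip install transformers optimum auto-gptq
# NOTE: repo_id and revision are placeholders; substitute this repository's id
# and the branch of the quant you want.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "kuotient/llama-2-ko-70b-GPTQ"  # placeholder repo id
revision = "main"                         # or a non-main quant branch

tokenizer = AutoTokenizer.from_pretrained(repo_id, revision=revision, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    revision=revision,
    device_map="auto",  # the GPTQ quantisation config is read from the repo itself
)
```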

# Original model card: Llama-2-Ko 70B

> 🚧 Note: this repo is under construction 🚧

# **Llama-2-Ko** 🦙🇰🇷

Llama-2-Ko serves as an advanced iteration of Llama 2, benefiting from an expanded vocabulary and the inclusion of a Korean corpus in its further pretraining. Just like its predecessor, Llama-2-Ko operates within the broad range of generative text models that stretch from 7 billion to 70 billion parameters. This repository focuses on the **70B** pretrained version, which is tailored to fit the Hugging Face Transformers format. For access to the other models, feel free to consult the index provided below.

## Model Details

**Model Developers** Junbum Lee (Beomi)

**Variations** Llama-2-Ko will come in a range of parameter sizes (7B, 13B, and 70B) as well as pretrained and fine-tuned variations.

**Input** Models input text only.

**Output** Models generate text only.

## Usage

**Use with 8-bit inference**

- Requires more than 74GB of VRAM (compatible with 4x RTX 3090/4090, 1x A100/H100 80GB, or 2x RTX 6000 Ada/A6000 48GB)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Load the 70B model in 8-bit (requires the bitsandbytes package).
model_8bit = AutoModelForCausalLM.from_pretrained(
    "beomi/llama-2-ko-70b",
    load_in_8bit=True,
    device_map="auto",
)
tk = AutoTokenizer.from_pretrained('beomi/llama-2-ko-70b')
pipe = pipeline('text-generation', model=model_8bit, tokenizer=tk)

def gen(x):
    # NOTE: this model is NOT fine-tuned on an instruction dataset, so this prompt is not optimal.
    gended = pipe(
        f"### Title: {x}\n\n### Contents:",
        max_new_tokens=300,
        top_p=0.95,
        do_sample=True,
    )[0]['generated_text']
    print(len(gended))
    print(gended)
```

**Use with bf16 inference**

- Requires more than 150GB of VRAM (compatible with 8x RTX 3090/4090, 2x A100/H100 80GB, or 4x RTX 6000 Ada/A6000 48GB)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model = AutoModelForCausalLM.from_pretrained(
    "beomi/llama-2-ko-70b",
    torch_dtype=torch.bfloat16,  # load the weights in bf16 rather than the fp32 default
    device_map="auto",
)
tk = AutoTokenizer.from_pretrained('beomi/llama-2-ko-70b')
pipe = pipeline('text-generation', model=model, tokenizer=tk)

def gen(x):
    # NOTE: this model is NOT fine-tuned on an instruction dataset, so this prompt is not optimal.
    gended = pipe(
        f"### Title: {x}\n\n### Contents:",
        max_new_tokens=300,
        top_p=0.95,
        do_sample=True,
    )[0]['generated_text']
    print(len(gended))
    print(gended)
```

**Model Architecture**

Llama-2-Ko is an auto-regressive language model that uses an optimized transformer architecture based on Llama-2.

||Training Data|Params|Content Length|GQA|Tokens|LR|
|---|---|---|---|---|---|---|
|Llama-2-Ko 70B|*A new mix of Korean online data*|70B|4k|✅|>20B|1e<sup>-5</sup>|

\*Plan to train up to 300B tokens.

**Vocab Expansion**

| Model Name | Vocabulary Size | Description |
| --- | --- | --- |
| Original Llama-2 | 32000 | Sentencepiece BPE |
| **Expanded Llama-2-Ko** | 46592 | Sentencepiece BPE. Added Korean vocab and merges |

\*Note: Llama-2-Ko 70B uses a vocabulary size of `46592`, not the `46336` used by the 7B model; an updated 7B model will follow soon.

**Tokenizing "안녕하세요, 오늘은 날씨가 좋네요. ㅎㅎ"**

| Model | Tokens |
| --- | --- |
| Llama-2 | `['▁', '안', '<0xEB>', '<0x85>', '<0x95>', '하', '세', '요', ',', '▁', '오', '<0xEB>', '<0x8A>', '<0x98>', '은', '▁', '<0xEB>', '<0x82>', '<0xA0>', '씨', '가', '▁', '<0xEC>', '<0xA2>', '<0x8B>', '<0xEB>', '<0x84>', '<0xA4>', '요', '.', '▁', '<0xE3>', '<0x85>', '<0x8E>', '<0xE3>', '<0x85>', '<0x8E>']` |
| Llama-2-Ko 70B | `['▁안녕', '하세요', ',', '▁오늘은', '▁날', '씨가', '▁좋네요', '.', '▁', 'ㅎ', 'ㅎ']` |

**Tokenizing "Llama 2: Open Foundation and Fine-Tuned Chat Models"**

| Model | Tokens |
| --- | --- |
| Llama-2 | `['▁L', 'l', 'ama', '▁', '2', ':', '▁Open', '▁Foundation', '▁and', '▁Fine', '-', 'T', 'un', 'ed', '▁Ch', 'at', '▁Mod', 'els']` |
| Llama-2-Ko 70B | `['▁L', 'l', 'ama', '▁', '2', ':', '▁Open', '▁Foundation', '▁and', '▁Fine', '-', 'T', 'un', 'ed', '▁Ch', 'at', '▁Mod', 'els']` |
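
The comparison above can be reproduced with a short script; a minimal sketch (not part of the original card), assuming access to `beomi/llama-2-ko-70b` and to the gated `meta-llama/Llama-2-70b-hf` repository:

```python
# Minimal sketch: reproduce the tokenization comparison above.
from transformers import AutoTokenizer

ko_tok = AutoTokenizer.from_pretrained("beomi/llama-2-ko-70b", use_fast=True)
base_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-70b-hf")  # gated repo; requires approved access

for text in ["안녕하세요, 오늘은 날씨가 좋네요. ㅎㅎ",
             "Llama 2: Open Foundation and Fine-Tuned Chat Models"]:
    print("Llama-2    :", base_tok.tokenize(text))
    print("Llama-2-Ko :", ko_tok.tokenize(text))
```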

# **Model Benchmark**

## LM Eval Harness - Korean (polyglot branch)

- Used EleutherAI's lm-evaluation-harness (polyglot branch): https://github.com/EleutherAI/lm-evaluation-harness/tree/polyglot

### TBD

## Note for oobabooga/text-generation-webui

Remove the `ValueError` catch in the `load_tokenizer` function (around line 109) of `modules/models.py`:

```diff
diff --git a/modules/models.py b/modules/models.py
index 232d5fa..de5b7a0 100644
--- a/modules/models.py
+++ b/modules/models.py
@@ -106,7 +106,7 @@ def load_tokenizer(model_name, model):
             trust_remote_code=shared.args.trust_remote_code,
             use_fast=False
         )
-    except ValueError:
+    except:
         tokenizer = AutoTokenizer.from_pretrained(
             path_to_model,
             trust_remote_code=shared.args.trust_remote_code,
```

Since Llama-2-Ko uses the fast tokenizer provided by the HF `tokenizers` library rather than the `sentencepiece` package, the `use_fast=True` option is required when initializing the tokenizer.
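
As a minimal illustration (not from the original card; note that `use_fast=True` is already the default for `AutoTokenizer`, so the option mainly matters for tools that explicitly pass `use_fast=False`):

```python
# Illustration only: load the tokenizer as a fast (HF `tokenizers`) tokenizer.
# Passing use_fast=False would try to build a sentencepiece-based slow tokenizer,
# which this repo does not provide.
from transformers import AutoTokenizer

tk = AutoTokenizer.from_pretrained("beomi/llama-2-ko-70b", use_fast=True)
print(tk.is_fast)  # expected: True
```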

Apple Silicon does not support BF16 computing; use CPU instead. (BF16 is supported when using an NVIDIA GPU.)

## LICENSE

- Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License, under the LLAMA 2 COMMUNITY LICENSE AGREEMENT
- Full license available at: [https://huggingface.co/beomi/llama-2-ko-70b/blob/main/LICENSE](https://huggingface.co/beomi/llama-2-ko-70b/blob/main/LICENSE)
- For commercial usage, contact the author.

## Citation

```
@misc {l._junbum_2023,
    author    = { {L. Junbum} },
    title     = { llama-2-ko-70b },
    year      = 2023,
    url       = { https://huggingface.co/beomi/llama-2-ko-70b },
    doi       = { 10.57967/hf/1130 },
    publisher = { Hugging Face }
}
```