<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->
# Generation with LLMs [[generation-with-llms]]
[[open-in-colab]]
LLMs, or Large Language Models, are the key component behind text generation. In a nutshell, they consist of large pretrained transformer models trained to predict the next word (or, more precisely, token) given some input text. Since they predict one token at a time, you need to do something more elaborate to generate new sentences other than just calling the model: you need to do autoregressive generation.
Autoregressive generation is the inference-time procedure of iteratively calling a model with its own generated outputs, given a few initial inputs. In 🤗 Transformers, this is handled by the [`~generation.GenerationMixin.generate`] method, which is available to all models with generative capabilities.
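To make the loop concrete, here is a minimal hand-rolled sketch of greedy autoregressive decoding, which `generate` automates (along with much more). `"gpt2"` is used purely as a small illustrative checkpoint:

```py
>>> import torch
>>> from transformers import AutoModelForCausalLM, AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("gpt2")
>>> model = AutoModelForCausalLM.from_pretrained("gpt2")
>>> input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids
>>> for _ in range(10):  # crude stopping condition: at most 10 new tokens
...     logits = model(input_ids).logits  # forward pass: one next-token distribution per position
...     next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedily pick the most likely token
...     if next_token.item() == tokenizer.eos_token_id:  # the model signals it is done
...         break
...     input_ids = torch.cat([input_ids, next_token], dim=-1)  # feed the output back as input
>>> print(tokenizer.decode(input_ids[0]))
```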
This tutorial will show you how to:

* Generate text with an LLM
* Avoid common pitfalls
* Take the next steps to help you get the most out of your LLM
Before you begin, make sure you have all the necessary libraries installed:
```bash
pip install transformers bitsandbytes>=0.39.0 -q
```
## Generate text [[generate-text]]

A language model trained for [causal language modeling](tasks/language_modeling) takes a sequence of text tokens as input and returns the probability distribution for the next token.
<!-- [GIF 1 -- FWD PASS] -->
<figure class="image table text-center m-0 w-full">
<video
style="max-width: 90%; margin: auto;"
autoplay loop muted playsinline
src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/assisted-generation/gif_1_1080p.mov"
></video>
<figcaption>"LLMμ μ λ°© ν¨μ€"</figcaption>
</figure>
A critical aspect of autoregressive generation with LLMs is how to select the next token from this probability distribution. Anything goes in this step, as long as you end up with a token for the next iteration: it can be as simple as selecting the most likely token from the probability distribution, or as complex as applying a dozen transformations before sampling from the resulting distribution. A minimal sketch of the two basic approaches follows the figure below.
<!-- [GIF 2 -- TEXT GENERATION] -->
<figure class="image table text-center m-0 w-full">
<video
style="max-width: 90%; margin: auto;"
autoplay loop muted playsinline
src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/assisted-generation/gif_2_1080p.mov"
></video>
<figcaption>"μκΈ°νκ· μμ±μ νλ₯ λΆν¬μμ λ€μ ν ν°μ λ°λ³΅μ μΌλ‘ μ ννμ¬ ν
μ€νΈλ₯Ό μμ±ν©λλ€."</figcaption>
</figure>
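As an illustration of the selection step, here is a minimal sketch contrasting greedy selection with sampling on a single next-token distribution; the `logits` tensor below is a stand-in for a real model output:

```py
>>> import torch
>>> logits = torch.tensor([[2.0, 1.0, 0.5, -1.0]])  # stand-in next-token logits
>>> probs = torch.softmax(logits, dim=-1)  # turn logits into a probability distribution
>>> greedy_token = torch.argmax(probs, dim=-1)  # always the most likely token
>>> sampled_token = torch.multinomial(probs, num_samples=1)  # stochastic pick, varies across calls
```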
The process depicted above is repeated iteratively until some stopping condition is reached. Ideally, the stopping condition is dictated by the model, which should learn when to output an end-of-sequence (EOS) token. If this is not the case, generation stops when some predefined maximum length is reached.
Properly setting up the token selection step and the stopping condition is essential to make your model behave as you'd expect on your task. That is why each model ships with a [`~generation.GenerationConfig`] file, which contains a good default generative parameterization and is loaded alongside the model.
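For example, you can inspect these defaults on any loaded model through its `generation_config` attribute (`"gpt2"` below is just a small illustrative checkpoint; the printed values are its defaults at the time of writing):

```py
>>> from transformers import AutoModelForCausalLM
>>> model = AutoModelForCausalLM.from_pretrained("gpt2")
>>> model.generation_config.max_length  # default length-based stopping condition
20
>>> model.generation_config.eos_token_id  # EOS token that triggers early stopping
50256
```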
Let's check out the code!
<Tip>
If you're interested in basic LLM usage, our high-level [`Pipeline`](pipeline_tutorial) interface is a great starting point. However, LLMs often require advanced features like fine-grained control of the token selection step, which is best done through [`~generation.GenerationMixin.generate`]. Autoregressive generation with LLMs is also resource-intensive, so it should be executed on a GPU for adequate throughput.
</Tip>
<!-- TODO: update example to llama 2 (or a newer popular baseline) when it becomes ungated -->
First, load the model.
```py
>>> from transformers import AutoModelForCausalLM
>>> model = AutoModelForCausalLM.from_pretrained(
... "openlm-research/open_llama_7b", device_map="auto", load_in_4bit=True
... )
```
Note the two flags in the `from_pretrained` call:

- `device_map` ensures the model is moved to your GPU(s)
- `load_in_4bit` applies [4-bit dynamic quantization](main_classes/quantization) to massively reduce the resource requirements

There are other ways to initialize a model, but this is a good baseline to begin with an LLM.
Next, preprocess your text input with a [tokenizer](tokenizer_summary).
```py
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("openlm-research/open_llama_7b")
>>> model_inputs = tokenizer(["A list of colors: red, blue"], return_tensors="pt").to("cuda")
```
The `model_inputs` variable holds the tokenized text input, as well as the attention mask. While [`~generation.GenerationMixin.generate`] does its best effort to infer the attention mask when it is not passed, we recommend passing it whenever possible for optimal results.
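To make this explicit, here is a small sketch that lists what the tokenizer returned and passes both tensors by name; it is equivalent to `model.generate(**model_inputs)`:

```py
>>> list(model_inputs.keys())  # the tokenizer returns both tensors
['input_ids', 'attention_mask']
>>> generated_ids = model.generate(
...     input_ids=model_inputs["input_ids"], attention_mask=model_inputs["attention_mask"]
... )
```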
Finally, call the [`~generation.GenerationMixin.generate`] method to get the generated tokens, and convert them back to text before printing.
```py
>>> generated_ids = model.generate(**model_inputs)
>>> tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
'A list of colors: red, blue, green, yellow, black, white, and brown'
```
And that's it! In a few lines of code, you can harness the power of an LLM.
## Common pitfalls [[common-pitfalls]]

There are many [generation strategies](generation_strategies), and sometimes the default values may not be appropriate for your use case. If your outputs aren't aligned with what you're expecting, we've created a list of the most common pitfalls and how to avoid them.
```py
>>> from transformers import AutoModelForCausalLM, AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("openlm-research/open_llama_7b")
>>> tokenizer.pad_token = tokenizer.eos_token # Llama has no pad token by default
>>> model = AutoModelForCausalLM.from_pretrained(
... "openlm-research/open_llama_7b", device_map="auto", load_in_4bit=True
... )
```
### Generated output is too short/long [[generated-output-is-too-shortlong]]

If not specified in the [`~generation.GenerationConfig`] file, `generate` returns up to 20 tokens by default. We highly recommend manually setting `max_new_tokens` in your `generate` call to control the maximum number of new tokens it can return. Keep in mind LLMs (more precisely, [decoder-only models](https://huggingface.co/learn/nlp-course/chapter1/6?fw=pt)) also return the input prompt as part of the output.
```py
>>> model_inputs = tokenizer(["A sequence of numbers: 1, 2"], return_tensors="pt").to("cuda")
>>> # By default, the output will contain up to 20 tokens
>>> generated_ids = model.generate(**model_inputs)
>>> tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
'A sequence of numbers: 1, 2, 3, 4, 5'
>>> # Setting `max_new_tokens` allows you to control the maximum length
>>> generated_ids = model.generate(**model_inputs, max_new_tokens=50)
>>> tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
'A sequence of numbers: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,'
```
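If you only want the newly generated text without the prompt, a common pattern is to slice off the prompt length before decoding. A sketch reusing the tensors from the snippet above:

```py
>>> # Keep only the tokens that come after the input prompt
>>> input_length = model_inputs.input_ids.shape[1]
>>> new_text = tokenizer.batch_decode(generated_ids[:, input_length:], skip_special_tokens=True)[0]
```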
### Incorrect generation mode [[incorrect-generation-mode]]

By default, and unless specified in the [`~generation.GenerationConfig`] file, `generate` selects the most likely token at each iteration (greedy decoding). Depending on your task, this may be undesirable: creative tasks like chatbots or writing an essay benefit from sampling, while input-grounded tasks like audio transcription or translation benefit from greedy decoding. Enable sampling with `do_sample=True`; you can learn more about this topic in this [blog post](https://huggingface.co/blog/how-to-generate).
```py
>>> # Set seed for reproducibility -- you don't need this unless you want full reproducibility
>>> from transformers import set_seed
>>> set_seed(0)
>>> model_inputs = tokenizer(["I am a cat."], return_tensors="pt").to("cuda")
>>> # LLM + greedy decoding = repetitive, boring output
>>> generated_ids = model.generate(**model_inputs)
>>> tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
'I am a cat. I am a cat. I am a cat. I am a cat'
>>> # With sampling, the output becomes more creative!
>>> generated_ids = model.generate(**model_inputs, do_sample=True)
>>> tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
'I am a cat.\nI just need to be. I am always.\nEvery time'
```
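Sampling can also be tuned further. For instance, the standard `generate` arguments `temperature` and `top_p` reshape the distribution before sampling; the values below are purely illustrative, not recommendations:

```py
>>> # Lower temperature sharpens the distribution; top_p=0.9 samples only from the
>>> # smallest set of tokens whose cumulative probability exceeds 0.9
>>> generated_ids = model.generate(
...     **model_inputs, do_sample=True, temperature=0.7, top_p=0.9, max_new_tokens=20
... )
>>> new_text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
```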
### Wrong padding side [[wrong-padding-side]]

LLMs are [decoder-only](https://huggingface.co/learn/nlp-course/chapter1/6?fw=pt) architectures, meaning they continue to iterate on your input prompt. If your inputs do not have the same length, they need to be padded. Since LLMs are not trained to continue from pad tokens, your inputs need to be left-padded. Make sure you also don't forget to pass the attention mask to `generate`!
```py
>>> # The tokenizer initialized above has right-padding active by default: the 1st sequence,
>>> # which is shorter, has padding on the right side. Generation fails.
>>> model_inputs = tokenizer(
... ["1, 2, 3", "A, B, C, D, E"], padding=True, return_tensors="pt"
... ).to("cuda")
>>> generated_ids = model.generate(**model_inputs)
>>> tokenizer.batch_decode(generated_ids[0], skip_special_tokens=True)[0]
''
>>> # With left-padding, it works as expected!
>>> tokenizer = AutoTokenizer.from_pretrained("openlm-research/open_llama_7b", padding_side="left")
>>> tokenizer.pad_token = tokenizer.eos_token # Llama has no pad token by default
>>> model_inputs = tokenizer(
... ["1, 2, 3", "A, B, C, D, E"], padding=True, return_tensors="pt"
... ).to("cuda")
>>> generated_ids = model.generate(**model_inputs)
>>> tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
'1, 2, 3, 4, 5, 6,'
```
<!-- TODO: when the prompting guide is ready, mention the importance of setting the right prompt in this section -->
## Further resources [[further-resources]]

While the autoregressive generation process is relatively straightforward, making the most out of your LLM can be a challenging endeavor because there are many moving parts. For your next steps to help you dive deeper into LLM usage and understanding:
<!-- TODO: complete with new guides -->
### Advanced generate usage [[advanced-generate-usage]]

1. [Guide](generation_strategies) on how to control different generation methods, how to set up the generation configuration file, and how to stream the output;
2. API reference on [`~generation.GenerationConfig`], [`~generation.GenerationMixin.generate`], and [generate-related classes](internal/generation_utils).
### LLM leaderboards [[llm-leaderboards]]

1. [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard), which focuses on the quality of open-source models;
2. [Open LLM-Perf Leaderboard](https://huggingface.co/spaces/optimum/llm-perf-leaderboard), which focuses on LLM throughput.
### Latency and throughput [[latency-and-throughput]]

1. [Guide](main_classes/quantization) on dynamic quantization, which shows you how to drastically reduce your memory requirements.
### Related libraries [[related-libraries]]

1. [`text-generation-inference`](https://github.com/huggingface/text-generation-inference), a production-ready server for LLMs;
2. [`optimum`](https://github.com/huggingface/optimum), an extension of 🤗 Transformers that optimizes for specific hardware devices.