<!--Copyright 2022 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# Padding and truncation[[padding-and-truncation]]

Batched inputs often differ in length, so they cannot be converted directly into fixed-size tensors. Padding and truncation are strategies for solving this problem: they make it possible to build rectangular tensors from batches of varying lengths. Padding adds a special **padding token** so that shorter sequences have the same length as either the longest sequence in the batch or the maximum length accepted by the model. Truncation works in the other direction, cutting long sequences down so that all sequences end up the same length.

In most cases, padding to the length of the longest sequence in the batch and truncating to the maximum length the model can accept works well. If you need more control, the API supports additional strategies. The three arguments you need are `padding`, `truncation`, and `max_length`.

The `padding` argument controls padding. It can be a boolean or a string:

- `True` or `'longest'`: pad to the longest sequence in the batch (no padding is applied if you provide only a single sequence).
- `'max_length'`: pad to the length specified by the `max_length` argument, or to the maximum length accepted by the model if `max_length` is not provided (`max_length=None`). Padding is still applied if you provide only a single sequence.
- `False` or `'do_not_pad'`: no padding is applied. This is the default behavior.
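
The three `padding` strategies can be illustrated without a real tokenizer. The sketch below is not the library's implementation: `pad_batch`, `PAD_ID`, and `MODEL_MAX_LENGTH` are hypothetical names introduced here, and the code assumes right-side padding on plain lists of token ids.

```python
from typing import List, Optional, Union

PAD_ID = 0            # hypothetical padding token id
MODEL_MAX_LENGTH = 8  # hypothetical stand-in for the model's maximum input length


def pad_batch(
    batch: List[List[int]],
    padding: Union[bool, str] = False,
    max_length: Optional[int] = None,
) -> List[List[int]]:
    """Illustrate the `padding` strategies on lists of token ids."""
    if padding is False or padding == "do_not_pad":
        return [list(seq) for seq in batch]
    if padding is True or padding == "longest":
        # Target is the longest sequence in the batch.
        target = max(len(seq) for seq in batch)
    elif padding == "max_length":
        # Fall back to the (assumed) model limit when max_length is None.
        target = max_length if max_length is not None else MODEL_MAX_LENGTH
    else:
        raise ValueError(f"unknown padding strategy: {padding!r}")
    # Pad on the right; sequences already at or beyond the target are unchanged.
    return [list(seq) + [PAD_ID] * max(0, target - len(seq)) for seq in batch]


batch = [[5, 6, 7], [8, 9]]
print(pad_batch(batch, padding=True))                        # [[5, 6, 7], [8, 9, 0]]
print(pad_batch(batch, padding="max_length", max_length=5))  # [[5, 6, 7, 0, 0], [8, 9, 0, 0, 0]]
print(pad_batch([[5, 6, 7]], padding=True))                  # [[5, 6, 7]]
```

The last call shows why a single sequence is left untouched by `'longest'`: it is already the longest sequence in its batch.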

The `truncation` argument controls truncation. It can be a boolean or a string:

- `True` or `'longest_first'`: truncate to the maximum length specified by the `max_length` argument, or to the maximum length accepted by the model if `max_length` is not provided (`max_length=None`). Tokens are removed one at a time from the longest sequence in the pair until the proper length is reached.
- `'only_second'`: truncate to the maximum length specified by the `max_length` argument, or to the maximum length accepted by the model if `max_length` is not provided (`max_length=None`). If a pair of sequences (or a batch of pairs) is provided, only the second sequence of the pair is truncated.
- `'only_first'`: truncate to the maximum length specified by the `max_length` argument, or to the maximum length accepted by the model if `max_length` is not provided (`max_length=None`). If a pair of sequences (or a batch of pairs) is provided, only the first sequence of the pair is truncated.
- `False` or `'do_not_truncate'`: no truncation is applied. This is the default behavior.
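
To make the difference between the pair strategies concrete, here is a minimal sketch on plain lists of token ids. It is an illustration only: `truncate_pair` is a hypothetical helper, it ignores the special tokens a real tokenizer adds, it assumes tokens are dropped from the right, and its tie-breaking rule (drop from the second sequence when both are equally long) is an assumption of this sketch.

```python
from typing import List, Tuple, Union


def truncate_pair(
    first: List[int],
    second: List[int],
    max_length: int,
    truncation: Union[bool, str] = True,
) -> Tuple[List[int], List[int]]:
    """Illustrate the `truncation` strategies on a pair of token-id lists."""
    first, second = list(first), list(second)
    if truncation is False or truncation == "do_not_truncate":
        return first, second

    excess = len(first) + len(second) - max_length
    if excess <= 0:
        return first, second  # already short enough

    if truncation is True or truncation == "longest_first":
        for _ in range(excess):
            # Remove one token from whichever sequence is currently longer.
            if len(first) > len(second):
                first.pop()
            else:
                second.pop()
    elif truncation in ("only_first", "only_second"):
        target = first if truncation == "only_first" else second
        if excess > len(target):
            raise ValueError("cannot reach max_length by truncating only this sequence")
        del target[len(target) - excess:]
    else:
        raise ValueError(f"unknown truncation strategy: {truncation!r}")
    return first, second


first, second = [1, 2, 3, 4, 5, 6], [7, 8, 9, 10]
print(truncate_pair(first, second, 6, "longest_first"))  # ([1, 2, 3], [7, 8, 9])
print(truncate_pair(first, second, 6, "only_first"))     # ([1, 2], [7, 8, 9, 10])
print(truncate_pair(first, second, 6, "only_second"))    # ([1, 2, 3, 4, 5, 6], [])
```

With the same inputs and the same `max_length`, the three strategies keep very different tokens, which is why the choice matters for tasks such as question answering where one side of the pair must stay intact.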

The `max_length` argument controls the length used for padding and truncation. It can be an integer or `None`; when it is `None`, it defaults to the maximum length the model can accept. If the model has no specific maximum input length, truncation or padding to `max_length` is deactivated.

The following table summarizes the recommended way to set up padding and truncation.

If you use pairs of sequences as input, you can replace `truncation=True` in any of the following examples with a `STRATEGY` selected from `['only_first', 'only_second', 'longest_first']`, for example `truncation='only_second'` or `truncation='longest_first'`, to control how both sequences in the pair are truncated, as described above.

| Truncation                           | Padding                           | Instruction                                                                                 |
|--------------------------------------|-----------------------------------|---------------------------------------------------------------------------------------------|
| No truncation                        | No padding                        | `tokenizer(batch_sentences)`                                                                |
|                                      | Padding to max sequence in batch  | `tokenizer(batch_sentences, padding=True)` or                                               |
|                                      |                                   | `tokenizer(batch_sentences, padding='longest')`                                             |
|                                      | Padding to max model input length | `tokenizer(batch_sentences, padding='max_length')`                                          |
|                                      | Padding to specific length        | `tokenizer(batch_sentences, padding='max_length', max_length=42)`                           |
|                                      | Padding to a multiple of a value  | `tokenizer(batch_sentences, padding=True, pad_to_multiple_of=8)`                            |
| Truncation to max model input length | No padding                        | `tokenizer(batch_sentences, truncation=True)` or                                            |
|                                      |                                   | `tokenizer(batch_sentences, truncation=STRATEGY)`                                           |
|                                      | Padding to max sequence in batch  | `tokenizer(batch_sentences, padding=True, truncation=True)` or                              |
|                                      |                                   | `tokenizer(batch_sentences, padding=True, truncation=STRATEGY)`                             |
|                                      | Padding to max model input length | `tokenizer(batch_sentences, padding='max_length', truncation=True)` or                      |
|                                      |                                   | `tokenizer(batch_sentences, padding='max_length', truncation=STRATEGY)`                     |
|                                      | Padding to specific length        | Not possible                                                                                |
| Truncation to specific length        | No padding                        | `tokenizer(batch_sentences, truncation=True, max_length=42)` or                             |
|                                      |                                   | `tokenizer(batch_sentences, truncation=STRATEGY, max_length=42)`                            |
|                                      | Padding to max sequence in batch  | `tokenizer(batch_sentences, padding=True, truncation=True, max_length=42)` or               |
|                                      |                                   | `tokenizer(batch_sentences, padding=True, truncation=STRATEGY, max_length=42)`              |
|                                      | Padding to max model input length | Not possible                                                                                |
|                                      | Padding to specific length        | `tokenizer(batch_sentences, padding='max_length', truncation=True, max_length=42)` or       |
|                                      |                                   | `tokenizer(batch_sentences, padding='max_length', truncation=STRATEGY, max_length=42)`      |