# KoUL2

- KoUL2 is a UL2 (Unifying Language Learning Paradigms) model trained on the Modu Corpus (모두의말뭉치) plus other Korean text data released on AI Hub.
- It has 279,526,656 (280M) parameters and uses an encoder-decoder architecture.
- It was trained with the [lassl](https://github.com/lassl/lassl) open-source project.
- Only pretraining has been performed, so you can inspect UL2's denoising behavior as shown below.

```py
from transformers import AutoTokenizer, T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("DaehanKim/KoUL2")
tokenizer = AutoTokenizer.from_pretrained("DaehanKim/KoUL2")

# Try each UL2 mode prefix ([NLU] / [NLG] / [S2S]) on the same masked sentence.
# The sentence roughly reads: "Some apartments' asking prices [new_id_27],
# confirming [new_id_26] caused by the economic downturn."
for prefix_token in ("[NLU]", "[NLG]", "[S2S]"):
    input_string = f"{prefix_token}어떤 아파트는 호가가 [new_id_27]는 등 경기 침체로 인한 [new_id_26]를 확인할 수 있었습니다.</s>"
    inputs = tokenizer(input_string, return_tensors="pt", add_special_tokens=False)
    # Prime the decoder with the first sentinel token.
    decoder_inputs = tokenizer("<pad>[new_id_27]", return_tensors="pt", add_special_tokens=False)
    outputs = model.generate(
        input_ids=inputs.input_ids,
        decoder_input_ids=decoder_inputs.input_ids,
        num_beams=10,
        num_return_sequences=5,
    )
    print(tokenizer.batch_decode(outputs))
```

```
# output
['<pad>[new_id_27] 고공행진을[new_id_26] 아파트의 호가가 고공행진을', '<pad>[new_id_27] 고공 행진을[new_id_26] 아파트 호가가 고공 행진', '<pad>[new_id_27] 고공 행진을[new_id_26] 아파트 값이 고공 행진', '<pad>[new_id_27] 고공 행진을[new_id_26] 아파트의 호가가 고공 행', '<pad>[new_id_27] 고공 행진을[new_id_26] 아파트 호가가 고공행진을']
['<pad>[new_id_27] 천만 원 이상 오르고 어떤 아파트는 호가가 천만 ', '<pad>[new_id_27] 천만 원 이상 오르고 어떤 아파트는 호가가 천만[new_id_26]', '<pad>[new_id_27] 천만 원 이상 오르고 어떤 아파트는 호가가 천 만', '<pad>[new_id_27] 천만 원에서 천만 원까지 오르는[new_id_26] 아파트 가격 하락', '<pad>[new_id_27] 천만 원 이상 오르고 어떤 아파트는 호가가 천 원']
['<pad>[new_id_27] 천만 원 이상 오르는[new_id_26] 아파트 값이 천만 원', '<pad>[new_id_27] 천만 원 이상 오르는[new_id_26] 아파트 값이 천만 원에', '<pad>[new_id_27] 천만 원 이상 오르는[new_id_26] 아파트 값이 오르는 등 부동산', '<pad>[new_id_27] 고공 행진을 이어가고[new_id_26] 아파트 값이 하락하는 등', '<pad>[new_id_27] 고공 행진을 하고[new_id_26] 아파트 값이 하락하는 등']
```

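Each decoded candidate interleaves the generated sentinel tokens with the text predicted for the corresponding masked span. If you want the span fills themselves rather than the raw decoded strings, a small helper like the following can split them out. This helper is our own illustration, not part of the repository:

```python
import re

def extract_span_fills(decoded: str) -> dict:
    """Map each sentinel token to the text generated for its span.

    For a string like '<pad>[new_id_27] foo[new_id_26] bar', returns
    {'[new_id_27]': 'foo', '[new_id_26]': 'bar'}.
    """
    # Drop special tokens, then split on sentinel tokens while keeping
    # the sentinels themselves (capturing group in re.split).
    text = decoded.replace("<pad>", "").replace("</s>", "")
    parts = re.split(r"(\[new_id_\d+\])", text)
    fills = {}
    current = None
    for part in parts:
        if re.fullmatch(r"\[new_id_\d+\]", part):
            current = part
            fills[current] = ""
        elif current is not None:
            fills[current] = part.strip()
    return fills
```

Applied to the first candidate above, this would pair `[new_id_27]` and `[new_id_26]` with the two generated fills.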
- During pretraining, sentinel tokens are inserted in the order [new_id_27]...[new_id_1]<extra_token_0>...<extra_token_99> to keep compatibility with existing T5 models. For details of the training procedure, please refer to [this post](https://daehankim.blogspot.com/2022/08/lassl-feat-t5-ul2.html).
- The license is MIT.
- Training logs can be viewed [here](https://wandb.ai/lucas01/huggingface?workspace=user-lucas01).
- If you have any questions about the model or the data, please contact `kdh5852 [at] gmail [dot] com` and I will get back to you.

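To make the sentinel ordering above concrete, here is a rough sketch of T5-style span corruption using that ordering: the first masked span gets `[new_id_27]`, the next `[new_id_26]`, and so on. The function and its signature are illustrative only; lassl's actual pipeline also samples span positions and lengths and covers the `<extra_token_*>` range:

```python
def corrupt_spans(tokens, spans, first_sentinel=27):
    """Build (encoder_input, decoder_target) with descending sentinel ids.

    `spans` is a sorted list of non-overlapping (start, end) index pairs
    to mask out of `tokens`.
    """
    enc, dec = [], []
    cursor = 0
    sid = first_sentinel
    for start, end in spans:
        sentinel = f"[new_id_{sid}]"
        enc.extend(tokens[cursor:start])  # keep text before the span
        enc.append(sentinel)              # replace the span with a sentinel
        dec.append(sentinel)              # target: sentinel, then the span text
        dec.extend(tokens[start:end])
        cursor = end
        sid -= 1
    enc.extend(tokens[cursor:])           # keep the tail
    return enc, dec
```

With two masked spans, the encoder sees `[new_id_27]` and `[new_id_26]` in place of the spans, and the decoder target reproduces each span after its sentinel, matching the generation example above.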
## Acknowledgement

- This project was carried out with TPU support from the TFRC program.