---
pipeline_tag: summarization
language:
- ko
tags:
- T5
---

# t5-base-korean-summarization

This is a [T5](https://huggingface.co/docs/transformers/model_doc/t5) model for Korean text summarization.

- Fine-tuned from the ['paust/pko-t5-base'](https://huggingface.co/paust/pko-t5-base) model.
- Fine-tuned on the following three datasets:
  - [Korean Paper Summarization Dataset (논문자료 요약)](https://www.aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&aihubDataSe=realm&dataSetSn=90)
  - [Korean Book Summarization Dataset (도서자료 요약)](https://www.aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&aihubDataSe=realm&dataSetSn=93)
  - [Korean Summary statement and Report Generation Dataset (요약문 및 레포트 생성 데이터)](https://www.aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&aihubDataSe=realm&dataSetSn=90)

# Usage (HuggingFace Transformers)

```python
import nltk
nltk.download('punkt')

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub.
model = AutoModelForSeq2SeqLM.from_pretrained('eenzeenee/t5-base-korean-summarization')
tokenizer = AutoTokenizer.from_pretrained('eenzeenee/t5-base-korean-summarization')

# T5-style task prefix the model was trained with.
prefix = "summarize: "
sample = """
    안녕하세요? 우리 (2학년)/(이 학년) 친구들 우리 친구들 학교에 가서 진짜 (2학년)/(이 학년) 이 되고 싶었는데 학교에 못 가고 있어서 답답하죠?
    그래도 우리 친구들의 안전과 건강이 최우선이니까요 오늘부터 선생님이랑 매일 매일 국어 여행을 떠나보도록 해요.
    어/ 시간이 벌써 이렇게 됐나요? 늦었어요. 늦었어요. 빨리 국어 여행을 떠나야 돼요.
    그런데 어/ 국어여행을 떠나기 전에 우리가 준비물을 챙겨야 되겠죠? 국어 여행을 떠날 준비물, 교안을 어떻게 받을 수 있는지 선생님이 설명을 해줄게요.
    (EBS)/(이비에스) 초등을 검색해서 들어가면요 첫화면이 이렇게 나와요.
    자/ 그러면요 여기 (X)/(엑스) 눌러주(고요)/(구요). 저기 (동그라미)/(똥그라미) (EBS)/(이비에스) (2주)/(이 주) 라이브특강이라고 되어있죠?
    거기를 바로 가기를 누릅니다. 자/ (누르면요)/(눌르면요). 어떻게 되냐? b/ 밑으로 내려요 내려요 내려요 쭉 내려요.
    우리 몇 학년이죠? 아/ (2학년)/(이 학년) 이죠 (2학년)/(이 학년)의 무슨 과목? 국어.
    이번주는 (1주)/(일 주) 차니까요 여기 교안. 다음주는 여기서 다운을 받으면 돼요.
    이 교안을 클릭을 하면, 짜잔/. 이렇게 교재가 나옵니다 .이 교안을 (다운)/(따운)받아서 우리 국어여행을 떠날 수가 있어요.
    그럼 우리 진짜로 국어 여행을 한번 떠나보도록 해요? 국어여행 출발. 자/ (1단원)/(일 단원) 제목이 뭔가요? 한번 찾아봐요.
    시를 즐겨요 예요. 그냥 시를 읽어요 가 아니에요. 시를 즐겨야 돼요 즐겨야 돼. 어떻게 즐길까? 일단은 내내 시를 즐기는 방법에 대해서 공부를 할 건데요.
    그럼 오늘은요 어떻게 즐길까요? 오늘 공부할 내용은요 시를 여러 가지 방법으로 읽기를 공부할겁니다.
    어떻게 여러가지 방법으로 읽을까 우리 공부해 보도록 해요. 오늘의 시 나와라 짜잔/! 시가 나왔습니다 시의 제목이 뭔가요? 다툰 날이에요 다툰 날.
    누구랑 다퉜나 동생이랑 다퉜나 언니랑 친구랑? 누구랑 다퉜는지 선생님이 시를 읽어 줄 테니까 한번 생각을 해보도록 해요."""

inputs = [prefix + sample]

# Tokenize, generate a summary, and decode it back to text.
inputs = tokenizer(inputs, max_length=512, truncation=True, return_tensors="pt")
output = model.generate(**inputs, num_beams=3, do_sample=True, min_length=10, max_length=64)
decoded_output = tokenizer.batch_decode(output, skip_special_tokens=True)[0]
# Keep only the first sentence of the decoded output.
result = nltk.sent_tokenize(decoded_output.strip())[0]

print('RESULT >>', result)

# RESULT >> 국어 여행을 떠나기 전에 국어 여행을 떠날 준비물과 교안을 어떻게 받을 수 있는지 선생님이 설명해 준다.
```
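
For quick experiments, the same checkpoint can also be run through the high-level `pipeline` API. This is a minimal sketch rather than part of the original recipe; the generation settings below simply mirror the example above and are not prescribed by this card.

```python
from transformers import pipeline

# Load the checkpoint through the summarization pipeline.
summarizer = pipeline("summarization", model="eenzeenee/t5-base-korean-summarization")

# The model expects the "summarize: " task prefix.
# Reuses `sample` from the snippet above; generation settings are illustrative.
result = summarizer("summarize: " + sample, num_beams=3, min_length=10, max_length=64)
print(result[0]["summary_text"])
```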

# Evaluation Results

- Korean Paper Summarization Dataset (논문자료 요약)
```
ROUGE-2-R 0.09868624890432466
ROUGE-2-P 0.9666714545849712
ROUGE-2-F 0.17250881441169427
```
- Korean Book Summarization Dataset (도서자료 요약)
```
ROUGE-2-R 0.1575686156943213
ROUGE-2-P 0.9718318136896944
ROUGE-2-F 0.26548116834852586
```
- Korean Summary statement and Report Generation Dataset (요약문 및 레포트 생성 데이터)
```
ROUGE-2-R 0.0987891733555808
ROUGE-2-P 0.9276946867981899
ROUGE-2-F 0.17726493110448185
```
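
The card does not state which ROUGE implementation or tokenization produced these numbers. Purely as an illustration of the metric, here is a self-contained sketch of ROUGE-2 recall, precision, and F1 computed over whitespace tokens (an assumption; the actual evaluation tokenizer is not specified):

```python
from collections import Counter

def rouge2(reference: str, prediction: str):
    """ROUGE-2 over whitespace tokens: bigram recall, precision, F1."""
    def bigrams(text: str) -> Counter:
        tokens = text.split()
        return Counter(zip(tokens, tokens[1:]))

    ref, pred = bigrams(reference), bigrams(prediction)
    overlap = sum((ref & pred).values())  # clipped bigram matches
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(pred.values()), 1)
    f1 = 2 * recall * precision / (recall + precision) if overlap else 0.0
    return recall, precision, f1

# Toy example (hypothetical strings, not drawn from the evaluation sets).
r, p, f = rouge2("국어 여행을 떠날 준비물을 챙긴다", "국어 여행을 떠날 준비를 한다")
print(f"ROUGE-2-R {r:.4f}  ROUGE-2-P {p:.4f}  ROUGE-2-F {f:.4f}")
```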

# Training

The model was trained with the following parameters:

- training arguments
```
Seq2SeqTrainingArguments(
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    auto_find_batch_size=False,
    weight_decay=0.01,
    learning_rate=4e-05,
    lr_scheduler_type=linear,
    num_train_epochs=3,
    fp16=True)
```
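
These arguments are not a complete setup on their own (for example, `output_dir` and the tokenized datasets are omitted). A minimal sketch of how they could be wired into `Seq2SeqTrainer`, with a tiny stand-in dataset where the AI Hub corpora would go, might look like this; `output_dir` and the preprocessing lengths are assumptions, not values stated in the card:

```python
from datasets import Dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("paust/pko-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("paust/pko-t5-base")

def preprocess(batch):
    # Prefix inputs as in the usage example; max lengths mirror that example.
    model_inputs = tokenizer(["summarize: " + t for t in batch["text"]],
                             max_length=512, truncation=True)
    model_inputs["labels"] = tokenizer(batch["summary"],
                                       max_length=64, truncation=True)["input_ids"]
    return model_inputs

# Tiny stand-in for the tokenized AI Hub datasets (hypothetical rows).
raw = Dataset.from_dict({
    "text": ["국어 여행을 떠나기 전에 준비물을 챙겨야 한다."],
    "summary": ["준비물을 챙긴다."],
})
train_dataset = raw.map(preprocess, batched=True, remove_columns=raw.column_names)

args = Seq2SeqTrainingArguments(
    output_dir="t5-base-korean-summarization",  # assumption: not stated in the card
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    auto_find_batch_size=False,
    weight_decay=0.01,
    learning_rate=4e-05,
    lr_scheduler_type="linear",
    num_train_epochs=3,
    fp16=True,
)

trainer = Seq2SeqTrainer(model=model, args=args,
                         train_dataset=train_dataset,
                         data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
                         tokenizer=tokenizer)
trainer.train()
```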

# Model Architecture

```
T5ForConditionalGeneration(
  (shared): Embedding(50358, 768)
  (encoder): T5Stack(
    (embed_tokens): Embedding(50358, 768)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=768, out_features=768, bias=False)
              (k): Linear(in_features=768, out_features=768, bias=False)
              (v): Linear(in_features=768, out_features=768, bias=False)
              (o): Linear(in_features=768, out_features=768, bias=False)
              (relative_attention_bias): Embedding(32, 12)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseGatedActDense(
              (wi_0): Linear(in_features=768, out_features=2048, bias=False)
              (wi_1): Linear(in_features=768, out_features=2048, bias=False)
              (wo): Linear(in_features=2048, out_features=768, bias=False)
              (dropout): Dropout(p=0.1, inplace=False)
              (act): NewGELUActivation()
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
      )
      (1~11): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=768, out_features=768, bias=False)
              (k): Linear(in_features=768, out_features=768, bias=False)
              (v): Linear(in_features=768, out_features=768, bias=False)
              (o): Linear(in_features=768, out_features=768, bias=False)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseGatedActDense(
              (wi_0): Linear(in_features=768, out_features=2048, bias=False)
              (wi_1): Linear(in_features=768, out_features=2048, bias=False)
              (wo): Linear(in_features=2048, out_features=768, bias=False)
              (dropout): Dropout(p=0.1, inplace=False)
              (act): NewGELUActivation()
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
      )
    )
    (final_layer_norm): T5LayerNorm()
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (decoder): T5Stack(
    (embed_tokens): Embedding(50358, 768)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=768, out_features=768, bias=False)
              (k): Linear(in_features=768, out_features=768, bias=False)
              (v): Linear(in_features=768, out_features=768, bias=False)
              (o): Linear(in_features=768, out_features=768, bias=False)
              (relative_attention_bias): Embedding(32, 12)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerCrossAttention(
            (EncDecAttention): T5Attention(
              (q): Linear(in_features=768, out_features=768, bias=False)
              (k): Linear(in_features=768, out_features=768, bias=False)
              (v): Linear(in_features=768, out_features=768, bias=False)
              (o): Linear(in_features=768, out_features=768, bias=False)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (2): T5LayerFF(
            (DenseReluDense): T5DenseGatedActDense(
              (wi_0): Linear(in_features=768, out_features=2048, bias=False)
              (wi_1): Linear(in_features=768, out_features=2048, bias=False)
              (wo): Linear(in_features=2048, out_features=768, bias=False)
              (dropout): Dropout(p=0.1, inplace=False)
              (act): NewGELUActivation()
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
      )
      (1~11): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=768, out_features=768, bias=False)
              (k): Linear(in_features=768, out_features=768, bias=False)
              (v): Linear(in_features=768, out_features=768, bias=False)
              (o): Linear(in_features=768, out_features=768, bias=False)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerCrossAttention(
            (EncDecAttention): T5Attention(
              (q): Linear(in_features=768, out_features=768, bias=False)
              (k): Linear(in_features=768, out_features=768, bias=False)
              (v): Linear(in_features=768, out_features=768, bias=False)
              (o): Linear(in_features=768, out_features=768, bias=False)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (2): T5LayerFF(
            (DenseReluDense): T5DenseGatedActDense(
              (wi_0): Linear(in_features=768, out_features=2048, bias=False)
              (wi_1): Linear(in_features=768, out_features=2048, bias=False)
              (wo): Linear(in_features=2048, out_features=768, bias=False)
              (dropout): Dropout(p=0.1, inplace=False)
              (act): NewGELUActivation()
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
      )
    )
    (final_layer_norm): T5LayerNorm()
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (lm_head): Linear(in_features=768, out_features=50358, bias=False)
)
```
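
The dump above is PyTorch's standard module printout. Assuming the model is loaded as in the usage example, it can be reproduced with:

```python
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("eenzeenee/t5-base-korean-summarization")
print(model)  # prints the module tree shown above
```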

# Citation

- Raffel, Colin, et al. "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer." Journal of Machine Learning Research 21.140 (2020): 1-67.