---
pipeline_tag: summarization
language:
- ko
tags:
- T5
---

# t5-base-korean-summarization

This is a [T5](https://huggingface.co/docs/transformers/model_doc/t5) model for Korean text summarization.

It was fine-tuned from the ['paust/pko-t5-base'](https://huggingface.co/paust/pko-t5-base) model on the three datasets listed below (a preprocessing sketch follows the list).

- [Korean Paper Summarization Dataset (논문자료 요약)](https://www.aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&aihubDataSe=realm&dataSetSn=90)
- [Korean Book Summarization Dataset (도서자료 요약)](https://www.aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&aihubDataSe=realm&dataSetSn=93)
- [Korean Summary statement and Report Generation Dataset (요약문 및 레포트 생성 데이터)](https://www.aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&aihubDataSe=realm&dataSetSn=90)
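
The card does not show its preprocessing, so the following is only a minimal sketch of how (passage, summary) pairs are typically serialized for T5-style fine-tuning, using the same `summarize: ` prefix as in the Usage section below. The field names `passage` and `summary` are hypothetical placeholders, not the actual AI Hub schema:

```python
def to_features(example, tokenizer, max_input=512, max_target=64):
    """Turn one {'passage': ..., 'summary': ...} record into model inputs.

    Hypothetical sketch: field names and lengths are illustrative only.
    """
    model_inputs = tokenizer(
        "summarize: " + example["passage"],  # same task prefix as in Usage
        max_length=max_input,
        truncation=True,
    )
    labels = tokenizer(
        text_target=example["summary"],  # tokenized as the decoder target
        max_length=max_target,
        truncation=True,
    )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
```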

# Usage (HuggingFace Transformers)

```python
import nltk
nltk.download('punkt')

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained('eenzeenee/t5-base-korean-summarization')
tokenizer = AutoTokenizer.from_pretrained('eenzeenee/t5-base-korean-summarization')

# The model was trained with a task prefix, so prepend it to every input.
prefix = "summarize: "
sample = """
    안녕하세요? 우리 (2학년)/(이 학년) 친구들 우리 친구들 학교에 가서 진짜 (2학년)/(이 학년) 이 되고 싶었는데 학교에 못 가고 있어서 답답하죠?
    그래도 우리 친구들의 안전과 건강이 최우선이니까요 오늘부터 선생님이랑 매일 매일 국어 여행을 떠나보도록 해요.
    어/ 시간이 벌써 이렇게 됐나요? 늦었어요. 늦었어요. 빨리 국어 여행을 떠나야 돼요.
    그런데 어/ 국어여행을 떠나기 전에 우리가 준비물을 챙겨야 되겠죠? 국어 여행을 떠날 준비물, 교안을 어떻게 받을 수 있는지 선생님이 설명을 해줄게요.
    (EBS)/(이비에스) 초등을 검색해서 들어가면요 첫화면이 이렇게 나와요.
    자/ 그러면요 여기 (X)/(엑스) 눌러주(고요)/(구요). 저기 (동그라미)/(똥그라미) (EBS)/(이비에스) (2주)/(이 주) 라이브특강이라고 되어있죠?
    거기를 바로 가기를 누릅니다. 자/ (누르면요)/(눌르면요). 어떻게 되냐? b/ 밑으로 내려와요 내려와 내려와 내려와 쭉 내려와.
    우리 몇 학년이죠? 아/ (2학년)/(이 학년) 이죠 (2학년)/(이 학년)의 무슨 과목? 국어.
    이번주는 (1주)/(일 주) 차니까요 여기 교안. 다음주는 여기서 다운을 받으면 돼요.
    이 교안을 클릭을 하면, 짜잔/. 이렇게 교재가 나옵니다 .이 교안을 (다운)/(따운)받아서 우리 국어여행을 떠날 수가 있어요.
    그럼 우리 진짜로 국어 여행을 한번 떠나보도록 해요? 국어여행 출발. 자/ (1단원)/(일 단원) 제목이 뭔가요? 한번 찾아봐요.
    시를 즐겨요 에요. 그냥 시를 읽어요 가 아니에요. 시를 즐겨야 돼요 즐겨야 돼. 어떻게 즐길까? 일단은 내내 시를 즐기는 방법에 대해서 공부를 할 건데요.
    그럼 오늘은요 어떻게 즐길까요? 오늘 공부할 내용은요 시를 여러 가지 방법으로 읽기를 공부할겁니다.
    어떻게 여러가지 방법으로 읽을까 우리 공부해 보도록 해요. 오늘의 시 나와라 짜잔/! 시가 나왔습니다 시의 제목이 뭔가요? 다툰 날이에요 다툰 날.
    누구랑 다퉜나 동생이랑 다퉜나 언니랑 친구랑? 누구랑 다퉜는지 선생님이 시를 읽어 줄 테니까 한번 생각을 해보도록 해요."""

inputs = [prefix + sample]

# Tokenize (inputs longer than 512 tokens are truncated) and summarize.
inputs = tokenizer(inputs, max_length=512, truncation=True, return_tensors="pt")
output = model.generate(**inputs, num_beams=3, do_sample=True, min_length=10, max_length=64)
decoded_output = tokenizer.batch_decode(output, skip_special_tokens=True)[0]

# Keep only the first sentence of the generated summary.
result = nltk.sent_tokenize(decoded_output.strip())[0]

print('RESULT >>', result)

# RESULT >> 국어 여행을 떠나기 전에 국어 여행을 떠날 준비물과 교안을 어떻게 받을 수 있는지 선생님이 설명해 준다.
```
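
Alternatively, the checkpoint can be used through the high-level `pipeline` API. Note that pipelines do not prepend the T5 task prefix automatically, so it is added by hand here (a minimal sketch, reusing `sample` from the snippet above):

```python
from transformers import pipeline

summarizer = pipeline("summarization", model="eenzeenee/t5-base-korean-summarization")

# The "summarize: " prefix must still be added manually.
print(summarizer("summarize: " + sample, min_length=10, max_length=64)[0]["summary_text"])
```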

# Evaluation Results

- Korean Paper Summarization Dataset (논문자료 요약)

```
ROUGE-2-R 0.09868624890432466
ROUGE-2-P 0.9666714545849712
ROUGE-2-F 0.17250881441169427
```

- Korean Book Summarization Dataset (도서자료 요약)

```
ROUGE-2-R 0.1575686156943213
ROUGE-2-P 0.9718318136896944
ROUGE-2-F 0.26548116834852586
```

- Korean Summary statement and Report Generation Dataset (요약문 및 레포트 생성 데이터)

```
ROUGE-2-R 0.0987891733555808
ROUGE-2-P 0.9276946867981899
ROUGE-2-F 0.17726493110448185
```
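
The exact ROUGE implementation and tokenization behind these numbers are not specified here. As one possibility, ROUGE-2 recall/precision/F1 can be computed with Google's `rouge_score` package; its default tokenizer keeps only `[a-z0-9]` tokens and would drop Hangul, so Korean text needs a custom tokenizer. A minimal sketch (the whitespace tokenizer and the two example strings are illustrative assumptions, not the original evaluation setup):

```python
from rouge_score import rouge_scorer

class WhitespaceTokenizer:
    """Trivial tokenizer for illustration; rouge_score's default drops Hangul."""
    def tokenize(self, text):
        return text.split()

# Example strings for illustration only.
reference_summary = "국어 여행을 떠날 준비물과 교안을 받는 방법을 설명한다."
generated_summary = "국어 여행을 떠나기 전에 준비물과 교안을 받는 방법을 설명해 준다."

scorer = rouge_scorer.RougeScorer(["rouge2"], tokenizer=WhitespaceTokenizer())
r2 = scorer.score(reference_summary, generated_summary)["rouge2"]
print("ROUGE-2-R", r2.recall)
print("ROUGE-2-P", r2.precision)
print("ROUGE-2-F", r2.fmeasure)
```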

# Training

The model was trained with the following parameters:

### Training arguments

```python
Seq2SeqTrainingArguments(
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    auto_find_batch_size=False,
    weight_decay=0.01,
    learning_rate=4e-05,
    lr_scheduler_type='linear',
    num_train_epochs=3,
    fp16=True,
)
```
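
For context, here is a minimal sketch of how these arguments could be wired into `Seq2SeqTrainer`. The `output_dir` value and the dataset variables are placeholders; the card does not describe the actual data pipeline:

```python
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("paust/pko-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("paust/pko-t5-base")

args = Seq2SeqTrainingArguments(
    output_dir="t5-base-korean-summarization",  # placeholder, not from the card
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    auto_find_batch_size=False,
    weight_decay=0.01,
    learning_rate=4e-05,
    lr_scheduler_type="linear",
    num_train_epochs=3,
    fp16=True,
)

# Placeholders: tokenized (input, summary) pairs from the AI Hub datasets,
# e.g. built with a mapping function like `to_features` sketched earlier.
train_dataset, eval_dataset = ..., ...

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```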

# Model Architecture

```
T5ForConditionalGeneration(
  (shared): Embedding(50358, 768)
  (encoder): T5Stack(
    (embed_tokens): Embedding(50358, 768)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=768, out_features=768, bias=False)
              (k): Linear(in_features=768, out_features=768, bias=False)
              (v): Linear(in_features=768, out_features=768, bias=False)
              (o): Linear(in_features=768, out_features=768, bias=False)
              (relative_attention_bias): Embedding(32, 12)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseGatedActDense(
              (wi_0): Linear(in_features=768, out_features=2048, bias=False)
              (wi_1): Linear(in_features=768, out_features=2048, bias=False)
              (wo): Linear(in_features=2048, out_features=768, bias=False)
              (dropout): Dropout(p=0.1, inplace=False)
              (act): NewGELUActivation()
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
      )
      (1~11): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=768, out_features=768, bias=False)
              (k): Linear(in_features=768, out_features=768, bias=False)
              (v): Linear(in_features=768, out_features=768, bias=False)
              (o): Linear(in_features=768, out_features=768, bias=False)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseGatedActDense(
              (wi_0): Linear(in_features=768, out_features=2048, bias=False)
              (wi_1): Linear(in_features=768, out_features=2048, bias=False)
              (wo): Linear(in_features=2048, out_features=768, bias=False)
              (dropout): Dropout(p=0.1, inplace=False)
              (act): NewGELUActivation()
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
      )
    )
    (final_layer_norm): T5LayerNorm()
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (decoder): T5Stack(
    (embed_tokens): Embedding(50358, 768)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=768, out_features=768, bias=False)
              (k): Linear(in_features=768, out_features=768, bias=False)
              (v): Linear(in_features=768, out_features=768, bias=False)
              (o): Linear(in_features=768, out_features=768, bias=False)
              (relative_attention_bias): Embedding(32, 12)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerCrossAttention(
            (EncDecAttention): T5Attention(
              (q): Linear(in_features=768, out_features=768, bias=False)
              (k): Linear(in_features=768, out_features=768, bias=False)
              (v): Linear(in_features=768, out_features=768, bias=False)
              (o): Linear(in_features=768, out_features=768, bias=False)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (2): T5LayerFF(
            (DenseReluDense): T5DenseGatedActDense(
              (wi_0): Linear(in_features=768, out_features=2048, bias=False)
              (wi_1): Linear(in_features=768, out_features=2048, bias=False)
              (wo): Linear(in_features=2048, out_features=768, bias=False)
              (dropout): Dropout(p=0.1, inplace=False)
              (act): NewGELUActivation()
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
      )
      (1~11): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=768, out_features=768, bias=False)
              (k): Linear(in_features=768, out_features=768, bias=False)
              (v): Linear(in_features=768, out_features=768, bias=False)
              (o): Linear(in_features=768, out_features=768, bias=False)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerCrossAttention(
            (EncDecAttention): T5Attention(
              (q): Linear(in_features=768, out_features=768, bias=False)
              (k): Linear(in_features=768, out_features=768, bias=False)
              (v): Linear(in_features=768, out_features=768, bias=False)
              (o): Linear(in_features=768, out_features=768, bias=False)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (2): T5LayerFF(
            (DenseReluDense): T5DenseGatedActDense(
              (wi_0): Linear(in_features=768, out_features=2048, bias=False)
              (wi_1): Linear(in_features=768, out_features=2048, bias=False)
              (wo): Linear(in_features=2048, out_features=768, bias=False)
              (dropout): Dropout(p=0.1, inplace=False)
              (act): NewGELUActivation()
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
      )
    )
    (final_layer_norm): T5LayerNorm()
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (lm_head): Linear(in_features=768, out_features=50358, bias=False)
)
```
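
The printout above can be reproduced directly from the checkpoint:

```python
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("eenzeenee/t5-base-korean-summarization")
print(model)  # prints the module tree shown above
print(f"parameters: {sum(p.numel() for p in model.parameters()):,}")
```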

## Citation

- Raffel, Colin, et al. "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer." J. Mach. Learn. Res. 21.140 (2020): 1-67.