---
pipeline_tag: summarization
language:
- ko
tags:
- T5
---
# t5-base-korean-summarization
This is a [T5](https://huggingface.co/docs/transformers/model_doc/t5) model for Korean text summarization.
It was fine-tuned from the ['paust/pko-t5-base'](https://huggingface.co/paust/pko-t5-base) model on the following three datasets:
- [Korean Paper Summarization Dataset(๋…ผ๋ฌธ์ž๋ฃŒ ์š”์•ฝ)](https://www.aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&aihubDataSe=realm&dataSetSn=90)
- [Korean Book Summarization Dataset(๋„์„œ์ž๋ฃŒ ์š”์•ฝ)](https://www.aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&aihubDataSe=realm&dataSetSn=93)
- [Korean Summary statement and Report Generation Dataset(์š”์•ฝ๋ฌธ ๋ฐ ๋ ˆํฌํŠธ ์ƒ์„ฑ ๋ฐ์ดํ„ฐ)](https://www.aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&aihubDataSe=realm&dataSetSn=90)
# Usage (Hugging Face Transformers)
```python
import nltk
nltk.download('punkt')  # sentence tokenizer used below to keep only the first sentence of the output

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the fine-tuned checkpoint and its tokenizer
model = AutoModelForSeq2SeqLM.from_pretrained('eenzeenee/t5-base-korean-summarization')
tokenizer = AutoTokenizer.from_pretrained('eenzeenee/t5-base-korean-summarization')

# T5-style task prefix prepended to every input
prefix = "summarize: "
sample = """
์•ˆ๋…•ํ•˜์„ธ์š”? ์šฐ๋ฆฌ (2ํ•™๋…„)/(์ด ํ•™๋…„) ์นœ๊ตฌ๋“ค ์šฐ๋ฆฌ ์นœ๊ตฌ๋“ค ํ•™๊ต์— ๊ฐ€์„œ ์ง„์งœ (2ํ•™๋…„)/(์ด ํ•™๋…„) ์ด ๋˜๊ณ  ์‹ถ์—ˆ๋Š”๋ฐ ํ•™๊ต์— ๋ชป ๊ฐ€๊ณ  ์žˆ์–ด์„œ ๋‹ต๋‹ตํ•˜์ฃ ?
๊ทธ๋ž˜๋„ ์šฐ๋ฆฌ ์นœ๊ตฌ๋“ค์˜ ์•ˆ์ „๊ณผ ๊ฑด๊ฐ•์ด ์ตœ์šฐ์„ ์ด๋‹ˆ๊นŒ์š” ์˜ค๋Š˜๋ถ€ํ„ฐ ์„ ์ƒ๋‹˜์ด๋ž‘ ๋งค์ผ ๋งค์ผ ๊ตญ์–ด ์—ฌํ–‰์„ ๋– ๋‚˜๋ณด๋„๋ก ํ•ด์š”.
์–ด/ ์‹œ๊ฐ„์ด ๋ฒŒ์จ ์ด๋ ‡๊ฒŒ ๋๋‚˜์š”? ๋Šฆ์—ˆ์–ด์š”. ๋Šฆ์—ˆ์–ด์š”. ๋นจ๋ฆฌ ๊ตญ์–ด ์—ฌํ–‰์„ ๋– ๋‚˜์•ผ ๋ผ์š”.
๊ทธ๋Ÿฐ๋ฐ ์–ด/ ๊ตญ์–ด์—ฌํ–‰์„ ๋– ๋‚˜๊ธฐ ์ „์— ์šฐ๋ฆฌ๊ฐ€ ์ค€๋น„๋ฌผ์„ ์ฑ™๊ฒจ์•ผ ๋˜๊ฒ ์ฃ ? ๊ตญ์–ด ์—ฌํ–‰์„ ๋– ๋‚  ์ค€๋น„๋ฌผ, ๊ต์•ˆ์„ ์–ด๋–ป๊ฒŒ ๋ฐ›์„ ์ˆ˜ ์žˆ๋Š”์ง€ ์„ ์ƒ๋‹˜์ด ์„ค๋ช…์„ ํ•ด์ค„๊ฒŒ์š”.
(EBS)/(์ด๋น„์—์Šค) ์ดˆ๋“ฑ์„ ๊ฒ€์ƒ‰ํ•ด์„œ ๋“ค์–ด๊ฐ€๋ฉด์š” ์ฒซํ™”๋ฉด์ด ์ด๋ ‡๊ฒŒ ๋‚˜์™€์š”.
์ž/ ๊ทธ๋Ÿฌ๋ฉด์š” ์—ฌ๊ธฐ (X)/(์—‘์Šค) ๋ˆŒ๋Ÿฌ์ฃผ(๊ณ ์š”)/(๊ตฌ์š”). ์ €๊ธฐ (๋™๊ทธ๋ผ๋ฏธ)/(๋˜ฅ๊ทธ๋ผ๋ฏธ) (EBS)/(์ด๋น„์—์Šค) (2์ฃผ)/(์ด ์ฃผ) ๋ผ์ด๋ธŒํŠน๊ฐ•์ด๋ผ๊ณ  ๋˜์–ด์žˆ์ฃ ?
๊ฑฐ๊ธฐ๋ฅผ ๋ฐ”๋กœ ๊ฐ€๊ธฐ๋ฅผ ๋ˆ„๋ฆ…๋‹ˆ๋‹ค. ์ž/ (๋ˆ„๋ฅด๋ฉด์š”)/(๋ˆŒ๋ฅด๋ฉด์š”). ์–ด๋–ป๊ฒŒ ๋˜๋ƒ? b/ ๋ฐ‘์œผ๋กœ ๋‚ด๋ ค์š” ๋‚ด๋ ค์š” ๋‚ด๋ ค์š” ์ญ‰ ๋‚ด๋ ค์š”.
์šฐ๋ฆฌ ๋ช‡ ํ•™๋…„์ด์ฃ ? ์•„/ (2ํ•™๋…„)/(์ด ํ•™๋…„) ์ด์ฃ  (2ํ•™๋…„)/(์ด ํ•™๋…„)์˜ ๋ฌด์Šจ ๊ณผ๋ชฉ? ๊ตญ์–ด.
์ด๋ฒˆ์ฃผ๋Š” (1์ฃผ)/(์ผ ์ฃผ) ์ฐจ๋‹ˆ๊นŒ์š” ์—ฌ๊ธฐ ๊ต์•ˆ. ๋‹ค์Œ์ฃผ๋Š” ์—ฌ๊ธฐ์„œ ๋‹ค์šด์„ ๋ฐ›์œผ๋ฉด ๋ผ์š”.
์ด ๊ต์•ˆ์„ ํด๋ฆญ์„ ํ•˜๋ฉด, ์งœ์ž”/. ์ด๋ ‡๊ฒŒ ๊ต์žฌ๊ฐ€ ๋‚˜์˜ต๋‹ˆ๋‹ค .์ด ๊ต์•ˆ์„ (๋‹ค์šด)/(๋”ฐ์šด)๋ฐ›์•„์„œ ์šฐ๋ฆฌ ๊ตญ์–ด์—ฌํ–‰์„ ๋– ๋‚  ์ˆ˜๊ฐ€ ์žˆ์–ด์š”.
๊ทธ๋Ÿผ ์šฐ๋ฆฌ ์ง„์งœ๋กœ ๊ตญ์–ด ์—ฌํ–‰์„ ํ•œ๋ฒˆ ๋– ๋‚˜๋ณด๋„๋ก ํ•ด์š”? ๊ตญ์–ด์—ฌํ–‰ ์ถœ๋ฐœ. ์ž/ (1๋‹จ์›)/(์ผ ๋‹จ์›) ์ œ๋ชฉ์ด ๋ญ”๊ฐ€์š”? ํ•œ๋ฒˆ ์ฐพ์•„๋ด์š”.
์‹œ๋ฅผ ์ฆ๊ฒจ์š” ์—์š”. ๊ทธ๋ƒฅ ์‹œ๋ฅผ ์ฝ์–ด์š” ๊ฐ€ ์•„๋‹ˆ์—์š”. ์‹œ๋ฅผ ์ฆ๊ฒจ์•ผ ๋ผ์š” ์ฆ๊ฒจ์•ผ ๋ผ. ์–ด๋–ป๊ฒŒ ์ฆ๊ธธ๊นŒ? ์ผ๋‹จ์€ ๋‚ด๋‚ด ์‹œ๋ฅผ ์ฆ๊ธฐ๋Š” ๋ฐฉ๋ฒ•์— ๋Œ€ํ•ด์„œ ๊ณต๋ถ€๋ฅผ ํ•  ๊ฑด๋ฐ์š”.
๊ทธ๋Ÿผ ์˜ค๋Š˜์€์š” ์–ด๋–ป๊ฒŒ ์ฆ๊ธธ๊นŒ์š”? ์˜ค๋Š˜ ๊ณต๋ถ€ํ•  ๋‚ด์šฉ์€์š” ์‹œ๋ฅผ ์—ฌ๋Ÿฌ ๊ฐ€์ง€ ๋ฐฉ๋ฒ•์œผ๋กœ ์ฝ๊ธฐ๋ฅผ ๊ณต๋ถ€ํ• ๊ฒ๋‹ˆ๋‹ค.
์–ด๋–ป๊ฒŒ ์—ฌ๋Ÿฌ๊ฐ€์ง€ ๋ฐฉ๋ฒ•์œผ๋กœ ์ฝ์„๊นŒ ์šฐ๋ฆฌ ๊ณต๋ถ€ํ•ด ๋ณด๋„๋ก ํ•ด์š”. ์˜ค๋Š˜์˜ ์‹œ ๋‚˜์™€๋ผ ์งœ์ž”/! ์‹œ๊ฐ€ ๋‚˜์™”์Šต๋‹ˆ๋‹ค ์‹œ์˜ ์ œ๋ชฉ์ด ๋ญ”๊ฐ€์š”? ๋‹คํˆฐ ๋‚ ์ด์—์š” ๋‹คํˆฐ ๋‚ .
๋ˆ„๊ตฌ๋ž‘ ๋‹คํ‰œ๋‚˜ ๋™์ƒ์ด๋ž‘ ๋‹คํ‰œ๋‚˜ ์–ธ๋‹ˆ๋ž‘ ์นœ๊ตฌ๋ž‘? ๋ˆ„๊ตฌ๋ž‘ ๋‹คํ‰œ๋Š”์ง€ ์„ ์ƒ๋‹˜์ด ์‹œ๋ฅผ ์ฝ์–ด ์ค„ ํ…Œ๋‹ˆ๊นŒ ํ•œ๋ฒˆ ์ƒ๊ฐ์„ ํ•ด๋ณด๋„๋ก ํ•ด์š”."""
# Tokenize the prefixed input and generate a summary
inputs = [prefix + sample]
inputs = tokenizer(inputs, max_length=512, truncation=True, return_tensors="pt")
output = model.generate(**inputs, num_beams=3, do_sample=True, min_length=10, max_length=64)

# Decode and keep only the first sentence of the generated text
decoded_output = tokenizer.batch_decode(output, skip_special_tokens=True)[0]
result = nltk.sent_tokenize(decoded_output.strip())[0]
print('RESULT >>', result)
RESULT >> ๊ตญ์–ด ์—ฌํ–‰์„ ๋– ๋‚˜๊ธฐ ์ „์— ๊ตญ์–ด ์—ฌํ–‰์„ ๋– ๋‚  ์ค€๋น„๋ฌผ๊ณผ ๊ต์•ˆ์„ ์–ด๋–ป๊ฒŒ ๋ฐ›์„ ์ˆ˜ ์žˆ๋Š”์ง€ ์„ ์ƒ๋‹˜์ด ์„ค๋ช…ํ•ด ์ค€๋‹ค.
```
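For quick experiments, the same checkpoint can also be run through the high-level `pipeline` API. This is a minimal sketch (not part of the original card): the `summarize: ` prefix still has to be added by hand, and `sample` refers to the transcript defined in the example above.

```python
from transformers import pipeline

# Summarization pipeline over the same checkpoint (illustrative sketch)
summarizer = pipeline("summarization", model="eenzeenee/t5-base-korean-summarization")

# Reuse the `sample` transcript from the example above, with the T5 task prefix
print(summarizer("summarize: " + sample, min_length=10, max_length=64)[0]["summary_text"])
```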
# Evaluation Results
- Korean Paper Summarization Dataset(๋…ผ๋ฌธ์ž๋ฃŒ ์š”์•ฝ)
```
ROUGE-2-R 0.09868624890432466
ROUGE-2-P 0.9666714545849712
ROUGE-2-F 0.17250881441169427
```
- Korean Book Summarization Dataset(๋„์„œ์ž๋ฃŒ ์š”์•ฝ)
```
ROUGE-2-R 0.1575686156943213
ROUGE-2-P 0.9718318136896944
ROUGE-2-F 0.26548116834852586
```
- Korean Summary statement and Report Generation Dataset(์š”์•ฝ๋ฌธ ๋ฐ ๋ ˆํฌํŠธ ์ƒ์„ฑ ๋ฐ์ดํ„ฐ)
```
ROUGE-2-R 0.0987891733555808
ROUGE-2-P 0.9276946867981899
ROUGE-2-F 0.17726493110448185
```
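The exact evaluation script is not included in this card. As a rough illustration of what the three numbers mean (recall, precision, and F1 over overlapping bigrams), here is a minimal whitespace-tokenized sketch; the reported scores were presumably computed with a proper ROUGE implementation and Korean-aware tokenization, so this is not the original setup.

```python
from collections import Counter

def bigrams(text):
    tokens = text.split()
    return [tuple(tokens[i:i + 2]) for i in range(len(tokens) - 1)]

def rouge2(reference, prediction):
    # Clipped bigram overlap between reference and prediction
    ref, pred = Counter(bigrams(reference)), Counter(bigrams(prediction))
    overlap = sum((ref & pred).values())
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(pred.values()), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-12)
    return recall, precision, f1

# Placeholder strings; real scores depend on the dataset and tokenization.
r, p, f = rouge2("์ฐธ์กฐ ์š”์•ฝ๋ฌธ ์˜ˆ์‹œ์ž…๋‹ˆ๋‹ค", "๋ชจ๋ธ์ด ์ƒ์„ฑํ•œ ์š”์•ฝ๋ฌธ ์˜ˆ์‹œ์ž…๋‹ˆ๋‹ค")
print("ROUGE-2-R", r, "ROUGE-2-P", p, "ROUGE-2-F", f)
```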
# Training
The model was trained with the following parameters:
### training arguments
```
Seq2SeqTrainingArguments(
per_device_train_batch_size=8,
per_device_eval_batch_size=8,
auto_find_batch_size=False,
weight_decay=0.01,
learning_rate=4e-05,
lr_scheduler_type=linear,
num_train_epochs=3,
fp16=True)
```
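For reference, the same hyperparameters can be written out as an executable `Seq2SeqTrainingArguments` object. This is a sketch: the output directory is an assumption, and dataset preparation is not shown.

```python
from transformers import Seq2SeqTrainingArguments

# Hyperparameters reported above; `output_dir` is an assumption, not stated in this card.
training_args = Seq2SeqTrainingArguments(
    output_dir="t5-base-korean-summarization",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    auto_find_batch_size=False,
    weight_decay=0.01,
    learning_rate=4e-05,
    lr_scheduler_type="linear",
    num_train_epochs=3,
    fp16=True,
)
# These arguments would then be passed to a Seq2SeqTrainer together with the
# tokenized summarization datasets and a DataCollatorForSeq2Seq.
```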
# Model Architecture
```
T5ForConditionalGeneration(
(shared): Embedding(50358, 768)
(encoder): T5Stack(
(embed_tokens): Embedding(50358, 768)
(block): ModuleList(
(0): T5Block(
(layer): ModuleList(
(0): T5LayerSelfAttention(
(SelfAttention): T5Attention(
(q): Linear(in_features=768, out_features=768, bias=False)
(k): Linear(in_features=768, out_features=768, bias=False)
(v): Linear(in_features=768, out_features=768, bias=False)
(o): Linear(in_features=768, out_features=768, bias=False)
(relative_attention_bias): Embedding(32, 12)
)
(layer_norm): T5LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
(1): T5LayerFF(
(DenseReluDense): T5DenseGatedActDense(
(wi_0): Linear(in_features=768, out_features=2048, bias=False)
(wi_1): Linear(in_features=768, out_features=2048, bias=False)
(wo): Linear(in_features=2048, out_features=768, bias=False)
(dropout): Dropout(p=0.1, inplace=False)
(act): NewGELUActivation()
)
(layer_norm): T5LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
)
)
(1~11): T5Block(
(layer): ModuleList(
(0): T5LayerSelfAttention(
(SelfAttention): T5Attention(
(q): Linear(in_features=768, out_features=768, bias=False)
(k): Linear(in_features=768, out_features=768, bias=False)
(v): Linear(in_features=768, out_features=768, bias=False)
(o): Linear(in_features=768, out_features=768, bias=False)
)
(layer_norm): T5LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
(1): T5LayerFF(
(DenseReluDense): T5DenseGatedActDense(
(wi_0): Linear(in_features=768, out_features=2048, bias=False)
(wi_1): Linear(in_features=768, out_features=2048, bias=False)
(wo): Linear(in_features=2048, out_features=768, bias=False)
(dropout): Dropout(p=0.1, inplace=False)
(act): NewGELUActivation()
)
(layer_norm): T5LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
)
)
)
(final_layer_norm): T5LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
(decoder): T5Stack(
(embed_tokens): Embedding(50358, 768)
(block): ModuleList(
(0): T5Block(
(layer): ModuleList(
(0): T5LayerSelfAttention(
(SelfAttention): T5Attention(
(q): Linear(in_features=768, out_features=768, bias=False)
(k): Linear(in_features=768, out_features=768, bias=False)
(v): Linear(in_features=768, out_features=768, bias=False)
(o): Linear(in_features=768, out_features=768, bias=False)
(relative_attention_bias): Embedding(32, 12)
)
(layer_norm): T5LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
(1): T5LayerCrossAttention(
(EncDecAttention): T5Attention(
(q): Linear(in_features=768, out_features=768, bias=False)
(k): Linear(in_features=768, out_features=768, bias=False)
(v): Linear(in_features=768, out_features=768, bias=False)
(o): Linear(in_features=768, out_features=768, bias=False)
)
(layer_norm): T5LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
(2): T5LayerFF(
(DenseReluDense): T5DenseGatedActDense(
(wi_0): Linear(in_features=768, out_features=2048, bias=False)
(wi_1): Linear(in_features=768, out_features=2048, bias=False)
(wo): Linear(in_features=2048, out_features=768, bias=False)
(dropout): Dropout(p=0.1, inplace=False)
(act): NewGELUActivation()
)
(layer_norm): T5LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
)
)
(1~11): T5Block(
(layer): ModuleList(
(0): T5LayerSelfAttention(
(SelfAttention): T5Attention(
(q): Linear(in_features=768, out_features=768, bias=False)
(k): Linear(in_features=768, out_features=768, bias=False)
(v): Linear(in_features=768, out_features=768, bias=False)
(o): Linear(in_features=768, out_features=768, bias=False)
)
(layer_norm): T5LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
(1): T5LayerCrossAttention(
(EncDecAttention): T5Attention(
(q): Linear(in_features=768, out_features=768, bias=False)
(k): Linear(in_features=768, out_features=768, bias=False)
(v): Linear(in_features=768, out_features=768, bias=False)
(o): Linear(in_features=768, out_features=768, bias=False)
)
(layer_norm): T5LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
(2): T5LayerFF(
(DenseReluDense): T5DenseGatedActDense(
(wi_0): Linear(in_features=768, out_features=2048, bias=False)
(wi_1): Linear(in_features=768, out_features=2048, bias=False)
(wo): Linear(in_features=2048, out_features=768, bias=False)
(dropout): Dropout(p=0.1, inplace=False)
(act): NewGELUActivation()
)
(layer_norm): T5LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
)
)
(final_layer_norm): T5LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
(lm_head): Linear(in_features=768, out_features=50358, bias=False)
)
```
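The module tree above is the standard PyTorch string representation of the loaded model and can be reproduced directly:

```python
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("eenzeenee/t5-base-korean-summarization")
print(model)  # prints the T5ForConditionalGeneration module tree shown above
```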
# Citation
- Raffel, Colin, et al. "Exploring the limits of transfer learning with a unified text-to-text transformer." J. Mach. Learn. Res. 21.140 (2020): 1-67.