
😎 KoChatBART

BART (Bidirectional and Auto-Regressive Transformers) is trained as an autoencoder: noise is added to part of the input text and the model learns to reconstruct the original. Korean chat BART (hereafter KoChatBART) is a Korean encoder-decoder language model trained on more than roughly 10GB of Korean dialogue text, using the Text Infilling noise function from the BART paper. We release the resulting model, KoChatBART-base, which is robust for dialogue generation.
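As a rough illustration of the Text Infilling noise mentioned above, the sketch below replaces token spans with a single mask token, drawing span lengths from a Poisson distribution. The values `mask_ratio=0.3` and `lam=3.0` are the ones reported in the BART paper; this README does not say which values KoChatBART used. The function names, the `start_prob` parameter, and the `<mask>` string are assumptions made for this example, not the authors' pre-training code.

```python
import math
import random
from typing import List


def sample_poisson(lam: float, rng: random.Random) -> int:
    """Draw one sample from Poisson(lam) using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def text_infilling(
    tokens: List[str],
    mask_token: str = "<mask>",   # assumed mask string
    mask_ratio: float = 0.3,      # share of tokens that may be hidden
    lam: float = 3.0,             # Poisson mean for span lengths
    start_prob: float = 0.15,     # hypothetical chance of starting a span at each position
    seed: int = 0,
) -> List[str]:
    """Replace randomly chosen spans with ONE mask token each (single left-to-right pass)."""
    rng = random.Random(seed)
    budget = round(len(tokens) * mask_ratio)  # maximum number of tokens to hide
    noisy: List[str] = []
    i = 0
    while i < len(tokens):
        if budget > 0 and rng.random() < start_prob:
            span = min(sample_poisson(lam, rng), budget, len(tokens) - i)
            noisy.append(mask_token)  # the whole span collapses into one mask
            budget -= span
            i += span
            if span == 0:
                # A zero-length span inserts a mask without hiding anything.
                noisy.append(tokens[i])
                i += 1
        else:
            noisy.append(tokens[i])
            i += 1
    return noisy
```

Because the sketch makes a single pass, it may hide fewer tokens than the budget allows. The model is then trained to regenerate the original sequence from the noisy one, so it must also infer how many tokens each mask stands for.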

Quick tour

from transformers import AutoTokenizer, BartForConditionalGeneration
  
tokenizer = AutoTokenizer.from_pretrained("BM-K/KoChatBART")
model = BartForConditionalGeneration.from_pretrained("BM-K/KoChatBART")

inputs = tokenizer("안녕 세상아!", return_tensors="pt")
outputs = model(**inputs)

사전 ν•™μŠ΅ 데이터 μ „μ²˜λ¦¬

μ‚¬μš©ν•œ 데이터셋

KoChatBARTλ₯Ό ν•™μŠ΅μ‹œν‚€κΈ° μœ„ν•˜μ—¬ ν•œκ΅­μ–΄ λŒ€ν™” 데이터셋듀을 μ „μ²˜λ¦¬ ν›„ 합쳐 λŒ€λŸ‰μ˜ ν•œκ΅­μ–΄ λŒ€ν™” λ§λ­‰μΉ˜λ₯Ό λ§Œλ“€μ—ˆμŠ΅λ‹ˆλ‹€.

  1. To reduce redundancy in the data, any expression repeated two or more times in a row, such as 'ㅋㅋㅋㅋㅋㅋ', was collapsed to exactly two repetitions, as in 'ㅋㅋ'.
  2. Because very short samples can hinder training, we kept only samples whose total length exceeds 3 tokens under the KoBART tokenizer.
  3. Pseudonymized data was removed.
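The three steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline. The regular expressions, the `#@...#` pseudonym marker, the order of the steps, and the function names are all assumptions. The repeat rule here only handles runs of a single non-digit, non-space character, so numbers such as `1000` are left alone.

```python
import re
from typing import Callable, Iterable, List

# Runs of 3+ identical characters (digits and whitespace excluded by assumption).
_REPEAT = re.compile(r"([^\d\s])\1{2,}")
# Hypothetical pseudonymization marker such as "#@이름#"; the README does not give the format.
_PSEUDONYM = re.compile(r"#@[^#]+#")


def normalize_repeats(text: str) -> str:
    """Collapse a long run of one character to exactly two occurrences."""
    return _REPEAT.sub(r"\1\1", text)


def preprocess(
    utterances: Iterable[str],
    tokenize: Callable[[str], List[str]],
    min_tokens: int = 3,
) -> List[str]:
    """Drop pseudonymized samples, collapse repeats, keep samples longer than min_tokens."""
    kept: List[str] = []
    for text in utterances:
        if _PSEUDONYM.search(text):
            continue  # step 3: remove pseudonymized data
        text = normalize_repeats(text)  # step 1: shorten repeated expressions
        if len(tokenize(text)) > min_tokens:  # step 2: token length must exceed 3
            kept.append(text)
    return kept
```

`tokenize` is passed in so the sketch runs without downloading a model. With a Hugging Face tokenizer loaded, its `tokenize` method can be supplied instead of a whitespace splitter.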

Model

| Model | # of params | vocab size | Type | # of layers | # of heads | ffn_dim | hidden_dims |
|---|---|---|---|---|---|---|---|
| KoChatBART | 139M | 50265 | Encoder | 6 | 16 | 3072 | 768 |
| | | | Decoder | 6 | 16 | 3072 | 768 |
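As a sanity check on the table, the parameter count can be estimated from the listed dimensions. The sketch below assumes the usual BART layout, which the README does not spell out: linear layers with biases, one shared token embedding tied to the output layer, learned positional embeddings with 1026 rows in both encoder and decoder, and a layer norm after each embedding. The number of heads does not affect the count.

```python
def linear(n_in: int, n_out: int) -> int:
    """Weights plus bias of a dense layer."""
    return n_in * n_out + n_out


def layer_norm(d: int) -> int:
    """Scale and shift vectors."""
    return 2 * d


def attention(d: int) -> int:
    """Query, key, value, and output projections."""
    return 4 * linear(d, d)


def ffn(d: int, d_ff: int) -> int:
    """Two dense layers of the feed-forward block."""
    return linear(d, d_ff) + linear(d_ff, d)


def bart_param_count(
    vocab: int = 50265,
    d: int = 768,
    d_ff: int = 3072,
    enc_layers: int = 6,
    dec_layers: int = 6,
    positions: int = 1026,  # assumed: 1024 positions plus an offset of 2
) -> int:
    # Encoder layer: self-attention, feed-forward, two layer norms.
    enc_layer = attention(d) + ffn(d, d_ff) + 2 * layer_norm(d)
    # Decoder layer: self-attention, cross-attention, feed-forward, three layer norms.
    dec_layer = 2 * attention(d) + ffn(d, d_ff) + 3 * layer_norm(d)
    # Shared token embedding, two positional tables, two embedding layer norms.
    embeddings = vocab * d + 2 * positions * d + 2 * layer_norm(d)
    return embeddings + enc_layers * enc_layer + dec_layers * dec_layer


print(f"{bart_param_count() / 1e6:.1f}M parameters")
```

Under these assumptions the estimate lands at about 139M, in line with the table.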

Dialogue generation performance

Each model was fine-tuned using the Dialogue Generator code. To measure dialogue generation performance, the tokenized responses generated at inference time were first restored to plain text. A BPE tokenizer was then used to measure overlap (BLEU) and distinct (Dist-n) between the reference responses and the generated responses.
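For readers unfamiliar with the distinct metric, a minimal sketch of Dist-n follows: the number of unique n-grams divided by the total number of n-grams in the generated responses. Pooling n-grams over the whole test set, rather than averaging per response, is an assumption here. The evaluation script is not shown in this README, so the function names are illustrative only.

```python
from typing import List, Sequence, Tuple


def ngrams(tokens: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    """All contiguous n-grams of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def distinct_n(responses: Sequence[Sequence[str]], n: int) -> float:
    """Unique n-grams divided by total n-grams, pooled over all responses."""
    pooled = [gram for tokens in responses for gram in ngrams(tokens, n)]
    if not pooled:
        return 0.0
    return len(set(pooled)) / len(pooled)


# Toy example with already-tokenized responses.
generated = [["나", "도", "좋아"], ["나", "도", "싫어"]]
print(distinct_n(generated, 1), distinct_n(generated, 2))
```

Higher values mean less repetitive output. Scores reported on a 0 to 100 scale would correspond to this ratio multiplied by 100.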

Warning
일반적으둜 짧은 λŒ€ν™” λ°μ΄ν„°λ‘œ λͺ¨λΈμ„ μ‚¬μ „ν•™μŠ΅ν•˜μ˜€κΈ° λ•Œλ¬Έμ— κΈ΄ λ¬Έμž₯ μ²˜λ¦¬κ°€ μš”κ΅¬λ˜λŠ” νƒœμŠ€ν¬(μš”μ•½) 등에 λŒ€ν•΄μ„œλŠ” μ•½ν•œ λͺ¨μŠ΅μ„ λ³΄μž…λ‹ˆλ‹€.

Experimental results

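| Training | Validation | Test |
|---|---|---|
| 9,458 | 1,182 | 1,183 |

| Model | Param | BLEU-3 | BLEU-4 | Dist-1 | Dist-2 |
|---|---|---|---|---|---|
| KoBART | 124M | 8.73 | 7.12 | 16.85 | 34.89 |
| KoChatBART | 139M | 12.97 | 11.23 | 19.64 | 44.53 |
| KoT5-ETRI | 324M | 12.10 | 10.14 | 16.97 | 40.09 |

| Training | Validation | Test |
|---|---|---|
| 29,093 | 1,616 | 1,616 |

| Model | Param | BLEU-3 | BLEU-4 | Dist-1 | Dist-2 |
|---|---|---|---|---|---|
| KoBART | 124M | 10.04 | 7.24 | 13.76 | 42.09 |
| KoChatBART | 139M | 10.11 | 7.26 | 15.12 | 46.08 |
| KoT5-ETRI | 324M | 9.45 | 6.66 | 14.50 | 45.46 |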
