# 😎 KoChatBART
[**BART**](https://arxiv.org/pdf/1910.13461.pdf) (**B**idirectional and **A**uto-**R**egressive **T**ransformers) is trained as an `autoencoder`: noise is added to part of the input text and the model learns to reconstruct the original. Korean Chat BART (**KoChatBART**) is a Korean `encoder-decoder` language model trained on roughly **10GB** of Korean dialogue text using the `Text Infilling` noise function from the paper. We release the resulting `KoChatBART-base`, which is robust for dialogue generation.
<img src="https://user-images.githubusercontent.com/55969260/205434343-b72641e9-d0f9-4b88-a334-9f904e0a35c5.png">
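The figure above illustrates the pretraining objective. As a rough sketch of `Text Infilling`, the BART paper samples span lengths from a Poisson distribution (λ = 3) and replaces each span with a single mask token; the simplified code below is only an illustration of that procedure, not the actual training code:
```python
import numpy as np

def text_infilling(tokens, mask_token="<mask>", mask_ratio=0.3, lam=3):
    # Replace spans of length ~ Poisson(lam) with a single mask token until
    # roughly mask_ratio of the tokens have been masked (BART's Text Infilling).
    # A zero-length span simply inserts a mask token. Simplified illustration.
    tokens = list(tokens)
    budget = int(len(tokens) * mask_ratio)
    masked = 0
    while masked < budget:
        span = min(int(np.random.poisson(lam)), len(tokens))
        start = int(np.random.randint(0, len(tokens) - span + 1))
        tokens[start:start + span] = [mask_token]
        masked += max(span, 1)  # guarantee progress even for zero-length spans
    return tokens

print(text_infilling("였늘 날씨 정말 μ’‹λ‹€ μš°λ¦¬ 산책 ν• κΉŒ".split()))
```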
## Quick tour
```python
from transformers import AutoTokenizer, BartForConditionalGeneration

# Load the pretrained KoChatBART checkpoint from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("BM-K/KoChatBART")
model = BartForConditionalGeneration.from_pretrained("BM-K/KoChatBART")

inputs = tokenizer("μ•ˆλ…• 세상아!", return_tensors="pt")
outputs = model(**inputs)  # forward pass; returns logits over the vocabulary
```
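To generate an actual response, a minimal sketch using the standard `generate` API (the generation parameters below are illustrative, not values tuned by the authors):
```python
input_ids = tokenizer("μ•ˆλ…• 세상아!", return_tensors="pt").input_ids
output_ids = model.generate(input_ids, max_length=64, num_beams=5, early_stopping=True)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```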
## Pretraining data preprocessing
Datasets used:
- [Topic-based everyday conversation text data](https://aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&aihubDataSe=realm&dataSetSn=543)
- [Small business customer order Q&A text](https://aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&aihubDataSe=realm&dataSetSn=102)
- [Korean SNS](https://aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&aihubDataSe=realm&dataSetSn=114)
- [Civil complaint task automation AI language data](https://aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&aihubDataSe=realm&dataSetSn=619)
To train KoChatBART, these Korean dialogue datasets were preprocessed and merged into a large Korean conversation corpus.
1. λ°μ΄ν„°μ˜ 쀑볡을 쀄이기 μœ„ν•΄ 'γ…‹γ…‹γ…‹γ…‹γ…‹γ…‹'와 같은 μ€‘λ³΅λœ ν‘œν˜„μ΄ 2번 이상 반볡될 λ•ŒλŠ” 'γ…‹γ…‹'와 같이 2번으둜 λ°”κΏ¨μŠ΅λ‹ˆλ‹€.
2. λ„ˆλ¬΄ 짧은 λ°μ΄ν„°λŠ” ν•™μŠ΅μ— λ°©ν•΄κ°€ 될 수 있기 λ•Œλ¬Έμ— KoBART ν† ν¬λ‚˜μ΄μ € κΈ°μ€€ 전체 토큰 길이가 3을 λ„˜λŠ” λ°μ΄ν„°λ§Œμ„ μ„ λ³„ν–ˆμŠ΅λ‹ˆλ‹€.
3. κ°€λͺ…μ²˜λ¦¬λœ λ°μ΄ν„°λŠ” μ œκ±°ν•˜μ˜€μŠ΅λ‹ˆλ‹€.
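A minimal sketch of steps 1 and 2. The exact rules are not published, so the regex and the tokenizer checkpoint below are assumptions (the original filtering used the KoBART tokenizer):
```python
import re
from transformers import AutoTokenizer

# Stand-in for the KoBART tokenizer used in the original filtering (assumption).
tokenizer = AutoTokenizer.from_pretrained("BM-K/KoChatBART")

def collapse_repeats(text: str) -> str:
    # Step 1: collapse any character repeated three or more times down to two,
    # e.g. 'γ…‹γ…‹γ…‹γ…‹γ…‹γ…‹' -> 'γ…‹γ…‹'. (Assumed rule.)
    return re.sub(r"(.)\1{2,}", r"\1\1", text)

def long_enough(text: str) -> bool:
    # Step 2: keep only examples longer than 3 tokens.
    return len(tokenizer(text, add_special_tokens=False).input_ids) > 3

corpus = ["γ…‹γ…‹γ…‹γ…‹γ…‹γ…‹ μ•ˆλ…•ν•˜μ„Έμš”!!!!", "응"]
cleaned = [t for t in map(collapse_repeats, corpus) if long_enough(t)]
```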
## Model
| Model | # of params | vocab size | Type | # of layers | # of heads | ffn_dim | hidden_dim |
| :----------: | :---------: | :--------: | :-----: | :---------: | :--------: | :-----: | :--------: |
| `KoChatBART` | 139M | 50265 | Encoder | 6 | 16 | 3072 | 768 |
| | | | Decoder | 6 | 16 | 3072 | 768 |
## Dialogue generation performance
Each model was fine-tuned on top of the following code: [Dialogue Generator](https://github.com/2unju/KoBART_Dialogue_Generator). To measure dialogue generation performance, the tokenized responses generated at inference time were first detokenized, and then overlap (BLEU) and diversity (Distinct) between the reference responses and the generated responses were measured using a BPE tokenizer; a sketch of the Distinct metric follows below.
> **Warning** <br>
> Because the model was pretrained mostly on short conversational data, it is weak on tasks that require handling long sequences, such as summarization.
### Experimental results
- [Emotional dialogue data](https://github.com/songys/Chatbot_data)

| Training | Validation | Test |
|:----:|:----:|:----:|
| 9,458 | 1,182 | 1,183 |

| Model | Param | BLEU-3 | BLEU-4 | Dist-1 | Dist-2 |
|------------------------|:----:|:----:|:----:|:----:|:----:|
| KoBART | 124M | 8.73 | 7.12 | 16.85 | 34.89 |
| KoChatBART | 139M | **12.97** | **11.23** | **19.64** | **44.53** |
| KoT5-ETRI | 324M | 12.10 | 10.14 | 16.97 | 40.09 |
- [μ†Œμƒκ³΅μΈ λŒ€ν™” 데이터](https://github.com/2unju/AIHub_Chitchat_dataset_parser)
|Training|Validation|Test|
|:----:|:----:|:----:|
|29,093|1,616|1,616|
| Model | Param | BLEU-3 | BLEU-4 | Dist-1 | Dist-2 |
|------------------------|:----:|:----:|:----:|:----:|:----:|
| KoBART | 124M | 10.04 | 7.24 | 13.76| 42.09 |
| KoChatBART | 139M | **10.11** | **7.26** | **15.12** | **46.08** |
| KoT5-ETRI | 324M | 9.45 | 6.66 | 14.50 | 45.46 |
## Contributors
<a href="https://github.com/BM-K/KoChatBART/graphs/contributors">
<img src="https://contrib.rocks/image?repo=BM-K/KoChatBART" />
</a>
## Reference
- [KoBART](https://github.com/SKT-AI/KoBART)