---
license: mit
language:
- ko
---
# Kconvo-roberta: Korean conversation RoBERTa ([github](https://github.com/HeoTaksung/Domain-Robust-Retraining-of-Pretrained-Language-Model))
- There are many PLMs (Pretrained Language Models) for Korean, but most of them are trained on written language.
- Here, we introduce a retrained PLM for Korean conversational data, trained on spoken-language corpora.
## Usage
```python
# Kconvo-roberta
from transformers import RobertaTokenizerFast, RobertaModel
tokenizer_roberta = RobertaTokenizerFast.from_pretrained("yeongjoon/Kconvo-roberta")
model_roberta = RobertaModel.from_pretrained("yeongjoon/Kconvo-roberta")
```
-----------------
## Domain Robust Retraining of Pretrained Language Model
- Kconvo-roberta uses [klue/roberta-base](https://huggingface.co/klue/roberta-base) as the base model and was additionally retrained on conversation datasets.
- The retraining data were collected from the [National Institute of the Korean Language](https://corpus.korean.go.kr/request/corpusRegist.do) and [AI-Hub](https://www.aihub.or.kr/aihubdata/data/list.do?pageIndex=1&currMenu=115&topMenu=100&dataSetSn=&srchdataClCode=DATACL001&srchOrder=&SrchdataClCode=DATACL002&searchKeyword=&srchDataRealmCode=REALM002&srchDataTy=DATA003); the collected datasets are listed below.
```
- National Institute of the Korean Language
    * 온라인 λŒ€ν™” λ§λ­‰μΉ˜ 2021 (Online Dialogue Corpus 2021)
    * 일상 λŒ€ν™” λ§λ­‰μΉ˜ 2020 (Everyday Conversation Corpus 2020)
    * ꡬ어 λ§λ­‰μΉ˜ (Spoken Language Corpus)
    * λ©”μ‹ μ € λ§λ­‰μΉ˜ (Messenger Corpus)
- AI-Hub
    * 온라인 ꡬ어체 λ§λ­‰μΉ˜ 데이터 (Online Colloquial Corpus Data)
    * 상담 μŒμ„± (Counseling Speech)
    * ν•œκ΅­μ–΄ μŒμ„± (Korean Speech)
    * μžμœ λŒ€ν™” μŒμ„±(μΌλ°˜λ‚¨μ—¬) (Free Conversation Speech, general male/female speakers)
    * μΌμƒμƒν™œ 및 ꡬ어체 ν•œ-영 λ²ˆμ—­ 병렬 λ§λ­‰μΉ˜ 데이터 (Everyday Life and Colloquial Korean-English Parallel Translation Corpus Data)
    * ν•œκ΅­μΈ λŒ€ν™”μŒμ„± (Korean Conversational Speech)
    * 감성 λŒ€ν™” λ§λ­‰μΉ˜ (Emotional Dialogue Corpus)
    * μ£Όμ œλ³„ ν…μŠ€νŠΈ 일상 λŒ€ν™” 데이터 (Topic-based Everyday Text Conversation Data)
    * μš©λ„λ³„ λͺ©μ λŒ€ν™” 데이터 (Purpose-specific Goal-oriented Dialogue Data)
    * ν•œκ΅­μ–΄ SNS (Korean SNS)
```