---
license: mit
language:
- ko
---

# Kconvo-roberta: Korean conversation RoBERTa ([github](https://github.com/HeoTaksung/Domain-Robust-Retraining-of-Pretrained-Language-Model))

- There are many PLMs (Pretrained Language Models) for Korean, but most of them are trained on written language.

- Here, we introduce a PLM retrained on spoken (conversational) data, intended for prediction tasks on Korean conversation data.

## Usage

```python
# Kconvo-roberta
from transformers import RobertaTokenizerFast, RobertaModel

# Load the tokenizer and the retrained model from the Hugging Face Hub
tokenizer_roberta = RobertaTokenizerFast.from_pretrained("yeongjoon/Kconvo-roberta")
model_roberta = RobertaModel.from_pretrained("yeongjoon/Kconvo-roberta")
```
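
As a minimal smoke test, the loaded encoder can be run on a single utterance to obtain contextual embeddings (the sample sentence below is illustrative, not from the training data):

```python
import torch

# Encode a sample Korean utterance and run the model
inputs = tokenizer_roberta("안녕하세요, 오늘 뭐 했어요?", return_tensors="pt")
with torch.no_grad():
    outputs = model_roberta(**inputs)

# Contextual token embeddings: (batch_size, sequence_length, hidden_size)
print(outputs.last_hidden_state.shape)
```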

-----------------

## Domain Robust Retraining of Pretrained Language Model

- Kconvo-roberta uses [klue/roberta-base](https://huggingface.co/klue/roberta-base) as the base model and was additionally retrained on conversation data (a sketch of this retraining step follows the dataset list below).

- The retraining dataset was collected through the [National Institute of the Korean Language](https://corpus.korean.go.kr/request/corpusRegist.do) and [AI-Hub](https://www.aihub.or.kr/aihubdata/data/list.do?pageIndex=1&currMenu=115&topMenu=100&dataSetSn=&srchdataClCode=DATACL001&srchOrder=&SrchdataClCode=DATACL002&searchKeyword=&srchDataRealmCode=REALM002&srchDataTy=DATA003), and the collected datasets are as follows.

```
- National Institute of the Korean Language
    * 온라인 대화 말뭉치 2021 (Online Conversation Corpus 2021)
    * 일상 대화 말뭉치 2020 (Daily Conversation Corpus 2020)
    * 구어 말뭉치 (Spoken Language Corpus)
    * 메신저 말뭉치 (Messenger Corpus)

- AI-Hub
    * 온라인 구어체 말뭉치 데이터 (Online Colloquial Corpus Data)
    * 상담 음성 (Counseling Speech)
    * 한국어 음성 (Korean Speech)
    * 자유대화 음성(일반남여) (Free Conversation Speech, General Male/Female)
    * 일상생활 및 구어체 한-영 번역 병렬 말뭉치 데이터 (Daily Life and Colloquial Korean-English Parallel Translation Corpus Data)
    * 한국인 대화음성 (Korean Conversation Speech)
    * 감성 대화 말뭉치 (Emotional Conversation Corpus)
    * 주제별 텍스트 일상 대화 데이터 (Topic-specific Text Daily Conversation Data)
    * 용도별 목적대화 데이터 (Purpose-oriented Dialogue Data)
    * 한국어 SNS (Korean SNS)
```
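
The exact retraining script is not published in this card. The sketch below shows the general recipe of domain-adaptive retraining, i.e. continuing masked-language-model pretraining of klue/roberta-base on conversation text with the 🤗 `Trainer`. The corpus file name (`conversations.txt`), output path, and all hyperparameters are placeholders, not the authors' settings:

```python
from datasets import load_dataset
from transformers import (
    DataCollatorForLanguageModeling,
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    Trainer,
    TrainingArguments,
)

# Start from the written-language base model
tokenizer = RobertaTokenizerFast.from_pretrained("klue/roberta-base")
model = RobertaForMaskedLM.from_pretrained("klue/roberta-base")

# "conversations.txt" is a placeholder for the collected conversation corpus,
# one utterance (or dialogue turn) per line
dataset = load_dataset("text", data_files={"train": "conversations.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Dynamic token masking, as in RoBERTa pretraining
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="kconvo-roberta-retrained",  # placeholder output path
        per_device_train_batch_size=32,         # illustrative hyperparameters
        num_train_epochs=1,
        learning_rate=5e-5,
    ),
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()

# The retrained encoder can then be saved and loaded as shown in Usage above
trainer.save_model("kconvo-roberta-retrained")
```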