---
license: mit
language:
- ko
---
# Kconvo-roberta: Korean conversation RoBERTa (github)
- There are many PLMs (Pretrained Language Models) for Korean, but most of them are trained on written language.
- Here, we introduce a retrained PLM for Korean conversation tasks, trained on spoken-language data.
## Usage
```python
# Kconvo-roberta
from transformers import RobertaTokenizerFast, RobertaModel

# Load the tokenizer and encoder weights from the Hugging Face Hub
tokenizer_roberta = RobertaTokenizerFast.from_pretrained("yeongjoon/Kconvo-roberta")
model_roberta = RobertaModel.from_pretrained("yeongjoon/Kconvo-roberta")
```
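Once loaded, the model can be used like any other RoBERTa encoder. A minimal sketch of getting contextual embeddings follows; the example sentence is ours, not from the training data:

```python
import torch

# Encode a Korean conversational sentence
inputs = tokenizer_roberta("안녕하세요, 오늘 뭐 해요?", return_tensors="pt")

# Forward pass without gradient tracking (inference only)
with torch.no_grad():
    outputs = model_roberta(**inputs)

# Contextual token embeddings: (batch_size, sequence_length, hidden_size)
print(outputs.last_hidden_state.shape)
```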
## Domain Robust Retraining of Pretrained Language Model
- Kconvo-roberta uses klue/roberta-base as the base model and was additionally retrained on conversation datasets (a minimal sketch of such retraining follows the dataset list below).
- The retraining datasets were collected from the National Institute of the Korean Language and AI-Hub, and are listed below.
- National Institute of the Korean Language
  * Online Conversation Corpus 2021 (온라인 대화 말뭉치 2021)
  * Daily Conversation Corpus 2020 (일상 대화 말뭉치 2020)
  * Spoken Corpus (구어 말뭉치)
  * Messenger Corpus (메신저 말뭉치)
- AI-Hub
  * Online Colloquial Corpus Data (온라인 구어체 말뭉치 데이터)
  * Counseling Speech (상담 음성)
  * Korean Speech (한국어 음성)
  * Free Conversation Speech, general male/female (자유대화 음성(일반남여))
  * Daily Life and Colloquial Korean-English Parallel Translation Corpus Data (일상생활 및 구어체 한-영 번역 병렬 말뭉치 데이터)
  * Korean Conversation Speech (한국인 대화음성)
  * Emotional Conversation Corpus (감성 대화 말뭉치)
  * Topic-specific Daily Text Conversation Data (주제별 텍스트 일상 대화 데이터)
  * Purpose-specific Goal-oriented Conversation Data (용도별 목적대화 데이터)
  * Korean SNS (한국어 SNS)
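The exact retraining recipe is not published here. The sketch below shows how such domain-adaptive retraining could in principle be reproduced: continue masked language modeling from klue/roberta-base on plain-text conversation data. The file name, batch size, learning rate, and epoch count are illustrative assumptions, not the authors' settings:

```python
from datasets import load_dataset
from transformers import (
    DataCollatorForLanguageModeling,
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    Trainer,
    TrainingArguments,
)

# Start from the same base checkpoint the authors used
tokenizer = RobertaTokenizerFast.from_pretrained("klue/roberta-base")
model = RobertaForMaskedLM.from_pretrained("klue/roberta-base")

# "conversations.txt" is a hypothetical file with one utterance per line
dataset = load_dataset("text", data_files={"train": "conversations.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

# Standard MLM objective: dynamically mask 15% of tokens each batch
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="kconvo-roberta-retrained",  # illustrative output path
    per_device_train_batch_size=16,         # assumed hyperparameters
    num_train_epochs=1,
    learning_rate=5e-5,
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
).train()
```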