Model Card for KorSciDeBERTa

KorSciDeBERTa is a model based on the Microsoft DeBERTa architecture, pre-trained on a total of 146GB of papers, research reports, patents, news, and Korean Wikipedia corpora.

The pre-trained model can be used for masked language modeling or next sentence prediction, and it can also be fine-tuned for downstream tasks such as sentence classification, token classification, or question answering.

Model Details

Model Description

  • Developed by: KISTI
  • Model type: deberta-v2
  • Language(s) (NLP): Korean (ko)

Model Sources

Uses

Downstream Use

Load Huggingface model directly

  1. Installing the morphological analyzer (Mecab) and related components is required - KorSciDeBERTa 환경설치+파인튜닝.pdf (environment setup and fine-tuning guide)
  • Mecab installation reference: see the 'How to use' (사용방법) section at the following link. https://aida.kisti.re.kr/model/9bbabd2d-6ce8-44cc-b2a3-69578d23970a

  • If the following error occurs: SetuptoolsDeprecationWarning: Invalid version: '0.996/ko-0.9.2' - https://datanavigator.tistory.com/54

  • When using Colab, install Mecab as follows (if the user dictionary described above is not also installed, baseline accuracy drops to 0.786):


!git clone https://github.com/SOMJANG/Mecab-ko-for-Google-Colab.git
%cd Mecab-ko-for-Google-Colab/
!bash install_mecab-ko_on_colab_light_220429.sh

  • Fix for the "ImportError: accelerate>=0.20.1" error

!pip install -U accelerate; pip install -U transformers; pip install pydantic==1.8 (restart the runtime after installation)

  • Fix for tokenizer loading errors

Confirm that git-lfs is installed (apt-get install git git-lfs) and that spm.model was downloaded correctly, with a file size of 2.74 MB.
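The check above can be scripted. The following is a minimal sketch, not part of the KorSciDeBERTa code: the helper name check_spm_model and the 1 MB threshold are assumptions chosen for illustration (the card only states that the file should be about 2.74 MB). It relies on the fact that a file left unfetched by git-lfs is a small text pointer beginning with "version https://git-lfs.github.com/spec/v1".

```python
from pathlib import Path

# First bytes of a git-lfs pointer file (what you get when the real file was not fetched).
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def check_spm_model(path, min_bytes=1_000_000):
    """Return a short status string for a downloaded spm.model file.

    min_bytes is an assumed lower bound; the card reports a size of about 2.74 MB.
    """
    p = Path(path)
    if not p.is_file():
        return "missing"
    with p.open("rb") as f:
        head = f.read(len(LFS_POINTER_PREFIX))
    if head == LFS_POINTER_PREFIX:
        return "lfs-pointer"  # git-lfs did not download the real file
    if p.stat().st_size < min_bytes:
        return "too-small"
    return "ok"


# Prints one of: missing, lfs-pointer, too-small, ok
print(check_spm_model("spm.model"))
```

If the result is "lfs-pointer", installing git-lfs and cloning the repository again is the usual remedy.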

Make sure you have git-lfs installed (git lfs install)

  2. apt-get install git-lfs; git clone https://huggingface.co/kisti/korscideberta; cd korscideberta
  • korscideberta-abstractcls.ipynb

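!pip install transformers==4.36.0
# Run from inside the cloned korscideberta folder so the tokenizer module below can be imported.
from tokenization_korscideberta_v2 import DebertaV2Tokenizer
from transformers import AutoModelForSequenceClassification


tokenizer = DebertaV2Tokenizer.from_pretrained("kisti/korscideberta")
# Sequence classification head with 7 labels for fine-tuning.
model = AutoModelForSequenceClassification.from_pretrained("kisti/korscideberta", num_labels=7, hidden_dropout_prob=0.1, attention_probs_dropout_prob=0.1)
# For masked language modeling instead (also import AutoModelForMaskedLM from transformers):
#model = AutoModelForMaskedLM.from_pretrained("kisti/korscideberta")
''''''
# trainer is the Trainer object whose setup is omitted above.
train_metrics = trainer.train().metrics
trainer.save_metrics("train", train_metrics)
trainer.push_to_hub()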
  

KorSciDeBERTa native code

See KorSciDeBERTa 환경설치+파인튜닝.pdf (environment setup and fine-tuning guide).


apt-get install git git-lfs
git clone https://huggingface.co/kisti/korscideberta; cd korscideberta; unzip korscideberta.zip -d korscideberta
''''''
cd korscideberta/experiments/glue; chmod 777 *.sh;
./mnli.sh

Installing KorSciDeBERTa with pip

The code above has to be placed in the working folder and imported from there, which can be inconvenient. A pyproject.toml is therefore provided so that the package can be installed into a virtual (conda) environment with pip, and the imports of normalize.py and unicode.py in tokenization.py are prefixed with "korscideberta." so that the code can be used by importing the korscideberta package.

Running the following command from the ./ location installs the package into the current conda environment under the name korscideberta. The dependencies listed in pyproject.toml (e.g. sentencepiece, mecab, konlpy), including their versions, are checked and installed along with it.

$ pip install .

After installation, the package can be imported and used as follows.


import korscideberta
tokenizer = korscideberta.tokenization_korscideberta_v2.DebertaV2Tokenizer.from_pretrained(path)

Out-of-Scope Use

This model must not be used to intentionally create a hostile or alienating environment for people.

This model must not be used in 'high-risk settings'. It was not designed to make important decisions about people or things. The model's outputs may not be factual.

'High-risk settings' include the following:

Use in the medical, political, legal, or financial domains; evaluation of individuals for employment, education, or credit; automatically deciding important matters; generating (fake) facts; generating summaries that must be highly reliable; generating predictions that must always be correct; and so on.

Bias, Risks, and Limitations

For research purposes, only corpus data free of copyright issues was used. Users of this model should be aware of the risk factors below.

Although the corpora used are mostly neutral in character, the model may, as is typical of language models, include some of the following ethics-related elements:

Over- or under-representation of particular viewpoints, stereotypes, personal information, hateful, insulting, or violent language, discriminatory or biased language, irrelevant or repetitive outputs, and so on.

Training Details

Training Data

Papers, research reports, patents, news, and Korean Wikipedia corpora, 146GB in total

Training Procedure

Trained for 1,600,000 steps over 2.5 months on 24 NVIDIA A100 80G GPUs on the KISTI HPC system

Preprocessing

  • Science and technology domain tokenizer (KorSci Tokenizer)
  • The corpus was preprocessed with a tokenizer that merges the Mecab-ko tokenizer, extended with a user dictionary of about 6 million nouns and compound nouns built from the corpus used for this pre-trained model, with the existing SentencePiece-BPE.
  • Total 128,100 words
  • Included special tokens ( < unk >, < cls >, < s >, < mask > )
  • File name : spm.model, vocab.txt

Training Hyperparameters

  • model_type: deberta-v2
  • model_size: base
  • parameters: 180M
  • hidden_size: 768
  • num_hidden_layers: 12
  • num_attention_heads: 12
  • num_train_steps: 1,600,000
  • train_batch_size: 4,096 * 4 gradient accumulation steps = 16,384 (effective)
  • learning_rate: 1e-4
  • max_seq_length: 512
  • vocab_size: 128,100
  • Training regime: fp16 mixed precision
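The batch-size arithmetic in the list above can be checked with a few lines. This is only a sketch of the numbers already listed; the variable names are illustrative, and the token count is an upper bound that assumes every sequence is filled to max_seq_length, which the card does not state.

```python
# Values taken from the hyperparameter list above.
per_update_batch = 4096      # train_batch_size before accumulation
accumulation_steps = 4       # accumulated updates per optimizer step
max_seq_length = 512

# Sequences contributing to one optimizer update.
effective_batch = per_update_batch * accumulation_steps

# Upper bound on tokens per update (assumes all sequences are full length).
max_tokens_per_update = effective_batch * max_seq_length

print(effective_batch)         # 16384
print(max_tokens_per_update)   # 8388608
```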

Evaluation

Testing Data, Factors & Metrics

Testing Data

The language model was evaluated by fine-tuning it on a paper research-field classification dataset; the results are given below.

  • Paper research-field classification dataset (doi.org/10.23057/50): 30,000 papers; number of categories - top level: 33, mid level: 372, fine-grained: 2,898

Metrics

F1-micro/macro: a prediction counts as a success if at least one of the top-3 gold labels is predicted

F1-strict: success is credited according to the number of top-3 gold labels that are predicted
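One possible reading of the two success criteria, sketched for a single paper with up to three gold labels. The helper names and the example labels are hypothetical, the card does not provide the evaluation code, and the aggregation of these per-paper counts into F1 scores is not shown here.

```python
def lenient_hit(gold, predicted):
    """True if at least one gold label (up to three per paper) was predicted."""
    return len(set(gold) & set(predicted)) >= 1


def strict_hits(gold, predicted):
    """Number of predicted labels that are among the gold labels."""
    return len(set(gold) & set(predicted))


gold = ["A01", "B02", "C03"]   # hypothetical gold labels for one paper
predicted = ["B02", "Z99"]     # hypothetical model predictions

print(lenient_hit(gold, predicted))   # True
print(strict_hits(gold, predicted))   # 1
```

Under this reading the lenient criterion credits the paper once, while the strict criterion credits only the one label that was actually matched.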

Results

F1-micro: 0.85, F1-macro: 0.52, F1-strict: 0.71

Technical Specifications

Model Objective

MLM is a technique in which you take your tokenized sample and replace some of the tokens with the < mask > token and train your model with it. The model then tries to predict what should come in the place of that < mask > token and gradually starts learning about the data. MLM teaches the model about the relationship between words.

E.g. suppose you have the sentence 'Deep Learning is so cool! I love neural networks.' and replace a few words with the < mask > token.

Masked Sentence - 'Deep Learning is so < mask >! I love < mask > networks.'
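The masking step can be sketched in a few lines of Python. This is a simplified illustration, not the pre-training code: it splits on whitespace instead of using the KorSci tokenizer, and the function name mask_tokens, the 15% default masking rate, and the "<mask>" string are assumptions made for the example.

```python
import random


def mask_tokens(tokens, mask_prob=0.15, mask_token="<mask>", seed=0):
    """Replace a random subset of tokens with mask_token.

    Returns (masked_tokens, labels); labels hold the original token at
    masked positions and None elsewhere (positions that are not scored).
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels.append(tok)      # the model is trained to recover this token
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


tokens = "Deep Learning is so cool ! I love neural networks .".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3)
print(" ".join(masked))
```

During training, the loss is computed only at the masked positions, which is why the labels are None everywhere else.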

Compute Infrastructure

NEURON system at the KISTI National Supercomputing Center. HPE ClusterStor E1000, HP Apollo 6500 Gen10 Plus, Lustre, Slurm, CentOS 7.9

Hardware

24 NVIDIA A100 80G GPUs

Software

Python 3.8, CUDA 10.2, PyTorch 1.10

Citation

Korea Institute of Science and Technology Information (2023): Korean science and technology domain DeBERTa pre-trained model (KorSciDeBERTa). Version 1.0. Korea Institute of Science and Technology Information.

Model Card Authors

김성찬, 김경민, 김은희, 이민호, 이승우. AI Data Research Division, Korea Institute of Science and Technology Information (KISTI)

Model Card Contact

김성찬, sckim kisti.re.kr 김경민
