metadata
library_name: transformers
tags:
  - legal
  - roberta
  - phobert
license: apache-2.0
datasets:
  - NghiemAbe/Legal-corpus-indexing
language:
  - vi
pipeline_tag: fill-mask

PhoBERT base model for the legal domain

The experiments were performed with Transformers version 4.38.2.
Vi-Legal-PhoBert is a legal-domain model based on vinai/phobert-base-v2. It was further pretrained with token-level masked language modeling (MLM) for 154,600 steps on the Legal Corpus so that the model adapts to the legal domain.
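
To illustrate what token-level MLM means, here is a minimal pure-Python sketch of the masking step. It is not the training code used for this model. The 15% masking probability and the `<mask>` token string are assumptions (common RoBERTa-style defaults), and the function name `mask_tokens` is hypothetical. The sketch also omits the usual refinement where some selected tokens are kept or replaced by random tokens.

```python
import random


def mask_tokens(tokens, mask_prob=0.15, mask_token="<mask>", seed=None):
    """Token-level masking: each token is selected independently with `mask_prob`.

    Returns (masked_tokens, labels). `labels` holds the original token at
    masked positions and None everywhere else, so the loss would only be
    computed on the masked positions.
    """
    rng = random.Random(seed)  # seeded generator for reproducible masking
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)  # hide the token from the model
            labels.append(token)       # remember what it should predict
        else:
            masked.append(token)
            labels.append(None)
    return masked, labels
```

In practice, `DataCollatorForLanguageModeling` from Transformers performs this step on token IDs; the sketch only shows the idea.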

Usage

Fill mask example:

from transformers import RobertaForMaskedLM, AutoTokenizer

# Load the tokenizer and the masked-LM model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("NghiemAbe/Vi-Legal-PhoBert")
model = RobertaForMaskedLM.from_pretrained("NghiemAbe/Vi-Legal-PhoBert")
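
The snippet above only loads the model. As a hedged sketch of an actual fill-mask call, the helpers below build a masked sentence and run the Transformers `fill-mask` pipeline. The helper names `mask_word` and `predict_masked` are hypothetical, and the default mask token `<mask>` is an assumption (check `tokenizer.mask_token` for the authoritative value). PhoBERT expects word-segmented Vietnamese input, so the example sentence in the usage note joins the syllables of a word with underscores.

```python
def mask_word(words, position, mask_token="<mask>"):
    """Return the sentence with the word at `position` replaced by the mask token."""
    if not 0 <= position < len(words):
        raise IndexError("position is outside the sentence")
    masked = list(words)  # copy so the caller's list is not modified
    masked[position] = mask_token
    return " ".join(masked)


def predict_masked(sentence, top_k=5, model_name="NghiemAbe/Vi-Legal-PhoBert"):
    """Run the fill-mask pipeline; downloads the model on first use."""
    # Imported lazily so that mask_word can be used without transformers installed.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model=model_name, tokenizer=model_name)
    # For a sentence with one mask, each prediction is a dict with the
    # candidate token string and its probability.
    return [(p["token_str"], p["score"]) for p in fill_mask(sentence, top_k=top_k)]
```

For example, `mask_word(["Người", "lao_động", "có", "quyền", "đơn_phương", "chấm_dứt", "hợp_đồng", "lao_động"], 6)` produces a sentence with `hợp_đồng` masked, which can then be passed to `predict_masked`. The sentence is an illustrative example, not taken from the evaluation data.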

Metric

I evaluated the models on my Dev-Legal-Dataset. The results are as follows:

| Model | Parameters | Language Type | Length | R@1 | R@5 | R@10 | R@20 | R@100 | MRR@5 | MRR@10 | MRR@20 | MRR@100 | Accuracy Masked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| vinai/phobert-base-v2 | 125M | vi | 256 | 0.266 | 0.482 | 0.601 | 0.702 | 0.841 | 0.356 | 0.372 | 0.379 | 0.382 | 0.522 |
| FacebookAI/xlm-roberta-base | 279M | mul | 512 | 0.012 | 0.042 | 0.064 | 0.091 | 0.207 | 0.025 | 0.028 | 0.030 | 0.033 | x |
| Geotrend/bert-base-vi-cased | 179M | vi | 512 | 0.098 | 0.175 | 0.202 | 0.241 | 0.356 | 0.131 | 0.136 | 0.139 | 0.142 | x |
| NlpHUST/roberta-base-vn | x | vi | 512 | 0.050 | 0.097 | 0.126 | 0.163 | 0.369 | 0.071 | 0.076 | 0.078 | 0.083 | x |
| aisingapore/sealion-bert-base | x | mul | 512 | 0.002 | 0.007 | 0.021 | 0.036 | 0.106 | 0.003 | 0.005 | 0.006 | 0.008 | x |
| Vi-Legal-PhoBert | 125M | vi | 256 | 0.290 | 0.560 | 0.707 | 0.819 | 0.935 | 0.410 | 0.430 | 0.437 | 0.440 | 0.8401 |
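
This README does not state how R@k and MRR@k were computed. The sketch below shows one common definition for retrieval evaluation, as an assumption rather than the author's evaluation script: R@k is the share of a query's relevant documents found in the top k results, MRR@k is the reciprocal rank of the first relevant document within the top k (0 if none), and both are averaged over queries. All function names are hypothetical.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("each query needs at least one relevant document")
    return len(relevant & set(ranked_ids[:k])) / len(relevant)


def mrr_at_k(ranked_ids, relevant_ids, k):
    """Reciprocal rank of the first relevant document in the top-k, else 0."""
    relevant = set(relevant_ids)
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_over_queries(metric, runs, k):
    """Average a per-query metric over (ranked_ids, relevant_ids) pairs."""
    if not runs:
        raise ValueError("no queries to evaluate")
    return sum(metric(ranked, relevant, k) for ranked, relevant in runs) / len(runs)
```

With two toy queries, `runs = [(["a", "b", "c"], ["b"]), (["d", "e", "f"], ["x"])]`, the call `mean_over_queries(recall_at_k, runs, 2)` averages a hit for the first query and a miss for the second.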