metadata
library_name: transformers
tags:
- legal
- roberta
- phobert
license: apache-2.0
datasets:
- NghiemAbe/Legal-corpus-indexing
language:
- vi
pipeline_tag: fill-mask
PhoBERT base model for the legal domain
Experiments were performed with Transformers version 4.38.2.
Vi-Legal-PhoBert is a legal-domain model based on vinai/phobert-base-v2. It was further pretrained with token-level masked language modeling (MLM) for 154,600 steps on the Legal Corpus so that the model adapts to the legal domain.
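
To illustrate what token-level masking means, here is a minimal sketch of how an MLM training example might be built. It is not the training code used for this model: the function name `mask_tokens`, the 15% masking probability (the common BERT default), and the example sentence are all assumptions, and the sketch skips the 80/10/10 replacement scheme that standard MLM pipelines typically apply.

```python
import random

MASK_TOKEN = "<mask>"  # RoBERTa-style mask token, as used by PhoBERT tokenizers


def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Randomly replace tokens with the mask token (hypothetical helper).

    Returns (masked_tokens, labels). labels[i] holds the original token at
    masked positions and None elsewhere, so the loss would only be computed
    on the masked positions.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


# Illustrative word-segmented Vietnamese sentence (an assumption, not training data)
sentence = "Người lao_động có quyền đơn_phương chấm_dứt hợp_đồng lao_động".split()
masked_sentence, targets = mask_tokens(sentence, mask_prob=0.15, seed=0)
```

During pretraining the model is asked to recover each entry of `targets` that is not `None` from the surrounding context in `masked_sentence`.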
Usage
Fill mask example:
from transformers import RobertaForMaskedLM, AutoTokenizer

# Load the tokenizer and the masked-LM model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("NghiemAbe/Vi-Legal-PhoBert")
model = RobertaForMaskedLM.from_pretrained("NghiemAbe/Vi-Legal-PhoBert")
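
Loading the model only prepares it. A possible way to actually get predictions for a masked position is the `fill-mask` pipeline from Transformers, sketched below. The helper names `build_masked_text` and `predict_mask` are hypothetical, and the sketch assumes the input is already word-segmented (multi-syllable words joined with `_`), which is what PhoBERT models generally expect.

```python
def build_masked_text(words, position, mask_token="<mask>"):
    """Replace the word at `position` with the mask token and join with spaces."""
    if not 0 <= position < len(words):
        raise IndexError("position is out of range")
    return " ".join(mask_token if i == position else w for i, w in enumerate(words))


def predict_mask(text, top_k=5, model_name="NghiemAbe/Vi-Legal-PhoBert"):
    """Return (token, score) pairs for the single masked position in `text`."""
    # Imported lazily so build_masked_text works without transformers installed.
    from transformers import pipeline

    fill = pipeline("fill-mask", model=model_name)
    # For a text with one mask, the pipeline returns a list of dicts that
    # include the predicted token string and its probability.
    return [(r["token_str"], r["score"]) for r in fill(text, top_k=top_k)]
```

For example, `predict_mask(build_masked_text(["Người", "lao_động", "có", "quyền", "nghỉ", "phép"], 1))` would download the model on first use and return the five most likely fillers for the second word.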
Metric
I evaluated the models on my Dev-Legal-Dataset; the results are:
Model | Parameters | Language Type | Length | R@1 | R@5 | R@10 | R@20 | R@100 | MRR@5 | MRR@10 | MRR@20 | MRR@100 | Accuracy Masked |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
vinai/phobert-base-v2 | 125M | vi | 256 | 0.266 | 0.482 | 0.601 | 0.702 | 0.841 | 0.356 | 0.372 | 0.379 | 0.382 | 0.522 |
FacebookAI/xlm-roberta-base | 279M | mul | 512 | 0.012 | 0.042 | 0.064 | 0.091 | 0.207 | 0.025 | 0.028 | 0.030 | 0.033 | x |
Geotrend/bert-base-vi-cased | 179M | vi | 512 | 0.098 | 0.175 | 0.202 | 0.241 | 0.356 | 0.131 | 0.136 | 0.139 | 0.142 | x |
NlpHUST/roberta-base-vn | x | vi | 512 | 0.050 | 0.097 | 0.126 | 0.163 | 0.369 | 0.071 | 0.076 | 0.078 | 0.083 | x |
aisingapore/sealion-bert-base | x | mul | 512 | 0.002 | 0.007 | 0.021 | 0.036 | 0.106 | 0.003 | 0.005 | 0.006 | 0.008 | x |
Vi-Legal-PhoBert | 125M | vi | 256 | 0.290 | 0.560 | 0.707 | 0.819 | 0.935 | 0.410 | 0.430 | 0.437 | 0.440 | 0.8401 |
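
The model card does not say exactly how R@k and MRR@k were computed. The sketch below uses the standard retrieval definitions as an assumption: R@k is the fraction of relevant documents that appear in the top k results, and MRR@k is the reciprocal rank of the first relevant document within the top k (0 if none appears). The function names and document IDs are hypothetical.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant documents that appear in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = len(set(ranked_ids[:k]) & relevant)
    return hits / len(relevant)


def mrr_at_k(ranked_ids, relevant_ids, k):
    """Reciprocal rank of the first relevant document within the top k."""
    relevant = set(relevant_ids)
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_metric(metric, runs, k):
    """Average a per-query metric over (ranked_ids, relevant_ids) pairs."""
    return sum(metric(ranked, relevant, k) for ranked, relevant in runs) / len(runs)


# Two toy queries with made-up document IDs
runs = [
    (["d3", "d1", "d7"], ["d1"]),  # first relevant hit at rank 2
    (["d5", "d2", "d9"], ["d5"]),  # first relevant hit at rank 1
]
mean_recall_at_1 = mean_metric(recall_at_k, runs, 1)
mean_mrr_at_5 = mean_metric(mrr_at_k, runs, 5)
```

Some evaluations instead report R@k as a hit rate (1 if any relevant document is in the top k), which gives the same value whenever each query has exactly one relevant document.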