PhoBERT base model for the legal domain

Experiments were performed with Transformers version 4.38.2.
Vi-Legal-PhoBert is a model for the legal domain based on vinai/phobert-base-v2. It was further pretrained with token-level masked language modeling (MLM) for 154,600 steps on a legal corpus so that the model adapts to legal text.
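
To illustrate what token-level masking means, here is a minimal sketch in plain Python. It is not the training code used for this model: the function name `mask_tokens`, the 15% masking rate, and the choice to always substitute the mask token are assumptions made for illustration only. The point it shows is that each token is selected independently, as opposed to masking whole words or spans.

```python
import random


def mask_tokens(tokens, mask_token="<mask>", mask_prob=0.15, seed=None):
    """Hypothetical helper: mask each token independently (token-level masking).

    Returns (masked_tokens, labels). labels[i] holds the original token where
    position i was masked and None elsewhere, so a loss would only be computed
    on masked positions.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        # Assumption: a selected token is always replaced by the mask token.
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels
```

In practice, the `DataCollatorForLanguageModeling` class in Transformers handles this step on token ids; the sketch above only shows the idea.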

Usage

Fill mask example:

from transformers import RobertaForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("NghiemAbe/Vi-Legal-PhoBert")
model = RobertaForMaskedLM.from_pretrained("NghiemAbe/Vi-Legal-PhoBert")
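
The snippet above only loads the tokenizer and the model. A minimal sketch of the mask-filling step itself could look like the following. The helper names `insert_mask` and `top_k_mask_predictions` are hypothetical, and the default mask string `<mask>` is an assumption (the value of `tokenizer.mask_token` is authoritative). It also assumes the input is word-segmented in the way the base PhoBERT model expects, with the syllables of a word joined by underscores.

```python
def insert_mask(words, position, mask_token="<mask>"):
    """Hypothetical helper: replace the word at `position` with the mask token."""
    masked = list(words)
    masked[position] = mask_token
    return " ".join(masked)


def top_k_mask_predictions(text, model_name="NghiemAbe/Vi-Legal-PhoBert", k=5):
    """Hypothetical helper: return the k most likely (token, probability) pairs
    for the single mask token in `text`."""
    # Heavy imports are kept local so insert_mask works without them.
    import torch
    from transformers import AutoTokenizer, RobertaForMaskedLM

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = RobertaForMaskedLM.from_pretrained(model_name)
    model.eval()

    inputs = tokenizer(text, return_tensors="pt")
    is_mask = inputs["input_ids"][0] == tokenizer.mask_token_id
    mask_positions = is_mask.nonzero(as_tuple=True)[0]
    if mask_positions.numel() != 1:
        raise ValueError("expected exactly one mask token in the input")

    with torch.no_grad():
        logits = model(**inputs).logits  # shape: (1, seq_len, vocab_size)

    probs = logits[0, mask_positions[0]].softmax(dim=-1)
    top = torch.topk(probs, k)
    return [
        (tokenizer.convert_ids_to_tokens(int(idx)), float(p))
        for p, idx in zip(top.values, top.indices)
    ]


# Example call (downloads the model on first use):
# words = "Người lao_động có quyền đơn_phương chấm_dứt hợp_đồng lao_động".split()
# for token, prob in top_k_mask_predictions(insert_mask(words, 5)):
#     print(token, round(prob, 3))
```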

Metrics

I evaluated the models on my Dev-Legal-Dataset. The results are:

| Model | Parameters | Language Type | Length | R@1 | R@5 | R@10 | R@20 | R@100 | MRR@5 | MRR@10 | MRR@20 | MRR@100 | Accuracy Masked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| vinai/phobert-base-v2 | 125M | vi | 256 | 0.266 | 0.482 | 0.601 | 0.702 | 0.841 | 0.356 | 0.372 | 0.379 | 0.382 | 0.522 |
| FacebookAI/xlm-roberta-base | 279M | mul | 512 | 0.012 | 0.042 | 0.064 | 0.091 | 0.207 | 0.025 | 0.028 | 0.030 | 0.033 | x |
| Geotrend/bert-base-vi-cased | 179M | vi | 512 | 0.098 | 0.175 | 0.202 | 0.241 | 0.356 | 0.131 | 0.136 | 0.139 | 0.142 | x |
| NlpHUST/roberta-base-vn | x | vi | 512 | 0.050 | 0.097 | 0.126 | 0.163 | 0.369 | 0.071 | 0.076 | 0.078 | 0.083 | x |
| aisingapore/sealion-bert-base | x | mul | 512 | 0.002 | 0.007 | 0.021 | 0.036 | 0.106 | 0.003 | 0.005 | 0.006 | 0.008 | x |
| Vi-Legal-PhoBert | 125M | vi | 256 | 0.290 | 0.560 | 0.707 | 0.819 | 0.935 | 0.410 | 0.430 | 0.437 | 0.440 | 0.8401 |
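
The card does not say how R@k and MRR@k were computed. As a rough guide to reading the table, the sketch below uses common definitions: recall@k as the fraction of a query's relevant documents found in the top k results, and MRR@k as the reciprocal rank of the first relevant document within the top k (0 if there is none), each averaged over queries. The function names and the toy document ids are illustrative assumptions, not the evaluation script behind the numbers above.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant documents that appear in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)


def mrr_at_k(ranked_ids, relevant_ids, k):
    """Reciprocal rank of the first relevant document in the top k, else 0."""
    relevant = set(relevant_ids)
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_over_queries(metric, runs, k):
    """Average a per-query metric over (ranked_ids, relevant_ids) pairs."""
    if not runs:
        raise ValueError("runs must contain at least one query")
    return sum(metric(ranked, relevant, k) for ranked, relevant in runs) / len(runs)


# Toy data (made up for illustration): two queries, one relevant document each.
runs = [
    (["d3", "d1", "d2"], ["d1"]),  # relevant document at rank 2
    (["d5", "d6", "d4"], ["d4"]),  # relevant document at rank 3
]
print(mean_over_queries(recall_at_k, runs, 2))  # 0.5
print(mean_over_queries(mrr_at_k, runs, 3))     # (1/2 + 1/3) / 2
```

When every query has exactly one relevant document, this recall@k coincides with the hit rate at k.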
Model size (Safetensors): 135M params, tensor type F32