---
library_name: transformers
tags:
- legal
- roberta
- phobert
license: apache-2.0
datasets:
- NghiemAbe/Legal-corpus-indexing
language:
- vi
pipeline_tag: fill-mask
---

# PhoBERT Base Model for the Legal Domain

**Experiment performed with Transformers version 4.38.2**

Vi-Legal-PhoBert is a legal-domain model based on [vinai/phobert-base-v2](https://huggingface.co/vinai/phobert-base-v2). It was further pretrained with token-level masked language modeling (MLM) for 154,600 steps on the [Legal Corpus](https://huggingface.co/datasets/NghiemAbe/Legal-corpus-indexing) so that the model adapts to the legal domain.

## Usage

Load the tokenizer and model for fill-mask:

```python
from transformers import RobertaForMaskedLM, AutoTokenizer

# Load the tokenizer and the masked-language-model weights from the Hub
tokenizer = AutoTokenizer.from_pretrained("NghiemAbe/Vi-Legal-PhoBert")
model = RobertaForMaskedLM.from_pretrained("NghiemAbe/Vi-Legal-PhoBert")
```

## Metric

I evaluated the models on my [Dev-Legal-Dataset](https://huggingface.co/datasets/NghiemAbe/dev_legal). The results are:

| Model | Parameters | Language Type | Length | R@1 | R@5 | R@10 | R@20 | R@100 | MRR@5 | MRR@10 | MRR@20 | MRR@100 | Accuracy Masked |
|-------------------------------|------------|---------------|--------|-------|-------|-------|-------|-------|-------|--------|--------|---------|---------|
| vinai/phobert-base-v2 | 125M | vi | 256 | 0.266 | 0.482 | 0.601 | 0.702 | 0.841 | 0.356 | 0.372 | 0.379 | 0.382 | 0.522 |
| FacebookAI/xlm-roberta-base | 279M | mul | 512 | 0.012 | 0.042 | 0.064 | 0.091 | 0.207 | 0.025 | 0.028 | 0.030 | 0.033 | x |
| Geotrend/bert-base-vi-cased | 179M | vi | 512 | 0.098 | 0.175 | 0.202 | 0.241 | 0.356 | 0.131 | 0.136 | 0.139 | 0.142 | x |
| NlpHUST/roberta-base-vn | x | vi | 512 | 0.050 | 0.097 | 0.126 | 0.163 | 0.369 | 0.071 | 0.076 | 0.078 | 0.083 | x |
| aisingapore/sealion-bert-base | x | mul | 512 | 0.002 | 0.007 | 0.021 | 0.036 | 0.106 | 0.003 | 0.005 | 0.006 | 0.008 | x |
| **Vi-Legal-PhoBert** | 125M | vi | 256 | **0.290** | **0.560** | **0.707** | **0.819** | **0.935** | **0.410** | **0.430** | **0.437** | **0.440** | **0.8401** |
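
For readers unfamiliar with the R@k and MRR@k columns, here is a minimal sketch of how such retrieval metrics are commonly computed. This card does not describe the exact evaluation script, so the definitions below are assumptions: recall is taken as the fraction of a query's relevant documents found in the top k results, and MRR as the reciprocal rank of the first relevant document within the top k. The function names (`recall_at_k`, `mrr_at_k`, `evaluate`) and the toy document ids are hypothetical.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant ids that appear among the top-k ranked ids."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)


def mrr_at_k(ranked_ids, relevant_ids, k):
    """Reciprocal rank of the first relevant id within the top k, else 0."""
    relevant = set(relevant_ids)
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(runs, ks=(1, 5, 10, 20, 100)):
    """Average both metrics over a list of (ranked_ids, relevant_ids) pairs."""
    results = {}
    for k in ks:
        results[f"R@{k}"] = sum(recall_at_k(r, g, k) for r, g in runs) / len(runs)
        results[f"MRR@{k}"] = sum(mrr_at_k(r, g, k) for r, g in runs) / len(runs)
    return results


# Toy data (hypothetical ids): each entry is (system ranking, gold relevant ids)
runs = [
    (["d3", "d1", "d7"], ["d1"]),
    (["d2", "d9", "d4"], ["d4", "d2"]),
]
scores = evaluate(runs, ks=(1, 3))
print(scores)
```

In practice, the rankings would come from sorting corpus documents by similarity to each query embedding. How this card derived its embeddings and similarity scores is not stated here.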