---
library_name: transformers
tags:
- legal
- roberta
- phobert
license: apache-2.0
datasets:
- NghiemAbe/Legal-corpus-indexing
language:
- vi
pipeline_tag: fill-mask
---

# PhoBERT Base Model for the Legal Domain

**Experiment performed with Transformers version 4.38.2**

Vi-Legal-PhoBert is a legal-domain model based on [vinai/phobert-base-v2](https://huggingface.co/vinai/phobert-base-v2). It was further pretrained with token-level masked language modeling (MLM) for 154,600 steps on the [Legal Corpus](https://huggingface.co/datasets/NghiemAbe/Legal-corpus-indexing) so that the model adapts to the legal domain.

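To make "token-level" masking concrete, here is a minimal pure-Python sketch of how an MLM objective usually corrupts its input. The card does not state the masking hyperparameters, so the 15% selection rate and the 80/10/10 split (mask / random token / unchanged) are assumptions taken from the common BERT and RoBERTa recipe. The function name `mask_tokens` and the toy token IDs are hypothetical.

```python
import random

IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def mask_tokens(token_ids, mask_id, vocab_size, mlm_probability=0.15, rng=None):
    """Return (inputs, labels) for one sequence using token-level masking.

    Assumed recipe (not confirmed by this card): each token is selected
    independently with probability `mlm_probability`; a selected token is
    replaced by the mask token 80% of the time, by a random token 10% of
    the time, and left unchanged otherwise. Special tokens are not
    excluded here, which a real data collator would do.
    """
    rng = rng or random.Random()
    inputs = list(token_ids)
    labels = [IGNORE_INDEX] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() >= mlm_probability:
            continue  # token not selected: no loss is computed at this position
        labels[i] = tok  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)
        # otherwise the original token is kept in the input
    return inputs, labels


# Toy IDs, invented for illustration only.
ids = list(range(10, 30))
inputs, labels = mask_tokens(ids, mask_id=4, vocab_size=100, rng=random.Random(0))
print(inputs)
print(labels)
```

Because each token is masked independently, a multi-token word can end up partially masked. That is the difference from whole-word masking.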
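## Usage

Fill-mask example:

```python
from transformers import AutoTokenizer, RobertaForMaskedLM, pipeline

# Load the tokenizer and the masked-language-model weights from the Hub.
tokenizer = AutoTokenizer.from_pretrained("NghiemAbe/Vi-Legal-PhoBert")
model = RobertaForMaskedLM.from_pretrained("NghiemAbe/Vi-Legal-PhoBert")

# Illustrative sentence only (not from the evaluation set); PhoBERT expects word-segmented input.
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(fill_mask(f"Người lao_động có quyền đơn_phương chấm_dứt {tokenizer.mask_token} lao_động ."))
```
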
## Metrics

I evaluated the models on my [Dev-Legal-Dataset](https://huggingface.co/datasets/NghiemAbe/dev_legal). Here are the results:

| Model | Parameters | Language Type | Max Length | R@1 | R@5 | R@10 | R@20 | R@100 | MRR@5 | MRR@10 | MRR@20 | MRR@100 | Masked Accuracy |
|-------------------------------|------------|---------------|--------|-------|-------|-------|-------|-------|-------|--------|--------|---------|---------|
| vinai/phobert-base-v2 | 125M | vi | 256 | 0.266 | 0.482 | 0.601 | 0.702 | 0.841 | 0.356 | 0.372 | 0.379 | 0.382 | 0.522 |
| FacebookAI/xlm-roberta-base | 279M | mul | 512 | 0.012 | 0.042 | 0.064 | 0.091 | 0.207 | 0.025 | 0.028 | 0.030 | 0.033 | x |
| Geotrend/bert-base-vi-cased | 179M | vi | 512 | 0.098 | 0.175 | 0.202 | 0.241 | 0.356 | 0.131 | 0.136 | 0.139 | 0.142 | x |
| NlpHUST/roberta-base-vn | x | vi | 512 | 0.050 | 0.097 | 0.126 | 0.163 | 0.369 | 0.071 | 0.076 | 0.078 | 0.083 | x |
| aisingapore/sealion-bert-base | x | mul | 512 | 0.002 | 0.007 | 0.021 | 0.036 | 0.106 | 0.003 | 0.005 | 0.006 | 0.008 | x |
| **Vi-Legal-PhoBert** | 125M | vi | 256 | **0.290** | **0.560** | **0.707** | **0.819** | **0.935** | **0.410** | **0.430** | **0.437** | **0.440** | **0.8401** |
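
The card does not say how the R@k and MRR@k columns were computed. As a reading aid, here is a minimal sketch of the standard definitions, assuming each query comes with a ranked list of retrieved document IDs and a set of relevant IDs. The function names and the toy data are hypothetical and are not taken from the evaluation code.

```python
def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = len(set(ranked[:k]) & set(relevant))
    return hits / len(relevant)


def reciprocal_rank_at_k(ranked, relevant, k):
    """1 / rank of the first relevant document in the top k, or 0 if none is found."""
    for rank, doc_id in enumerate(ranked[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_over_queries(metric, runs, k):
    """Average a per-query metric over (ranked, relevant) pairs."""
    return sum(metric(ranked, relevant, k) for ranked, relevant in runs) / len(runs)


# Toy data, invented purely for illustration.
runs = [
    (["d3", "d1", "d7"], {"d1"}),        # first relevant document at rank 2
    (["d2", "d9", "d4"], {"d4", "d8"}),  # one of two relevant documents, at rank 3
]
r_at_3 = mean_over_queries(recall_at_k, runs, 3)
mrr_at_3 = mean_over_queries(reciprocal_rank_at_k, runs, 3)
print(f"R@3 = {r_at_3:.3f}, MRR@3 = {mrr_at_3:.3f}")
```

Averaging `reciprocal_rank_at_k` over all queries gives MRR@k. If the evaluation instead counted a query as a hit whenever any relevant document appears in the top k, the R@k values would be hit rates rather than recall in the sense used here.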