File size: 4,075 Bytes
cc73993 f6154c1 a77924f f6154c1 cc73993 a77924f f6154c1 a77924f f6154c1 a77924f f6154c1 a77924f f6154c1 a77924f f6154c1 a77924f f6154c1 a77924f f6154c1 a77924f f6154c1 a77924f f6154c1 a77924f f6154c1 a77924f f6154c1 a77924f f6154c1 a77924f f6154c1 a77924f f6154c1 a77924f f6154c1 a77924f f6154c1 a77924f f6154c1 a77924f f6154c1 a77924f f6154c1 a77924f f6154c1 a77924f |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 |
---
language:
- pt
thumbnail: "Portugues BERT for the Legal Domain"
tags:
- bert
- pytorch
- tsdae
datasets:
- rufimelo/PortugueseLegalSentences-v1
license: "mit"
widget:
- text: "O advogado apresentou [MASK] ao juíz."
---
# Legal_BERTimbau
## Introduction
Legal_BERTimbau Large is a fine-tuned BERT model based on [BERTimbau](https://huggingface.co/neuralmind/bert-base-portuguese-cased) Large.
"BERTimbau Base is a pretrained BERT model for Brazilian Portuguese that achieves state-of-the-art performances on three downstream NLP tasks: Named Entity Recognition, Sentence Textual Similarity and Recognizing Textual Entailment. It is available in two sizes: Base and Large.
For further information or requests, please go to [BERTimbau repository](https://github.com/neuralmind-ai/portuguese-bert/)."
The performance of Language Models can change drastically when there is a domain shift between training and test data. In order create a Portuguese Language Model adapted to a Legal domain, the original BERTimbau model was submitted to a fine-tuning stage where it was performed 1 "PreTraining" epoch over 200000 cleaned documents (lr: 2e-5, using TSDAE technique)
## Available models
| Model | Arch. | #Layers | #Params |
| ---------------------------------------- | ---------- | ------- | ------- |
|`rufimelo/Legal-BERTimbau-base` |BERT-Base |12 |110M|
| `rufimelo/Legal-BERTimbau-large` | BERT-Large | 24 | 335M |
## Usage
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained("rufimelo/Legal-BERTimbau-large-TSDAE")
model = AutoModelForMaskedLM.from_pretrained("rufimelo/Legal-BERTimbau-large-TSDAE")
```
### Masked language modeling prediction example
```python
from transformers import pipeline
from transformers import AutoTokenizer, AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained("rufimelo/Legal-BERTimbau-large-TSDAE")
model = AutoModelForMaskedLM.from_pretrained("rufimelo/Legal-BERTimbau-large-TSDAE")
pipe = pipeline('fill-mask', model=model, tokenizer=tokenizer)
pipe('O advogado apresentou [MASK] para o juíz')
# [{'score': 0.5034703612327576,
#'token': 8190,
#'token_str': 'recurso',
#'sequence': 'O advogado apresentou recurso para o juíz'},
#{'score': 0.07347951829433441,
#'token': 21973,
#'token_str': 'petição',
#'sequence': 'O advogado apresentou petição para o juíz'},
#{'score': 0.05165359005331993,
#'token': 4299,
#'token_str': 'resposta',
#'sequence': 'O advogado apresentou resposta para o juíz'},
#{'score': 0.04611917585134506,
#'token': 5265,
#'token_str': 'exposição',
#'sequence': 'O advogado apresentou exposição para o juíz'},
#{'score': 0.04068068787455559,
#'token': 19737, 'token_str':
#'alegações',
#'sequence': 'O advogado apresentou alegações para o juíz'}]
```
### For BERT embeddings
```python
import torch
from transformers import AutoModel
model = AutoModel.from_pretrained('rufimelo/Legal-BERTimbau-large-TSDAE')
input_ids = tokenizer.encode('O advogado apresentou recurso para o juíz', return_tensors='pt')
with torch.no_grad():
outs = model(input_ids)
encoded = outs[0][0, 1:-1]
#tensor([[ 0.0328, -0.4292, -0.6230, ..., -0.3048, -0.5674, 0.0157],
#[-0.3569, 0.3326, 0.7013, ..., -0.7778, 0.2646, 1.1310],
#[ 0.3169, 0.4333, 0.2026, ..., 1.0517, -0.1951, 0.7050],
#...,
#[-0.3648, -0.8137, -0.4764, ..., -0.2725, -0.4879, 0.6264],
#[-0.2264, -0.1821, -0.3011, ..., -0.5428, 0.1429, 0.0509],
#[-1.4617, 0.6281, -0.0625, ..., -1.2774, -0.4491, 0.3131]])
```
## Citation
If you use this work, please cite BERTimbau's work:
```bibtex
@inproceedings{souza2020bertimbau,
author = {F{\'a}bio Souza and
Rodrigo Nogueira and
Roberto Lotufo},
title = {{BERT}imbau: pretrained {BERT} models for {B}razilian {P}ortuguese},
booktitle = {9th Brazilian Conference on Intelligent Systems, {BRACIS}, Rio Grande do Sul, Brazil, October 20-23 (to appear)},
year = {2020}
}
```
|