---
language: 
  - pt
thumbnail: "Portugues BERT for the Legal Domain"
tags:
  - bert
  - pytorch
  - tsdae
datasets:
 - rufimelo/PortugueseLegalSentences-v1
license: "mit"
widget:
 - text: "O advogado apresentou [MASK] ao juíz."
---

# Legal_BERTimbau

## Introduction

Legal_BERTimbau Large is a fine-tuned BERT model based on [BERTimbau Large](https://huggingface.co/neuralmind/bert-large-portuguese-cased).

"BERTimbau Base is a pretrained BERT model for Brazilian Portuguese that achieves state-of-the-art performances on three downstream NLP tasks: Named Entity Recognition, Sentence Textual Similarity and Recognizing Textual Entailment. It is available in two sizes: Base and Large.

For further information or requests, please go to [BERTimbau repository](https://github.com/neuralmind-ai/portuguese-bert/)."

The performance of language models can change drastically when there is a domain shift between training and test data. To create a Portuguese language model adapted to the legal domain, the original BERTimbau model went through an additional fine-tuning stage: one "pre-training" epoch over 200,000 cleaned legal documents (learning rate 1e-5), using the TSDAE technique.
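
For reference, this kind of TSDAE stage can be reproduced with the [sentence-transformers](https://www.sbert.net) library. The sketch below is a minimal illustration under assumptions, not the exact training script used for this model: the corpus list, batch size, and output path are placeholders, while the epoch count and learning rate follow the description above.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, datasets, losses

# Hypothetical corpus: one cleaned legal sentence per list entry.
train_sentences = ["O advogado apresentou recurso para o juíz.", "..."]

# Start from the original BERTimbau Large checkpoint with CLS pooling,
# as recommended for TSDAE.
word_embedding_model = models.Transformer("neuralmind/bert-large-portuguese-cased")
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(), "cls")
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# TSDAE: the dataset applies noise (token deletion) to each sentence, and the
# loss trains an encoder-decoder pair to reconstruct the original sentence.
train_dataset = datasets.DenoisingAutoEncoderDataset(train_sentences)
train_dataloader = DataLoader(train_dataset, batch_size=8, shuffle=True)
train_loss = losses.DenoisingAutoEncoderLoss(
    model,
    decoder_name_or_path="neuralmind/bert-large-portuguese-cased",
    tie_encoder_decoder=True,
)

# One epoch at lr 1e-5, matching the description above.
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    optimizer_params={"lr": 1e-5},
    weight_decay=0,
    scheduler="constantlr",
    show_progress_bar=True,
)
model.save("Legal-BERTimbau-large-TSDAE")
```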


## Available models

| Model                                    | Arch.      | #Layers | #Params |
| ---------------------------------------- | ---------- | ------- | ------- |
| `rufimelo/Legal-BERTimbau-base`  | BERT-Base  | 12      | 110M    |
| `rufimelo/Legal-BERTimbau-large` | BERT-Large | 24      | 335M    |

## Usage

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("rufimelo/Legal-BERTimbau-large-TSDAE-v3")
model = AutoModelForMaskedLM.from_pretrained("rufimelo/Legal-BERTimbau-large-TSDAE-v3")
```

### Masked language modeling prediction example

```python
from transformers import pipeline
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("rufimelo/Legal-BERTimbau-large-TSDAE-v3")
model = AutoModelForMaskedLM.from_pretrained("rufimelo/Legal-BERTimbau-large-TSDAE-v3")

pipe = pipeline('fill-mask', model=model, tokenizer=tokenizer)
pipe('O advogado apresentou [MASK] para o juíz')
# [{'score': 0.5034703612327576,
#   'token': 8190,
#   'token_str': 'recurso',
#   'sequence': 'O advogado apresentou recurso para o juíz'},
#  {'score': 0.07347951829433441,
#   'token': 21973,
#   'token_str': 'petição',
#   'sequence': 'O advogado apresentou petição para o juíz'},
#  {'score': 0.05165359005331993,
#   'token': 4299,
#   'token_str': 'resposta',
#   'sequence': 'O advogado apresentou resposta para o juíz'},
#  {'score': 0.04611917585134506,
#   'token': 5265,
#   'token_str': 'exposição',
#   'sequence': 'O advogado apresentou exposição para o juíz'},
#  {'score': 0.04068068787455559,
#   'token': 19737,
#   'token_str': 'alegações',
#   'sequence': 'O advogado apresentou alegações para o juíz'}]
```

### For BERT embeddings

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('rufimelo/Legal-BERTimbau-large-TSDAE-v3')
model = AutoModel.from_pretrained('rufimelo/Legal-BERTimbau-large-TSDAE-v3')

input_ids = tokenizer.encode('O advogado apresentou recurso para o juíz', return_tensors='pt')

with torch.no_grad():
    outs = model(input_ids)
    # hidden states of the input tokens, excluding [CLS] and [SEP]
    encoded = outs[0][0, 1:-1]

# tensor([[ 0.0328, -0.4292, -0.6230, ..., -0.3048, -0.5674,  0.0157],
#         [-0.3569,  0.3326,  0.7013, ..., -0.7778,  0.2646,  1.1310],
#         [ 0.3169,  0.4333,  0.2026, ...,  1.0517, -0.1951,  0.7050],
#         ...,
#         [-0.3648, -0.8137, -0.4764, ..., -0.2725, -0.4879,  0.6264],
#         [-0.2264, -0.1821, -0.3011, ..., -0.5428,  0.1429,  0.0509],
#         [-1.4617,  0.6281, -0.0625, ..., -1.2774, -0.4491,  0.3131]])
```
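
The snippet above yields one 1024-dimensional vector per token (the BERT-Large hidden size). If a single sentence-level vector is needed, one common option, shown here as a suggestion rather than something this model card prescribes, is to average the token embeddings:

```python
# Mean-pool the token vectors from the snippet above into one sentence
# embedding. A simple average is fine here since we encoded a single
# sentence with no padding.
sentence_embedding = encoded.mean(dim=0)  # shape: (1024,)
```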

## Citation

If you use this model, please cite the original BERTimbau work:

```bibtex
@inproceedings{souza2020bertimbau,
  author    = {F{\'a}bio Souza and
               Rodrigo Nogueira and
               Roberto Lotufo},
  title     = {{BERT}imbau: pretrained {BERT} models for {B}razilian {P}ortuguese},
  booktitle = {9th Brazilian Conference on Intelligent Systems, {BRACIS}, Rio Grande do Sul, Brazil, October 20-23 (to appear)},
  year      = {2020}
}
```