	---
	language:
	- pt
	thumbnail: "Portuguese BERT for the Legal Domain"
	tags:
	- bert
	- pytorch
	- tsdae
	datasets:
	- rufimelo/PortugueseLegalSentences-v1
	license: "mit"
	widget:
	- text: "O advogado apresentou [MASK] ao juíz."
	---

	# Legal_BERTimbau

	## Introduction

	Legal_BERTimbau Large is a fine-tuned BERT model based on [BERTimbau](https://huggingface.co/neuralmind/bert-base-portuguese-cased) Large.

	"BERTimbau Base is a pretrained BERT model for Brazilian Portuguese that achieves state-of-the-art performances on three downstream NLP tasks: Named Entity Recognition, Sentence Textual Similarity and Recognizing Textual Entailment. It is available in two sizes: Base and Large.

	For further information or requests, please go to [BERTimbau repository](https://github.com/neuralmind-ai/portuguese-bert/)."

	The performance of language models can change drastically when there is a domain shift between training and test data. To create a Portuguese language model adapted to the legal domain, the original BERTimbau model went through a fine-tuning stage: 1 "pre-training" epoch over 200,000 cleaned documents (learning rate 2e-5, using the TSDAE technique).
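
TSDAE (Transformer-based Sequential Denoising Auto-Encoder) trains the encoder by corrupting each input sentence and having a decoder reconstruct the original from the sentence embedding. The snippet below is a minimal sketch of the corruption step only, not the training code used for this model. It assumes whitespace tokenization and a deletion ratio of 0.6 (the default in the TSDAE paper; the ratio used here is not stated), and `delete_tokens` is a hypothetical helper name.

```python
import random


def delete_tokens(sentence, del_ratio=0.6, seed=None):
    """Return a noisy copy of `sentence` with roughly `del_ratio` of its tokens removed.

    Hypothetical helper for illustration: whitespace tokenization and
    del_ratio=0.6 are assumptions, not settings confirmed for this model.
    """
    rng = random.Random(seed)
    tokens = sentence.split()
    if not tokens:
        return sentence
    # Keep each token independently with probability (1 - del_ratio).
    kept = [tok for tok in tokens if rng.random() > del_ratio]
    # Never return an empty input: fall back to one randomly chosen token.
    if not kept:
        kept = [rng.choice(tokens)]
    return " ".join(kept)


original = "O advogado apresentou recurso para o juíz"
noisy = delete_tokens(original, seed=0)
# During TSDAE training the encoder sees `noisy`, and the decoder
# is trained to reconstruct `original` from the sentence embedding.
print(noisy)
```

Deleting tokens rather than masking them forces the encoder to pack enough meaning into a single vector for the decoder to recover the full sentence.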


	## Available models

	| Model                            | Arch.      | #Layers | #Params |
	| -------------------------------- | ---------- | ------- | ------- |
	| `rufimelo/Legal-BERTimbau-base`  | BERT-Base  | 12      | 110M    |
	| `rufimelo/Legal-BERTimbau-large` | BERT-Large | 24      | 335M    |

	## Usage

	```python
	from transformers import AutoTokenizer, AutoModelForMaskedLM

	tokenizer = AutoTokenizer.from_pretrained("rufimelo/Legal-BERTimbau-large-TSDAE")

	model = AutoModelForMaskedLM.from_pretrained("rufimelo/Legal-BERTimbau-large-TSDAE")
	```

	### Masked language modeling prediction example

	```python
	from transformers import pipeline
	from transformers import AutoTokenizer, AutoModelForMaskedLM

	tokenizer = AutoTokenizer.from_pretrained("rufimelo/Legal-BERTimbau-large-TSDAE")
	model = AutoModelForMaskedLM.from_pretrained("rufimelo/Legal-BERTimbau-large-TSDAE")

	pipe = pipeline('fill-mask', model=model, tokenizer=tokenizer)
	pipe('O advogado apresentou [MASK] para o juíz')
	# [{'score': 0.5034703612327576,
	#   'token': 8190,
	#   'token_str': 'recurso',
	#   'sequence': 'O advogado apresentou recurso para o juíz'},
	#  {'score': 0.07347951829433441,
	#   'token': 21973,
	#   'token_str': 'petição',
	#   'sequence': 'O advogado apresentou petição para o juíz'},
	#  {'score': 0.05165359005331993,
	#   'token': 4299,
	#   'token_str': 'resposta',
	#   'sequence': 'O advogado apresentou resposta para o juíz'},
	#  {'score': 0.04611917585134506,
	#   'token': 5265,
	#   'token_str': 'exposição',
	#   'sequence': 'O advogado apresentou exposição para o juíz'},
	#  {'score': 0.04068068787455559,
	#   'token': 19737,
	#   'token_str': 'alegações',
	#   'sequence': 'O advogado apresentou alegações para o juíz'}]
	```
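
The pipeline returns a list of dictionaries, one per candidate token. As a small sketch of post-processing that output, the hypothetical helper `confident_tokens` below keeps only the candidates at or above a score threshold. The 0.05 threshold is an arbitrary assumption. The sample list reuses rounded values from the output above, so the snippet runs without downloading the model.

```python
def confident_tokens(predictions, min_score=0.05):
    """Return (token_str, score) pairs whose score is at least `min_score`.

    `predictions` is expected to look like the list of dicts returned by
    the fill-mask pipeline. The helper name and threshold are illustrative.
    """
    return [(p["token_str"], p["score"]) for p in predictions if p["score"] >= min_score]


# Rounded copy of the example output above; in practice, pass the pipeline result instead.
predictions = [
    {"score": 0.5035, "token_str": "recurso"},
    {"score": 0.0735, "token_str": "petição"},
    {"score": 0.0517, "token_str": "resposta"},
    {"score": 0.0461, "token_str": "exposição"},
    {"score": 0.0407, "token_str": "alegações"},
]

for token, score in confident_tokens(predictions):
    print(f"{token}: {score:.3f}")
```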

	### For BERT embeddings

	```python
	import torch
	from transformers import AutoModel, AutoTokenizer

	# Load the tokenizer and the bare encoder (no masked-LM head).
	tokenizer = AutoTokenizer.from_pretrained('rufimelo/Legal-BERTimbau-large-TSDAE')
	model = AutoModel.from_pretrained('rufimelo/Legal-BERTimbau-large-TSDAE')
	input_ids = tokenizer.encode('O advogado apresentou recurso para o juíz', return_tensors='pt')

	with torch.no_grad():
	    outs = model(input_ids)
	    # Last hidden state of the single input sequence, dropping [CLS] and [SEP].
	    encoded = outs[0][0, 1:-1]

	# tensor([[ 0.0328, -0.4292, -0.6230, ..., -0.3048, -0.5674,  0.0157],
	#         [-0.3569,  0.3326,  0.7013, ..., -0.7778,  0.2646,  1.1310],
	#         [ 0.3169,  0.4333,  0.2026, ...,  1.0517, -0.1951,  0.7050],
	#         ...,
	#         [-0.3648, -0.8137, -0.4764, ..., -0.2725, -0.4879,  0.6264],
	#         [-0.2264, -0.1821, -0.3011, ..., -0.5428,  0.1429,  0.0509],
	#         [-1.4617,  0.6281, -0.0625, ..., -1.2774, -0.4491,  0.3131]])
	```
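
`encoded` holds one vector per token. A common way to compare sentences is to average the token vectors into a single sentence vector and measure cosine similarity. The sketch below shows that arithmetic in plain Python on toy vectors, so it runs without the model. `mean_pool` and `cosine_similarity` are hypothetical helper names, and mean pooling is an assumed choice, since this README does not prescribe a pooling strategy. With real output, `encoded.tolist()` would supply the token vectors.

```python
import math


def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one sentence vector."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between vectors `a` and `b`."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 3-dimensional "token embeddings" standing in for real model output.
sentence_a = [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0]]
sentence_b = [[2.0, 1.0, 1.0], [2.0, 1.0, 1.0]]

vec_a = mean_pool(sentence_a)  # [2.0, 1.0, 1.0]
vec_b = mean_pool(sentence_b)  # [2.0, 1.0, 1.0]
print(cosine_similarity(vec_a, vec_b))
```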

	## Citation

	If you use this work, please cite BERTimbau's work:

	```bibtex
	@inproceedings{souza2020bertimbau,
	  author    = {F{\'a}bio Souza and
	               Rodrigo Nogueira and
	               Roberto Lotufo},
	  title     = {{BERT}imbau: pretrained {BERT} models for {B}razilian {P}ortuguese},
	  booktitle = {9th Brazilian Conference on Intelligent Systems, {BRACIS}, Rio Grande do Sul, Brazil, October 20-23 (to appear)},
	  year      = {2020}
	}
	```