---
license: mit
language:
- pt
tags:
- t5
- ul2
- pt
- pt-br
datasets:
- allenai/c4
library_name: transformers
---

## Model Details

### Model Description

ULT5-pt is a T5-v1.1 architecture model trained with the UL2 framework ([Unifying Language Learning Paradigms](https://arxiv.org/abs/2205.05131v1)), which uses Mixture-of-Denoisers (MoD) to combine a Causal Language Modeling (CLM) objective with Span Corruption.

| Model                                    | Parameters  |
| :-:                                      |  :-:      |
| [thacio/ult5-pt-small](https://huggingface.co/thacio/ult5-pt-small) | 82.4M |

- **Developed by:** Thacio Garcia Scandaroli
- **Model type:** T5
- **Language(s) (NLP):** Portuguese
- **License:** MIT


## Pretraining and model characteristics

The model was trained on a portion of the Portuguese C4 corpus using UL2 (https://huggingface.co/google/ul2) with R-Denoising, S-Denoising, and X-Denoising, and a dropout rate of 0.0.
Unlike the original UL2 work, no prefix token was used for S-Denoising. For R-Denoising and X-Denoising, the tokens '<|NLU|>' and '<|NLG|>' were used, respectively.

A context window of 1024 tokens was used. A GPT2 tokenizer with a Portuguese vocabulary trained on Wikipedia was also used, which increases the amount of text that can be processed.
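
To make the denoising setup more concrete, here is a minimal sketch of how an R-Denoising-style (span corruption) training pair could be built with the '<|NLU|>' prefix. This is not the training code used for ULT5-pt: the sentinel names (`<extra_id_0>`, ...), the word-level splitting, the fixed span positions, and the helper name `corrupt_spans` are all assumptions made for illustration.

```python
def corrupt_spans(words, spans, prefix="<|NLU|>"):
    """Build an (input, target) pair by masking the given word spans.

    words  : list of word strings (real training operates on token ids)
    spans  : sorted, non-overlapping (start, end) index pairs, end exclusive
    prefix : mode token prepended to the input ('<|NLU|>' for R-Denoising)

    The sentinel format '<extra_id_N>' is a hypothetical placeholder and
    not necessarily what the ULT5-pt tokenizer uses.
    """
    inputs, targets = [prefix], []
    cursor = 0
    for n, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{n}>"
        inputs.extend(words[cursor:start])  # keep the text before the span
        inputs.append(sentinel)             # replace the span with a sentinel
        targets.append(sentinel)            # the target repeats the sentinel...
        targets.extend(words[start:end])    # ...followed by the masked words
        cursor = end
    inputs.extend(words[cursor:])           # keep the text after the last span
    return " ".join(inputs), " ".join(targets)


words = "O modelo foi treinado com uma parte do corpus C4".split()
source, target = corrupt_spans(words, spans=[(1, 3), (6, 8)])
print(source)  # <|NLU|> O <extra_id_0> treinado com uma <extra_id_1> corpus C4
print(target)  # <extra_id_0> modelo foi <extra_id_1> parte do
```

In UL2, X-Denoising differs from R-Denoising mainly in using longer spans or a higher corruption rate; in this sketch that would correspond to passing wider `spans` and `prefix="<|NLG|>"`.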

## Uses

Fine-tuning is the recommended use for the model.

A tutorial (in Portuguese) in notebook format for fine-tuning decoder and encoder-decoder (T5) models was provided: [Fine-tune Large Language Models](link here).

Span corruption modes can be activated by adding the prefixes '<|NLU|>' and '<|NLG|>' to the beginning of the text.
The UL2 authors point out a possible difference in fine-tuning results depending on the activated mode.
However, for ult5-pt, no difference was observed in benchmark tests.
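
As a rough illustration of the prefixing described above, the sketch below prepends a mode token to each input text before tokenization. The helper name `add_mode_prefix` and the `MODE_PREFIXES` mapping are hypothetical, and joining prefix and text without a separator is an assumption; only the two prefix strings come from this model card.

```python
# Mode tokens as named in this model card.
MODE_PREFIXES = {"nlu": "<|NLU|>", "nlg": "<|NLG|>"}


def add_mode_prefix(text, mode=None):
    """Prepend the prefix token for `mode`, or return `text` unchanged.

    mode=None leaves the text without a prefix. Concatenating the prefix
    directly to the text (no space) is an assumption of this sketch.
    """
    if mode is None:
        return text
    try:
        prefix = MODE_PREFIXES[mode.lower()]
    except KeyError:
        raise ValueError(f"unknown mode: {mode!r}") from None
    return prefix + text


examples = ["Qual é a capital do Brasil?"]
print([add_mode_prefix(t, "nlu") for t in examples])
# ['<|NLU|>Qual é a capital do Brasil?']
```

Applying the same mode consistently to training and inference inputs keeps the two aligned, which matters if the chosen mode turns out to affect results on a given task.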

### Direct Use

Example of text generation with `top_k` of 30:

```python
from transformers import GPT2TokenizerFast, AutoModelForSeq2SeqLM

tokenizer = GPT2TokenizerFast.from_pretrained("thacio/ult5-pt-small")
model = AutoModelForSeq2SeqLM.from_pretrained("thacio/ult5-pt-small")

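text = 'Um modelo de linguagem é um sistema de inteligência artificial que'

# Sample up to 30 new tokens, drawing each one from the 30 most likely candidates.
input_ids = tokenizer.encode(text, return_tensors='pt')
pred = model.generate(
    input_ids,
    max_new_tokens=30,
    eos_token_id=tokenizer.eos_token_id,
    top_k=30,
    do_sample=True,
)
print('input:', text)
print('generated:', tokenizer.batch_decode(pred, skip_special_tokens=True))
# Example output (sampling is random, so results vary between runs):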
# input: Um modelo de linguagem é um sistema de inteligência artificial que
# generated: [' geraria a quantidade de informações por clique. Além das capacidades humanas, elas seriam muito mais produtivas do que as do cérebro humano.\nO que']
```

Embeddings:

```python
from transformers import T5EncoderModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("thacio/ult5-pt-small")
model = T5EncoderModel.from_pretrained("thacio/ult5-pt-small")

text = 'Um modelo de linguagem é um sistema de inteligência artificial que aprende a gerar ou processar texto baseado em exemplos de treinamento.'
input_ids = tokenizer(text, return_tensors="pt").input_ids
outputs = model(input_ids)
last_hidden_states = outputs.last_hidden_state
print(last_hidden_states)

# tensor([[[-2.4537e-01,  7.9853e-02,  6.6387e-02,  ...,  1.8083e-01,
#           -4.8941e-02,  5.1888e-03],
#          [-3.0077e-01, -3.1949e-05, -1.9432e-01,  ..., -2.7167e-01,
#            3.8779e-02, -1.3541e-01],
#          [ 8.8356e-05,  3.6444e-03,  2.4887e-04,  ...,  1.3219e-03,
#            2.2221e-03,  1.1144e-03],
#          ...,
#          [-4.5300e-02, -4.6213e-02, -5.2453e-02,  ...,  1.7336e-01,
#           -2.6955e-02, -7.8869e-02],
#          [ 8.0028e-03, -9.6458e-02, -2.1417e-01,  ...,  5.1064e-01,
#           -1.0858e-03, -2.7367e-02],
#          [ 1.0856e-01,  4.4607e-02, -1.4409e-02,  ...,  6.7812e-02,
#            5.6911e-02,  1.2650e-01]]], grad_fn=<MulBackward0>)
```
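
The encoder returns one vector per token. If a single vector per text is needed, one common option is mean pooling over the token vectors; this model card does not prescribe a pooling strategy, so the following is only a dependency-free sketch of the idea, and the helper name `mean_pool` is hypothetical.

```python
def mean_pool(token_vectors, attention_mask=None):
    """Average token vectors into one vector, skipping masked (0) positions."""
    if attention_mask is None:
        attention_mask = [1] * len(token_vectors)
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("no unmasked tokens to pool")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Toy input: 3 tokens with 2 dimensions each; the last token is padding.
toy_states = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
print(mean_pool(toy_states, attention_mask=[1, 1, 0]))  # [2.0, 3.0]
```

With the embeddings snippet above, `mean_pool(last_hidden_states[0].detach().tolist())` would give one vector for the sentence; for a single unpadded input, `last_hidden_states.mean(dim=1)` computes the same average directly on the tensor.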

## Bias, Risks, and Limitations

The same risks, biases, and limitations of other language models apply to this one, as pointed out for [GPT-2](https://huggingface.co/gpt2).

## Citation

```bibtex
@misc{ult5-pt2023,
  author = {Thacio Garcia Scandaroli},
  title = {ULT5-pt: Portuguese Language Model trained with UL2},
  year = {2023},
}
```