---
license: cc-by-nc-nd-4.0
language:
- es
pipeline_tag: text-generation
tags:
- dialogue
- conversational
- gpt
- gpt2
- text-generation
- spanish
- dialogpt
- chitchat
- ITG
inference: false
---
# DialoGPT-medium-spanish-chitchat
## Description
This is a **transformer-decoder** [GPT-2 model](https://huggingface.co/gpt2) adapted for the **single-turn dialogue task in Spanish**. We fine-tuned Microsoft's 345M-parameter [DialoGPT-medium](https://huggingface.co/microsoft/DialoGPT-medium) model following the CLM (Causal Language Modelling) objective.
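Concretely, under the CLM objective each single-turn example can be serialized as `question <eos> answer <eos>`, and the model learns to predict every next token of that sequence. A minimal sketch of this setup with the Hugging Face `transformers` API (the `serialize_turn` helper is our illustration, not the exact training code used for this card):
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-medium")

def serialize_turn(question: str, answer: str) -> str:
    # One single-turn training example: question and answer, each closed with EOS
    return question + tokenizer.eos_token + answer + tokenizer.eos_token

example = serialize_turn("¿Qué tal estás?", "Fenomenal, gracias.")
input_ids = tokenizer(example, return_tensors="pt").input_ids

# Under the CLM objective the labels are the inputs themselves;
# GPT-2 shifts them internally so token t predicts token t+1.
loss = model(input_ids, labels=input_ids).loss
```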
---
## Dataset
We used one of the datasets available in the [Bot Framework Tools repository](https://github.com/microsoft/botframework-cli): the [professional-style personality chat dataset in Spanish](https://github.com/microsoft/botframework-cli/blob/main/packages/qnamaker/docs/chit-chat-dataset.md). The file is available [to download here](https://qnamakerstore.blob.core.windows.net/qnamakerdata/editorial/spanish/qna_chitchat_professional.tsv).
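As a rough sketch of loading that file, assuming the TSV exposes `Question` and `Answer` columns (check the header of your download, as the column names are our assumption):
```python
import pandas as pd

# Hypothetical loading sketch; the column names are an assumption about the TSV header.
df = pd.read_csv("qna_chitchat_professional.tsv", sep="\t")
pairs = list(zip(df["Question"], df["Answer"]))
print(f"{len(pairs)} question/answer pairs, e.g. {pairs[0]}")
```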
---
## Example inference script
### Example script to run our model in inference mode
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
CHAT_TURNS = 5
MAX_LENGTH = 1000
model = AutoModelForCausalLM.from_pretrained('ITG/DialoGPT-medium-spanish-chitchat')
tokenizer = AutoTokenizer.from_pretrained('ITG/DialoGPT-medium-spanish-chitchat')
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)
for i in range(CHAT_TURNS):
    user_input = input(f"Step - {i} >> user prompt -> ")
    with torch.no_grad():
        # User turn, where "user_input" is the question (single-turn dialogue task)
        user_inputs_ids = tokenizer.encode(user_input + tokenizer.eos_token, return_tensors="pt")
        user_inputs_ids = user_inputs_ids.to(device)
        # Generate the answer; the output contains the prompt followed by the new tokens
        chat_history = model.generate(user_inputs_ids, max_length=MAX_LENGTH, pad_token_id=tokenizer.eos_token_id)
        # Decode just the generated output tokens (do not include the user prompt again)
        step_model_answer = tokenizer.decode(chat_history[:, user_inputs_ids.shape[-1]:][0], skip_special_tokens=True)
    print(f"Step - {i} >> DialoGPT-spanish model answer -> {step_model_answer}")
```
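The script above uses greedy decoding. If you want more varied answers, `generate` also accepts sampling parameters; a drop-in replacement for the `model.generate` call above (the values below are illustrative, not tuned for this model):
```python
chat_history = model.generate(
    user_inputs_ids,
    max_length=MAX_LENGTH,
    pad_token_id=tokenizer.eos_token_id,
    do_sample=True,    # sample from the distribution instead of greedy argmax
    top_k=50,          # restrict sampling to the 50 most likely next tokens
    top_p=0.95,        # nucleus sampling over the top 95% probability mass
    temperature=0.8,   # values below 1.0 sharpen the distribution
)
```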
---
## Examples
### Single-turn generation examples from an interactive environment
|Role | Response |
|---------|--------|
|User | Hola, buenos dias |
| Bot | Buenos días|
|User | ¿Qué tal estás? |
| Bot | Fenomenal, gracias. |
|User | ¡Tengo hambre! |
| Bot | Quizá picar algo te ayude. |
|User | Estoy triste |
| Bot | Siento escuchar eso. |
|User | ¿Tienes personalidad? |
| Bot | Soy un bot. |
|User | ¡Cuenta un chiste! |
| Bot | En realidad no tengo tanta gracia. |
|User | Jajajajajajajaja |
| Bot | ¡Me alegro de que te rías! |
---
## Fine-tuning hyper-parameters
| **Hyper-parameter** | **Value** |
|:----------------------------------------:|:---------------------------:|
| Validation partition (%) | 20% |
| Training batch size | 8 |
| Learning rate | 5e-4 |
| Max training epochs | 20 |
| Warmup training steps (%) | 6% |
| Weight decay | 0.01 |
| Optimiser (beta1, beta2, epsilon) | AdamW (0.9, 0.999, 1e-08) |
| Monitoring metric (delta, patience) | Validation loss (0.1, 3) |
## Fine-tuning in a different dataset or style
If you want to fine-tune your own dialogue model, we recommend starting from the [DialoGPT model](https://huggingface.co/microsoft/DialoGPT-medium).
You can check the [original GitHub repository](https://github.com/microsoft/DialoGPT).
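As a rough, non-authoritative sketch of what a comparable fine-tuning run could look like with the `Trainer` API, mapping the hyper-parameters from the table above (`train_dataset` and `eval_dataset` are hypothetical, assumed to be already-tokenized `Dataset` objects for the 80%/20% split):
```python
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, EarlyStoppingCallback,
                          Trainer, TrainingArguments)

model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-medium")
tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium")

args = TrainingArguments(
    output_dir="dialogpt-es-chitchat",   # hypothetical output path
    per_device_train_batch_size=8,       # training batch size
    learning_rate=5e-4,
    num_train_epochs=20,                 # max training epochs
    warmup_ratio=0.06,                   # 6% warmup steps
    weight_decay=0.01,
    adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-8,
    evaluation_strategy="epoch",         # monitor validation loss each epoch
    save_strategy="epoch",
    load_best_model_at_end=True,         # required by EarlyStoppingCallback
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,         # hypothetical: tokenized 80% split
    eval_dataset=eval_dataset,           # hypothetical: tokenized 20% split
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3,
                                     early_stopping_threshold=0.1)],
)
trainer.train()
```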
## Limitations
- This model uses the original English-based tokenizer from the [GPT-2 paper](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf).
The tokenizer was not trained on Spanish text, but the grammatical similarities between English and Spanish mean the learned encodings still cover Spanish reasonably well, which may help the model transfer knowledge from English to Spanish.
Moreover, the byte-level BPE (Byte Pair Encoding) implementation of the GPT-2 tokenizer **can assign a representation to every Unicode string** (see the tokenizer probe after this list).
**From the GPT-2 paper**:
> Since our approach can assign a probability to any Unicode string, this allows us to evaluate our LMs on any dataset regardless of pre-processing, tokenization, or vocab size.
- This model is intended to be used **just for single-turn chitchat conversations in Spanish**.
- This model's generation capabilities are limited by the scope of the fine-tuning dataset described above.
- This model generates short answers, providing general-context dialogue in a professional style for the Spanish language.
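To see the byte-level BPE behaviour from the first limitation in action, here is a small probe (the exact token pieces you get depend on the tokenizer version):
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ITG/DialoGPT-medium-spanish-chitchat")
text = "¿Qué tal estás?"
print(tokenizer.tokenize(text))  # accented characters split into byte-level BPE pieces
print(tokenizer.decode(tokenizer.encode(text)))  # decoding round-trips the Unicode string
```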