base_model:
- BSC-LT/salamandra-7b-instruct
datasets:
- alinia/EADOP-RAG-out-of-domain
language:
- ca
- es
library_name: transformers
license: apache-2.0
pipeline_tag: text-generation
tags:
- legal
Salamandra 7B aligned EADOP Model Card
Salamandra 7B aligned EADOP is a fully finetuned version of Salamandra Instruct 7B, a model from the Language Technologies Unit of the Barcelona Supercomputing Center (BSC). The finetuning focuses on improving the handling of out-of-domain questions in a RAG instruction-following setting.
The model has been finetuned on a dataset of more than 2,000 human-annotated in-domain and out-of-domain user messages and assistant responses, in the context of a chatbot that provides helpful information about current Catalan legislation. The dataset, alinia/EADOP-RAG-out-of-domain, was collected in collaboration with the Entitat Autònoma del Diari Oficial i de Publicacions (EADOP) and consists of user messages and assistant responses in Catalan and Spanish.
DISCLAIMER: This model is a proof of concept designed to demonstrate how finetuning an instruction model on a small dataset of out-of-domain questions affects its ability to politely and informatively refuse to answer out-of-domain questions. As a proof of concept, the model is still prone to generating harmful or inappropriate content.
Model Details
Please refer to the Salamandra Instruct 7B model details for specifics on the model architecture and pretraining.
Intended Use
This model was developed as a proof of concept to demonstrate how finetuning an instruction model on a small dataset of in-domain and out-of-domain questions affects its ability to politely and informatively refuse to answer out-of-domain questions in the context of a domain-specific RAG-based chatbot.
How to use
This model uses ChatML, the same instruction-following conversation format as the base model.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "projecte-aina/salamandra-7b-aligned-EADOP"
text = "Quina és la finalitat del Servei Meteorològic de Catalunya?"

# Load the tokenizer and the model weights in bfloat16.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16
)

# Wrap the question as a single user turn and render it with the chat template.
message = [ { "role": "user", "content": text } ]
prompt = tokenizer.apply_chat_template(
    message,
    tokenize=False,
    add_generation_prompt=True
)

# The template already inserts the special tokens, so do not add them again.
inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
outputs = model.generate(input_ids=inputs.to(model.device), max_new_tokens=200)

# The decoded text contains the prompt followed by the generated answer.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
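For decoder-only models, generate returns the prompt tokens followed by the newly generated ones, so the decoded text above repeats the question. A minimal sketch of keeping only the completion is shown below. The helper name completion_only and the token IDs are hypothetical and chosen for illustration, and plain Python lists stand in for tensors so the example runs without downloading the model.

```python
from typing import List


def completion_only(output_ids: List[int], prompt_length: int) -> List[int]:
    """Drop the first prompt_length token IDs, keeping only the generated ones."""
    return output_ids[prompt_length:]


# Made-up token IDs: four prompt tokens followed by three generated tokens.
prompt_ids = [101, 7, 8, 9]
output_ids = prompt_ids + [42, 43, 2]

new_ids = completion_only(output_ids, len(prompt_ids))
print(new_ids)
```

With the tensors from the snippet above, the equivalent slice would be outputs[0][inputs.shape[-1]:], which can then be passed to tokenizer.decode.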
Using this template, each turn is preceded by a <|im_start|> delimiter and the role of the entity (either user, for content supplied by the user, or assistant, for LLM responses), and finished with the <|im_end|> token.
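To make the turn layout concrete, the sketch below renders a conversation in a plain ChatML layout by hand. It is an illustration only: the function to_chatml is hypothetical, and the model's actual chat template (applied through tokenizer.apply_chat_template) is authoritative and may add elements such as a system turn that this sketch omits.

```python
from typing import Dict, List

IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def to_chatml(messages: List[Dict[str, str]], add_generation_prompt: bool = True) -> str:
    """Render messages in a plain ChatML layout (assumed, not the model's exact template)."""
    # Each turn: start delimiter, role, newline, content, end token, newline.
    parts = [f"{IM_START}{m['role']}\n{m['content']}{IM_END}\n" for m in messages]
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its answer.
        parts.append(f"{IM_START}assistant\n")
    return "".join(parts)


messages = [
    {"role": "user", "content": "Quina és la finalitat del Servei Meteorològic de Catalunya?"}
]
prompt = to_chatml(messages)
print(prompt)
```

Comparing this string with the output of tokenizer.apply_chat_template(..., tokenize=False) is a quick way to see what the real template adds.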
Finetuning Data
Please refer to alinia/EADOP-RAG-out-of-domain for the Dataset Card.
Author
This model has been finetuned by Alinia AI.
Contact
For further information, please email [email protected].
Copyright
Copyright (c) 2024 by Language Technologies Unit, Barcelona Supercomputing Center.
License
Apache-2.0
Funding
This work has been promoted and financed by the Generalitat de Catalunya through the Aina project.
Acknowledgements
The data collection process was supported by the Entitat Autònoma del Diari Oficial i de Publicacions (EADOP).