mav23 committed on
Commit
43ac42a
1 Parent(s): a433bce

Upload folder using huggingface_hub

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ salamandra-7b-aligned-eadop.Q4_0.gguf filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,107 @@
+ ---
+ base_model:
+ - BSC-LT/salamandra-7b-instruct
+ datasets:
+ - alinia/EADOP-RAG-out-of-domain
+ language:
+ - ca
+ - es
+ library_name: transformers
+ license: apache-2.0
+ pipeline_tag: text-generation
+ tags:
+ - legal
+ ---
+
+ # Salamandra 7B aligned EADOP Model Card
+ Salamandra 7B aligned EADOP is a fully finetuned version of
+ [Salamandra Instruct 7B](https://huggingface.co/BSC-LT/salamandra-7b-instruct), the instruct model from the
+ [Language Technologies Unit](https://huggingface.co/BSC-LT) of the Barcelona Supercomputing Center,
+ focused on improving the handling of out-of-domain questions in a RAG instruction-following setting.
+
+ The model has been finetuned on a dataset of 2,000+ human-annotated in-
+ and out-of-domain user messages and assistant responses in the context of a chatbot that
+ provides helpful information about current Catalan legislation.
+ The dataset, [alinia/EADOP-RAG-out-of-domain](https://huggingface.co/datasets/alinia/EADOP-RAG-out-of-domain),
+ was collected in collaboration with the
+ [Entitat Autònoma del Diari Oficial i de Publicacions (EADOP)](https://dogc.gencat.cat/ca/sobre-el-dogc/eadop/)
+ and consists of user messages and assistant responses in Catalan and Spanish.
+
+ > [!WARNING]
+ > **DISCLAIMER:** This model is a proof-of-concept designed to demonstrate the effect of
+ > finetuning an instruction model with a small dataset of out-of-domain questions on the model's
+ > ability to politely and informatively refuse to answer questions that are out-of-domain.
+ > As a proof-of-concept, the model is still prone to generate harmful or inappropriate content.
+ ---
+
+ ## Model Details
+ Please refer to the [Salamandra Instruct 7B model details](https://huggingface.co/BSC-LT/salamandra-7b-instruct#model-details)
+ for details about the model architecture and pretraining.
+
+ ## Intended Use
+ This model was developed as a proof-of-concept to demonstrate the effect of finetuning
+ an instruction model with a small dataset of in- and out-of-domain questions on the model's
+ ability to politely and informatively refuse to answer questions that are out-of-domain in
+ the context of a domain-specific RAG-based chatbot.
+
+ ## How to use
+
+ This model uses ChatML, the same instruction-following conversation format as the base model.
+
+ ```python
+ import torch
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+
+ model_id = "projecte-aina/salamandra-7b-aligned-EADOP"
+
+ # Example in-domain question: "What is the purpose of the Meteorological Service of Catalonia?"
+ text = "Quina és la finalitat del Servei Meteorològic de Catalunya?"
+
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_id,
+     device_map="auto",
+     torch_dtype=torch.bfloat16
+ )
+
+ message = [{"role": "user", "content": text}]
+
+ # Render the conversation with the model's ChatML chat template and append the assistant header.
+ prompt = tokenizer.apply_chat_template(
+     message,
+     tokenize=False,
+     add_generation_prompt=True
+ )
+
+ inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
+ outputs = model.generate(input_ids=inputs.to(model.device), max_new_tokens=200)
+
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+ ```
+
+ With this template, each turn begins with the `<|im_start|>` delimiter followed by the role of the entity
+ (either `user`, for content supplied by the user, or `assistant` for LLM responses), and ends with the `<|im_end|>` token.
+
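+ For illustration, here is a minimal sketch of how a hypothetical multi-turn conversation is rendered by this
+ template (the exact output, such as any default system turn, may differ slightly):
+
+ ```python
+ from transformers import AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained("projecte-aina/salamandra-7b-aligned-EADOP")
+
+ # Hypothetical conversation, used only to illustrate the ChatML structure.
+ messages = [
+     {"role": "user", "content": "Quina és la finalitat del Servei Meteorològic de Catalunya?"},
+     {"role": "assistant", "content": "La seva finalitat és..."},
+     {"role": "user", "content": "Gràcies, pots citar la norma aplicable?"},
+ ]
+
+ print(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
+ # Roughly:
+ # <|im_start|>user
+ # Quina és la finalitat del Servei Meteorològic de Catalunya?<|im_end|>
+ # <|im_start|>assistant
+ # La seva finalitat és...<|im_end|>
+ # <|im_start|>user
+ # Gràcies, pots citar la norma aplicable?<|im_end|>
+ # <|im_start|>assistant
+ ```
+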
+ ---
+
+ ## Finetuning Data
+ Please refer to [alinia/EADOP-RAG-out-of-domain](https://huggingface.co/datasets/alinia/EADOP-RAG-out-of-domain) for the Dataset Card.
+
+ ### Author
+ This model has been finetuned by [Alinia AI](https://alinia.ai/).
+
+ ### Contact
+ For further information, please email [[email protected]](mailto:[email protected]).
+
+ ### Copyright
+ Copyright (c) 2024 by the Language Technologies Unit, Barcelona Supercomputing Center.
+
+ ### License
+ Apache-2.0
+
+ ### Funding
+ This work has been promoted and financed by the Generalitat de Catalunya through the Aina project.
+
+ ### Acknowledgements
+ The data collection process was supported by the [Entitat Autònoma del Diari Oficial i de Publicacions (EADOP)](https://dogc.gencat.cat/ca/sobre-el-dogc/eadop/).
salamandra-7b-aligned-eadop.Q4_0.gguf ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:5e4cb2fa7652f00051e2e9d5117f387930248f75aba70eff1fd57583f5f29e0b
+ size 4647269536