mav23 committed on
Commit
43ac42a
1 Parent(s): a433bce

Upload folder using huggingface_hub

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ salamandra-7b-aligned-eadop.Q4_0.gguf filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,107 @@
+ ---
+ base_model:
+ - BSC-LT/salamandra-7b-instruct
+ datasets:
+ - alinia/EADOP-RAG-out-of-domain
+ language:
+ - ca
+ - es
+ library_name: transformers
+ license: apache-2.0
+ pipeline_tag: text-generation
+ tags:
+ - legal
+ ---
+
+ # Salamandra 7B aligned EADOP Model Card
+ Salamandra 7B aligned EADOP is a fully finetuned version of
+ [Salamandra Instruct 7B](https://huggingface.co/BSC-LT/salamandra-7b-instruct), the instruct model from the
+ [Language Technologies Unit](https://huggingface.co/BSC-LT) of the Barcelona Supercomputing Center,
+ focused on improving the handling of out-of-domain questions in a RAG instruction-following setting.
+
+ The model has been finetuned on a dataset of 2,000+ human-annotated in-
+ and out-of-domain user messages and assistant responses in the context of a chatbot that
+ provides helpful information about current Catalan legislation.
+ The dataset, [alinia/EADOP-RAG-out-of-domain](https://huggingface.co/datasets/alinia/EADOP-RAG-out-of-domain),
+ was collected in collaboration with the
+ [Entitat Autònoma del Diari Oficial i de Publicacions (EADOP)](https://dogc.gencat.cat/ca/sobre-el-dogc/eadop/)
+ and consists of user messages and assistant responses in Catalan and Spanish.
+
+ > [!WARNING]
+ > **DISCLAIMER:** This model is a proof-of-concept designed to demonstrate the effect of
+ > finetuning an instruction model with a small dataset of out-of-domain questions on the model's
+ > ability to politely and informatively refuse to answer questions that are out-of-domain.
+ > As a proof-of-concept, the model is still prone to generate harmful or inappropriate content.
+ ---
+
+ ## Model Details
+ Please refer to the [Salamandra Instruct 7B model details](https://huggingface.co/BSC-LT/salamandra-7b-instruct#model-details)
+ for details about the model architecture and pretraining.
+
+ ## Intended Use
+ This model was developed as a proof-of-concept to demonstrate the effect of finetuning
+ an instruction model with a small dataset of in- and out-of-domain questions on the model's
+ ability to politely and informatively refuse to answer questions that are out-of-domain in
+ the context of a domain-specific RAG-based chatbot.
+
+ ## How to use
+
+ This model uses ChatML, the same instruction-following conversation format as the base model.
+
+ ```python
+ import torch
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+
+ model_id = "projecte-aina/salamandra-7b-aligned-EADOP"
+
+ # Example in-domain question: "What is the purpose of the Meteorological Service of Catalonia?"
+ text = "Quina és la finalitat del Servei Meteorològic de Catalunya?"
+
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_id,
+     device_map="auto",
+     torch_dtype=torch.bfloat16
+ )
+
+ message = [{"role": "user", "content": text}]
+
+ # Render the conversation with the model's ChatML chat template and append the assistant header.
+ prompt = tokenizer.apply_chat_template(
+     message,
+     tokenize=False,
+     add_generation_prompt=True
+ )
+
+ inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
+ outputs = model.generate(input_ids=inputs.to(model.device), max_new_tokens=200)
+
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+ ```
+
+ With this template, each turn begins with the `<|im_start|>` delimiter followed by the role of the entity
+ (either `user`, for content supplied by the user, or `assistant` for LLM responses), and ends with the `<|im_end|>` token.
+
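+ For illustration, here is a minimal sketch of how a hypothetical multi-turn conversation is rendered by this
+ template (the exact output, such as any default system turn, may differ slightly):
+
+ ```python
+ from transformers import AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained("projecte-aina/salamandra-7b-aligned-EADOP")
+
+ # Hypothetical conversation, used only to illustrate the ChatML structure.
+ messages = [
+     {"role": "user", "content": "Quina és la finalitat del Servei Meteorològic de Catalunya?"},
+     {"role": "assistant", "content": "La seva finalitat és..."},
+     {"role": "user", "content": "Gràcies, pots citar la norma aplicable?"},
+ ]
+
+ print(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
+ # Roughly:
+ # <|im_start|>user
+ # Quina és la finalitat del Servei Meteorològic de Catalunya?<|im_end|>
+ # <|im_start|>assistant
+ # La seva finalitat és...<|im_end|>
+ # <|im_start|>user
+ # Gràcies, pots citar la norma aplicable?<|im_end|>
+ # <|im_start|>assistant
+ ```
+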
+ ---
+
+ ## Finetuning Data
+ Please refer to [alinia/EADOP-RAG-out-of-domain](https://huggingface.co/datasets/alinia/EADOP-RAG-out-of-domain) for the Dataset Card.
+
+ ### Author
+ This model has been finetuned by [Alinia AI](https://alinia.ai/).
+
+ ### Contact
+ For further information, please email [[email protected]](mailto:[email protected]).
+
+ ### Copyright
+ Copyright (c) 2024 by the Language Technologies Unit, Barcelona Supercomputing Center.
+
+ ### License
+ Apache-2.0
+
+ ### Funding
+ This work has been promoted and financed by the Generalitat de Catalunya through the Aina project.
+
+ ### Acknowledgements
+ The data collection process was supported by the [Entitat Autònoma del Diari Oficial i de Publicacions (EADOP)](https://dogc.gencat.cat/ca/sobre-el-dogc/eadop/).
salamandra-7b-aligned-eadop.Q4_0.gguf ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:5e4cb2fa7652f00051e2e9d5117f387930248f75aba70eff1fd57583f5f29e0b
+ size 4647269536