Update README.md
README.md
CHANGED
---
license: apache-2.0
---

# FLOR-1.3B Instructed

## Table of Contents
<details>
<summary>Click to expand</summary>

- [Model description](#model-description)
- [Intended uses and limitations](#intended-uses-and-limitations)
- [How to use](#how-to-use)
- [Limitations and bias](#limitations-and-bias)
- [Training](#training)
- [Evaluation](#evaluation)
- [Additional information](#additional-information)

</details>

## Model description

**FLOR-1.3B-Instructed** is a 1.3B-parameter transformer-based causal language model for Catalan, Spanish, and English, trained on a combined dataset drawn from [InstruCat](https://huggingface.co/datasets/BSC-LT/InstruCat), a set of Catalan instructions generated automatically from projecte-aina task-oriented datasets; a subset of the [Dolly](https://huggingface.co/datasets/databricks/databricks-dolly-15k) dataset for English; and [MENTOR_ES](https://huggingface.co/datasets/projecte-aina/MENTOR_ES) and [MENTOR_CA](https://huggingface.co/datasets/projecte-aina/MENTOR_CA), Spanish and Catalan sets of instructions commissioned by the BSC Language Technologies Unit.
It is the result of a language adaptation technique performed on [BLOOM-1.7B](https://huggingface.co/bigscience/bloom-1b7),
which involves modifying the model's vocabulary and embedding layer, and continuously pre-training the model with 140B tokens in our target languages.
A blog post describing the larger model in this family, and the adaptation technique, is available here: [FLOR-6.3B, a Chinchilla-compliant model for Catalan, Spanish, and English](https://medium.com/@mpamies247/flor-6-3b-a-chinchilla-compliant-model-for-catalan-spanish-and-english-7cdb389a9aac).

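The adaptation step can be pictured roughly as follows. This is only a minimal sketch in `transformers`, assuming a tokenizer built for the target languages (here illustrated with the `projecte-aina/FLOR-1.3B` tokenizer); it is not the exact recipe used to produce FLOR.

```python
# Rough sketch only (not the exact FLOR procedure): swap the vocabulary and
# embedding layer of the BLOOM base checkpoint before continued pre-training.
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-1b7")
# Assumption: a tokenizer trained on the Catalan/Spanish/English target corpus
new_tokenizer = AutoTokenizer.from_pretrained("projecte-aina/FLOR-1.3B")

# Re-dimension the input/output embeddings to the new (smaller) vocabulary;
# rows for new tokens start from freshly initialized weights.
base_model.resize_token_embeddings(len(new_tokenizer))

# Continued pre-training on target-language text would follow from here.
```
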
## Intended uses and limitations

The **FLOR-1.3B-Instructed** model is ready to use for some downstream tasks.
Because it has been fine-tuned for specific instruction-following scenarios, it can perform text-generation tasks such as summarization, question answering, and creative writing.

## How to use

```python
import torch
from transformers import pipeline

pipe = pipeline("text-generation", model="projecte-aina/FLOR-1.3B-Instructed")

instruction = "Quants habitants té Mataró?"  # "How many inhabitants does Mataró have?"

# Catalan context paragraph about Mataró (a Catalan city of roughly 130,000 inhabitants)
context = "Mataró és una ciutat de Catalunya, capital de la comarca del Maresme. Situada al litoral mediterrani, a uns 30 km al nord-est de Barcelona, ha estat tradicionalment un centre administratiu de rellevància territorial i un pol de dinamisme econòmic. Compta amb prop de 130.000 habitants, essent actualment la vuitena població del Principat i la tretzena dels Països Catalans."

# The prompt must combine the instruction and context using the
# "### Instruction / ### Context / ### Answer" sections separated with \n.

def givePrediction(instruction, context, max_new_tokens=50, repetition_penalty=1.2, top_k=50, top_p=0.95, do_sample=True, temperature=0.5):
    text = f"### Instruction\n{instruction}\n### Context\n{context}\n### Answer\n"
    response = pipe(
        text,
        temperature=temperature,
        repetition_penalty=repetition_penalty,
        max_new_tokens=max_new_tokens,
        top_k=top_k,
        top_p=top_p,
        do_sample=do_sample,
    )[0]["generated_text"]
    # Keep only the text generated after the "### Answer" marker
    answer = response.split("### Answer\n")[-1].strip()
    return answer

answer = givePrediction(instruction, context)

print(answer)
# '130 000'
```

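If you prefer not to use the `pipeline` helper, the same call can be made with the model and tokenizer loaded directly. The following is an equivalent sketch using the standard `transformers` generation API, reusing the `instruction` and `context` variables and the sampling parameters from the example above.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "projecte-aina/FLOR-1.3B-Instructed"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Same "### Instruction / ### Context / ### Answer" prompt format as above
prompt = f"### Instruction\n{instruction}\n### Context\n{context}\n### Answer\n"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=50,
        do_sample=True,
        temperature=0.5,
        top_k=50,
        top_p=0.95,
        repetition_penalty=1.2,
    )

print(tokenizer.decode(output[0], skip_special_tokens=True))
```
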
## Limitations and bias

At the time of submission, no measures have been taken to estimate the bias and toxicity embedded in the model.
However, we are well aware that our models may be biased since the corpora have been collected using crawling techniques
on multiple web sources. We intend to conduct research in these areas in the future, and if completed, this model card will be updated.

## Training
### Instruction Data
The base model's training corpus is composed of 140B tokens gathered from web crawling and public-domain data.