crodri committed (verified)
Commit 6cf3a57
1 Parent(s): 165caaa

Update README.md

Files changed (1): README.md +69 -1
README.md CHANGED
@@ -1,3 +1,71 @@
  ---
- license: mit
+ license: apache-2.0
  ---
+
+
+ # FLOR-1.3B Instructed
+
+ ## Table of Contents
+ <details>
+ <summary>Click to expand</summary>
+
+ - [Model description](#model-description)
+ - [Intended uses and limitations](#intended-uses-and-limitations)
+ - [How to use](#how-to-use)
+ - [Limitations and bias](#limitations-and-bias)
+ - [Training](#training)
+ - [Evaluation](#evaluation)
+ - [Additional information](#additional-information)
+
+ </details>
+
+ ## Model description
+
+ **FLOR-1.3B-Instructed** is a 1.3B-parameter transformer-based causal language model for Catalan, Spanish, and English. It was trained on a combined dataset drawn from [InstruCat](https://huggingface.co/datasets/BSC-LT/InstruCat), a Catalan set of instructions generated automatically from projecte-aina task-oriented datasets; a subset of the [Dolly](https://huggingface.co/datasets/databricks/databricks-dolly-15k) dataset for English; and [MENTOR_ES](https://huggingface.co/datasets/projecte-aina/MENTOR_ES) and [MENTOR_CA](https://huggingface.co/datasets/projecte-aina/MENTOR_CA), Spanish and Catalan sets of instructions commissioned by the BSC Language Technologies Unit.
+ It is the result of a language adaptation technique performed on [BLOOM-1.7B](https://huggingface.co/bigscience/bloom-1b7),
+ which involves modifying the model's vocabulary and embedding layer, and continuously pre-training the model with 140B tokens in our target languages.
+ For more details on the adaptation approach, see the blog post on the larger model in this family: [FLOR-6.3B, a Chinchilla-compliant model for Catalan, Spanish and English](https://medium.com/@mpamies247/flor-6-3b-a-chinchilla-compliant-model-for-catalan-spanish-and-english-7cdb389a9aac)
+
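As an illustration only (not part of the original card), the instruction sources listed above can be inspected directly from the Hugging Face Hub. The snippet below is a minimal sketch; the `train` split name and open access to each dataset are assumptions, and the exact subsets used for fine-tuning are not reproduced here.

```python
# Sketch: inspect the instruction sources named in the model description.
# Split names and access permissions are assumptions.
from datasets import load_dataset

sources = [
    "BSC-LT/InstruCat",                 # Catalan, generated from projecte-aina task-oriented datasets
    "databricks/databricks-dolly-15k",  # English (only a subset was used for FLOR-1.3B-Instructed)
    "projecte-aina/MENTOR_ES",          # Spanish, commissioned by the BSC Language Technologies Unit
    "projecte-aina/MENTOR_CA",          # Catalan, commissioned by the BSC Language Technologies Unit
]

for name in sources:
    ds = load_dataset(name, split="train")  # assumes a "train" split exists
    print(name, len(ds), ds.column_names)
```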
+ ## Intended uses and limitations
+
+ The **FLOR-1.3B-Instructed** model is ready to use for some downstream tasks.
+ It can perform text-generation tasks such as summarization, question answering, and creative writing, because it has been fine-tuned for these specific scenarios.
+
+ ## How to use
+ ```python
+ import torch
+ from transformers import pipeline
+
+ pipe = pipeline("text-generation", model="projecte-aina/FLOR-1.3B-Instructed")
+
+ instruction = "Quants habitants té Mataró?"
+
+ context = "Mataró és una ciutat de Catalunya, capital de la comarca del Maresme. Situada al litoral mediterrani, a uns 30 km al nord-est de Barcelona, ha estat tradicionalment un centre administratiu de rellevància territorial i un pol de dinamisme econòmic. Compta amb prop de 130.000 habitants, essent actualment la vuitena població del Principat i la tretzena dels Països Catalans."
+
+ # The prompt must combine the instruction and the context using "###" headers and newlines
+
+ def givePrediction(instruction, context, max_new_tokens=50, repetition_penalty=1.2, top_k=50, top_p=0.95, do_sample=True, temperature=0.5):
+     # Build the prompt in the "### Instruction / ### Context / ### Answer" template
+     text = f"### Instruction\n{instruction}\n### Context\n{context}\n### Answer\n"
+     response = pipe(text, temperature=temperature, repetition_penalty=repetition_penalty, max_new_tokens=max_new_tokens, top_k=top_k, top_p=top_p, do_sample=do_sample)[0]["generated_text"]
+     # The pipeline returns the prompt followed by the generation; keep only the answer part
+     answer = response.split("### Answer\n")[-1].strip()
+     return answer
+
+ answer = givePrediction(instruction, context)
+
+ print(answer)
+ # '130 000'
+ ```
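If the snippet above has already been run, the same helper can be reused for open-ended instructions, e.g. creative writing. This is only a sketch: passing an empty context block is an assumption and has not been validated against the model's training format.

```python
# Sketch: reuse givePrediction from the snippet above for a context-free instruction.
# An empty context string is an assumption; output quality may vary.
answer = givePrediction("Write a short poem about the sea.", context="")
print(answer)
```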
+
+ ## Limitations and bias
+ At the time of submission, no measures have been taken to estimate the bias and toxicity embedded in the model.
+ However, we are well aware that our models may be biased since the corpora have been collected using crawling techniques
+ on multiple web sources. We intend to conduct research in these areas in the future, and if completed, this model card will be updated.
+
+
+ ## Training
+
+
+ ### Instruction Data
+
+ The base model's pre-training corpus is composed of 140B tokens gathered from web crawling and public domain data; the instruction-tuning data combines InstruCat, a subset of Dolly, MENTOR_ES, and MENTOR_CA, as described above.
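The exact serialization used to build the fine-tuning examples is not documented in this card. Judging from the inference template shown in the How to use section, a sketch of such formatting might look as follows; the field names are illustrative and are not the datasets' actual column names.

```python
# Illustrative only: serialize one instruction example into the
# "### Instruction / ### Context / ### Answer" template used at inference time.
def to_training_text(example):
    return (
        f"### Instruction\n{example['instruction']}\n"
        f"### Context\n{example.get('context', '')}\n"
        f"### Answer\n{example['answer']}\n"
    )

print(to_training_text({
    "instruction": "Quants habitants té Mataró?",
    "context": "Mataró compta amb prop de 130.000 habitants.",
    "answer": "130 000",
}))
```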