crodri committed (verified)
Commit 6cf3a57
1 Parent(s): 165caaa

Update README.md

Files changed (1): README.md +69 -1
README.md CHANGED
@@ -1,3 +1,71 @@
  ---
- license: mit
+ license: apache-2.0
  ---
+
+
+ # FLOR-1.3B Instructed
+
+ ## Table of Contents
+ <details>
+ <summary>Click to expand</summary>
+
+ - [Model description](#model-description)
+ - [Intended uses and limitations](#intended-uses-and-limitations)
+ - [How to use](#how-to-use)
+ - [Limitations and bias](#limitations-and-bias)
+ - [Training](#training)
+ - [Evaluation](#evaluation)
+ - [Additional information](#additional-information)
+
+ </details>
+
+ ## Model description
+
+ **FLOR-1.3B-Instructed** is a 1.3B-parameter transformer-based causal language model for Catalan, Spanish, and English. It was trained on a combined dataset drawn from [InstruCat](https://huggingface.co/datasets/BSC-LT/InstruCat), a Catalan set of instructions generated automatically from projecte-aina task-oriented datasets; a subset of the [Dolly](https://huggingface.co/datasets/databricks/databricks-dolly-15k) dataset for English; and [MENTOR_ES](https://huggingface.co/datasets/projecte-aina/MENTOR_ES) and [MENTOR_CA](https://huggingface.co/datasets/projecte-aina/MENTOR_CA), Spanish and Catalan sets of instructions commissioned by the BSC Language Technologies Unit.
+ It is the result of a language adaptation technique performed on [BLOOM-1.7B](https://huggingface.co/bigscience/bloom-1b7),
+ which involves modifying the model's vocabulary and embedding layer, and continuously pre-training the model with 140B tokens in our target languages.
+ For more details on the adaptation approach, see the blog post on the larger model in this family: [FLOR-6.3B, a Chinchilla-compliant model for Catalan, Spanish and English](https://medium.com/@mpamies247/flor-6-3b-a-chinchilla-compliant-model-for-catalan-spanish-and-english-7cdb389a9aac)
+
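As an illustration only (not part of the original card), the instruction sources listed above can be inspected directly from the Hugging Face Hub. The snippet below is a minimal sketch; the `train` split name and open access to each dataset are assumptions, and the exact subsets used for fine-tuning are not reproduced here.

```python
# Sketch: inspect the instruction sources named in the model description.
# Split names and access permissions are assumptions.
from datasets import load_dataset

sources = [
    "BSC-LT/InstruCat",                 # Catalan, generated from projecte-aina task-oriented datasets
    "databricks/databricks-dolly-15k",  # English (only a subset was used for FLOR-1.3B-Instructed)
    "projecte-aina/MENTOR_ES",          # Spanish, commissioned by the BSC Language Technologies Unit
    "projecte-aina/MENTOR_CA",          # Catalan, commissioned by the BSC Language Technologies Unit
]

for name in sources:
    ds = load_dataset(name, split="train")  # assumes a "train" split exists
    print(name, len(ds), ds.column_names)
```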
+ ## Intended uses and limitations
+
+ The **FLOR-1.3B-Instructed** model is ready to use for some downstream tasks.
+ It can perform text-generation tasks such as summarization, question answering, and creative writing, because it has been fine-tuned for these specific scenarios.
+
+ ## How to use
+ ```python
+ import torch
+ from transformers import pipeline
+
+ pipe = pipeline("text-generation", model="projecte-aina/FLOR-1.3B-Instructed")
+
+ instruction = "Quants habitants té Mataró?"
+
+ context = "Mataró és una ciutat de Catalunya, capital de la comarca del Maresme. Situada al litoral mediterrani, a uns 30 km al nord-est de Barcelona, ha estat tradicionalment un centre administratiu de rellevància territorial i un pol de dinamisme econòmic. Compta amb prop de 130.000 habitants, essent actualment la vuitena població del Principat i la tretzena dels Països Catalans."
+
+ # The prompt must combine the instruction and the context using "###" headers and newlines
+
+ def givePrediction(instruction, context, max_new_tokens=50, repetition_penalty=1.2, top_k=50, top_p=0.95, do_sample=True, temperature=0.5):
+     # Build the prompt in the "### Instruction / ### Context / ### Answer" template
+     text = f"### Instruction\n{instruction}\n### Context\n{context}\n### Answer\n"
+     response = pipe(text, temperature=temperature, repetition_penalty=repetition_penalty, max_new_tokens=max_new_tokens, top_k=top_k, top_p=top_p, do_sample=do_sample)[0]["generated_text"]
+     # The pipeline returns the prompt followed by the generation; keep only the answer part
+     answer = response.split("### Answer\n")[-1].strip()
+     return answer
+
+ answer = givePrediction(instruction, context)
+
+ print(answer)
+ # '130 000'
+ ```
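If the snippet above has already been run, the same helper can be reused for open-ended instructions, e.g. creative writing. This is only a sketch: passing an empty context block is an assumption and has not been validated against the model's training format.

```python
# Sketch: reuse givePrediction from the snippet above for a context-free instruction.
# An empty context string is an assumption; output quality may vary.
answer = givePrediction("Write a short poem about the sea.", context="")
print(answer)
```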
+
+ ## Limitations and bias
+ At the time of submission, no measures have been taken to estimate the bias and toxicity embedded in the model.
+ However, we are well aware that our models may be biased since the corpora have been collected using crawling techniques
+ on multiple web sources. We intend to conduct research in these areas in the future, and if completed, this model card will be updated.
+
+
+ ## Training
+
+
+ ### Instruction Data
+
+ The base model's pre-training corpus is composed of 140B tokens gathered from web crawling and public domain data; the instruction-tuning data combines InstruCat, a subset of Dolly, MENTOR_ES, and MENTOR_CA, as described above.
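The exact serialization used to build the fine-tuning examples is not documented in this card. Judging from the inference template shown in the How to use section, a sketch of such formatting might look as follows; the field names are illustrative and are not the datasets' actual column names.

```python
# Illustrative only: serialize one instruction example into the
# "### Instruction / ### Context / ### Answer" template used at inference time.
def to_training_text(example):
    return (
        f"### Instruction\n{example['instruction']}\n"
        f"### Context\n{example.get('context', '')}\n"
        f"### Answer\n{example['answer']}\n"
    )

print(to_training_text({
    "instruction": "Quants habitants té Mataró?",
    "context": "Mataró compta amb prop de 130.000 habitants.",
    "answer": "130 000",
}))
```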