gonzalez-agirre committed cd6011a (parent: ffb8213): Update README.md

README.md CHANGED
@@ -81,6 +81,47 @@ pipeline_tag: text-generation
**Cǒndor-7B** is a transformer-based causal language model for Catalan, Spanish, and English. It is based on the [Falcon-7B](https://huggingface.co/tiiuae/falcon-7b) model and has been trained on a 26B-token trilingual corpus collected from publicly available corpora and crawlers.
## Intended uses & limitations

The **Cǒndor-7B** model is ready to use only for causal language modeling, i.e. for text-generation tasks. However, it is primarily intended to be fine-tuned on a generative downstream task.
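
For illustration, here is a minimal fine-tuning sketch; it is not part of the original model card. The training file name and all hyperparameters below are placeholders, and fine-tuning a 7B model in practice usually calls for parameter-efficient methods (e.g. LoRA) or a multi-GPU setup:

```python
# Hypothetical sketch: continue causal-LM training on task-specific data.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_id = "BSC-LT/condor-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token  # Falcon-style tokenizers define no pad token
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

# Placeholder corpus: one training example per line of plain text.
dataset = load_dataset("text", data_files={"train": "train.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="condor-7b-finetuned", num_train_epochs=1),
    train_dataset=tokenized["train"],
    # mlm=False yields standard next-token (causal LM) labels.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```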
## How to use
Here is how to use this model:
```python
import torch
import transformers
from transformers import AutoTokenizer

input_text = "Maria y Miguel no tienen ningún "
model = "BSC-LT/condor-7b"
tokenizer = AutoTokenizer.from_pretrained(model)

pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)
generation = pipeline(
    input_text,
    max_length=200,
    do_sample=True,
    top_k=10,
    eos_token_id=tokenizer.eos_token_id,
)

# The pipeline returns a list with one dict per generated sequence.
print(f"Result: {generation[0]['generated_text']}")
```
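
Note that `device_map="auto"` requires the `accelerate` package and places the model's layers across whatever GPUs (or CPU memory) are available, while `torch_dtype=torch.bfloat16` roughly halves the memory footprint relative to full precision.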
## Limitations and biases
At the time of submission, no measures have been taken to estimate the bias and toxicity embedded in the model. However, we are well aware that our models may be biased since the corpora have been collected using crawling techniques on multiple web sources. We intend to conduct research in these areas in the future, and this model card will be updated once that research is completed.
## Language adaptation
We adapted the original Falcon-7B model to Spanish and Catalan by swapping the tokenizer and adjusting the embedding layer. The adaptation procedure is explained in this [blog post](https://medium.com/@mpamies247/ee1ebc70bc79).
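
As a rough illustration of the general technique (a sketch of the standard recipe, not the authors' exact procedure; see the blog post for that), one can copy embeddings for tokens shared between the two vocabularies and initialize the remaining ones from the mean of the old embedding matrix:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Source model and its original English-centric tokenizer.
old_tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")
model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-7b", trust_remote_code=True)

# Stand-in for the new trilingual tokenizer (here, the adapted model's own).
new_tokenizer = AutoTokenizer.from_pretrained("BSC-LT/condor-7b")

old_embeddings = model.get_input_embeddings().weight.detach().clone()
old_vocab = old_tokenizer.get_vocab()

# Resize the embedding matrix (and any tied output layer) to the new vocabulary.
model.resize_token_embeddings(len(new_tokenizer))
new_embeddings = model.get_input_embeddings().weight

with torch.no_grad():
    mean_embedding = old_embeddings.mean(dim=0)
    for token, new_id in new_tokenizer.get_vocab().items():
        old_id = old_vocab.get(token)
        # Tokens shared by both vocabularies keep their pretrained vectors;
        # new tokens start at the mean and are learned during continued pretraining.
        new_embeddings[new_id] = old_embeddings[old_id] if old_id is not None else mean_embedding
```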

@@ -133,13 +174,7 @@ The resulting dataset has the following language distribution:

## Training and evaluation data