mariagrandury committed
Commit: a6354e1
Parent(s): f5f761a
Update README.md
README.md CHANGED
@@ -13,7 +13,9 @@ datasets:
 library_name: transformers
 ---
 
-
+# Model Card for LINCE-ZERO
+
+**LINCE ZERO** (Llm for Instructions from Natural Corpus en Español) is a state-of-the-art Spanish instruction language model. Developed by [Clibrain](https://www.clibrain.com/), it is a causal decoder-only model with 7B parameters. LINCE-ZERO is based on [Falcon-7B](https://huggingface.co/tiiuae/falcon-7b) and has been fine-tuned using an augmented combination of the [Alpaca](https://huggingface.co/datasets/tatsu-lab/alpaca) and [Dolly](https://huggingface.co/datasets/databricks/databricks-dolly-15k) datasets, both translated into Spanish.
 
 The model is released under the Apache 2.0 license.
 
@@ -21,7 +23,6 @@ The model is released under the Apache 2.0 license.
 <img src="https://huggingface.co/clibrain/lince-zero/resolve/main/LINCE-CLIBRAIN-HD.jpg" alt="lince logo"">
 </div>
 
-# Model Card for LINCE-ZERO
 
 # Table of Contents
 
@@ -59,14 +60,13 @@ The model is released under the Apache 2.0 license.
 
 ## Model Description
 
-LINCE-ZERO (Llm for Instructions from Natural Corpus en Español) is a state-of-the-art Spanish instruction language model. Developed by
+LINCE-ZERO (Llm for Instructions from Natural Corpus en Español) is a state-of-the-art Spanish instruction language model. Developed by [Clibrain](https://www.clibrain.com/), it is a causal decoder-only model with 7B parameters. LINCE-ZERO is based on [Falcon-7B](https://huggingface.co/tiiuae/falcon-7b) and has been fine-tuned using an augmented combination of the [Alpaca](https://huggingface.co/datasets/tatsu-lab/alpaca) and [Dolly](https://huggingface.co/datasets/databricks/databricks-dolly-15k) datasets, both translated into Spanish.
 
 - **Developed by:** [Clibrain](https://www.clibrain.com/)
 - **Model type:** Language model, instruction model, causal decoder-only
 - **Language(s) (NLP):** es
 - **License:** apache-2.0
 - **Parent Model:** [https://huggingface.co/tiiuae/falcon-7b](https://huggingface.co/tiiuae/falcon-7b)
-- **Resources for more information:** Paper coming soon
 
 ## Model Sources
 
@@ -95,7 +95,7 @@ LINCE-ZERO has limitations associated with both the underlying language model an
 
 Since the model has been fine-tuned on translated versions of the Alpaca and Dolly datasets, it has potentially inherited certain limitations and biases:
 
-- Alpaca: The Alpaca dataset is generated by a language model (`text-davinci-003`) and inevitably contains some errors or biases inherent in that model. As the authors report, hallucination seems to be a common failure mode for Alpaca, even compared to text-davinci-003
+- Alpaca: The Alpaca dataset is generated by a language model (`text-davinci-003`) and inevitably contains some errors or biases inherent in that model. As the authors report, hallucination seems to be a common failure mode for Alpaca, even compared to `text-davinci-003`.
 - Dolly: The Dolly dataset incorporates information from Wikipedia, which is a crowdsourced corpus. Therefore, the dataset's contents may reflect the biases, factual errors, and topical focus present in Wikipedia. Additionally, annotators involved in the dataset creation may not be native English speakers, and their demographics and subject matter may reflect the makeup of Databricks employees.
 
 ## Recommendations
@@ -108,7 +108,7 @@ If considering LINCE-ZERO for production use, it is crucial to thoroughly evalua
 
 ## Training Data
 
-LINCE-ZERO is based on
+LINCE-ZERO is based on [Falcon-7B](https://huggingface.co/tiiuae/falcon-7b) and has been fine-tuned using an augmented combination of the [Alpaca](https://huggingface.co/datasets/tatsu-lab/alpaca) and [Dolly](https://huggingface.co/datasets/databricks/databricks-dolly-15k) datasets, both translated into Spanish.
 
 Alpaca is a 24.2 MB dataset of 52,002 instructions and demonstrations in English. It was generated by OpenAI's `text-davinci-003` engine using the data generation pipeline from the [Self-Instruct framework](https://github.com/yizhongw/self-instruct) with some modifications. For further details, refer to [Alpaca's Data Card](https://huggingface.co/datasets/tatsu-lab/alpaca).
 
@@ -122,7 +122,7 @@ For detailed information about the model architecture and compute infrastructure
 
 To prepare the training data, both the Alpaca and Dolly datasets, originally in English, were translated into Spanish using …
 
-The data was tokenized using LINCE-ZERO’s tokenizer, which is based on the Falcon
+The data was tokenized using LINCE-ZERO’s tokenizer, which is based on the Falcon-[7B](https://huggingface.co/tiiuae/falcon-7b)/[40B](https://huggingface.co/tiiuae/falcon-40b) tokenizer.
 
 ### Training Hyperparameters
 
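For reference, a minimal inference sketch with 🤗 Transformers (the card's `library_name`). The repo id `clibrain/lince-zero` is inferred from the logo URL in the card, and the Alpaca-style Spanish prompt format is an assumption rather than something the card specifies.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "clibrain/lince-zero"  # assumed repo id, inferred from the card's asset URL

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # 7B parameters; half precision keeps memory manageable
    device_map="auto",
    trust_remote_code=True,      # Falcon-derived checkpoints may ship custom modeling code
)

# Hypothetical Alpaca-style Spanish prompt; adjust to whatever template the model expects.
prompt = (
    "A continuación hay una instrucción que describe una tarea. "
    "Escribe una respuesta que la complete adecuadamente.\n\n"
    "### Instrucción:\nDame una lista de lugares que visitar en España.\n\n"
    "### Respuesta:\n"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```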
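The Training Data section describes fine-tuning on an augmented combination of Alpaca and Dolly translated into Spanish. Below is a sketch of that assembly step with 🤗 Datasets, assuming Dolly's `context`/`response` columns map onto Alpaca's `input`/`output`; the card elides the translation method, so the placeholder deliberately leaves it unimplemented.

```python
from datasets import concatenate_datasets, load_dataset

alpaca = load_dataset("tatsu-lab/alpaca", split="train")                # 52,002 instruction examples
dolly = load_dataset("databricks/databricks-dolly-15k", split="train")  # ~15k human-written examples

# Bring both datasets to a shared (instruction, input, output) schema before merging.
alpaca = alpaca.remove_columns(["text"])
dolly = dolly.rename_columns({"context": "input", "response": "output"}).remove_columns(["category"])
combined = concatenate_datasets([alpaca, dolly])

def translate_to_spanish(batch):
    # Placeholder: the card does not say which EN->ES translation system was used.
    raise NotImplementedError("translation backend not specified in the model card")

# combined = combined.map(translate_to_spanish, batched=True)
print(combined)
```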
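The preprocessing note says the data was tokenized with LINCE-ZERO's tokenizer, which derives from the Falcon tokenizer. A small sketch of that step, loading the Falcon-7B tokenizer as a stand-in (in practice the fine-tuned repo's own tokenizer would be used):

```python
from transformers import AutoTokenizer

# Falcon-7B tokenizer as a stand-in for LINCE-ZERO's (the card says they share a base).
tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")

example = "Dame una lista de lugares que visitar en España."
encoded = tokenizer(example, truncation=True, max_length=2048)
print(encoded["input_ids"])
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
```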