Update README.md
README.md CHANGED
@@ -40,14 +40,7 @@ The model is released under the Apache 2.0 license.
 - [Recommendations](#recommendations)
 - [Training Details](#training-details)
 - [Training Data](#training-data)
-- [Training Procedure](#training-procedure)
-- [Preprocessing](#preprocessing)
-- [Speeds, Sizes, Times](#speeds-sizes-times)
 - [Evaluation](#evaluation)
-- [Testing Data, Factors & Metrics](#testing-data-factors--metrics)
-- [Testing Data](#testing-data)
-- [Factors](#factors)
-- [Metrics](#metrics)
 - [Results](#results)
 - [Environmental Impact](#environmental-impact)
 - [Technical Specifications](#technical-specifications)
@@ -55,9 +48,9 @@ The model is released under the Apache 2.0 license.
 - [Compute Infrastructure](#compute-infrastructure)
 - [Hardware](#hardware)
 - [Software](#software)
+- [How to Get Started with the Model](#how-to-get-started-with-the-model)
 - [Citation](#citation)
 - [Contact](#contact)
-- [How to Get Started with the Model](#how-to-get-started-with-the-model)
 
 # 🐯 Model Details
 
@@ -111,45 +104,18 @@ If considering LINCE-ZERO for production use, it is crucial to thoroughly evalua
 
 ## Training Data
 
-LINCE-ZERO is based on [Falcon-7B](https://huggingface.co/tiiuae/falcon-7b) and has been fine-tuned using an augmented combination of the [Alpaca](https://huggingface.co/datasets/tatsu-lab/alpaca) and [Dolly](https://huggingface.co/datasets/databricks/databricks-dolly-15k) datasets, both translated into Spanish.
+LINCE-ZERO is based on [Falcon-7B](https://huggingface.co/tiiuae/falcon-7b) and has been fine-tuned using an augmented combination of the [Alpaca](https://huggingface.co/datasets/tatsu-lab/alpaca) and [Dolly](https://huggingface.co/datasets/databricks/databricks-dolly-15k) datasets, both translated into high-quality Spanish.
 
 Alpaca is a 24.2 MB dataset of 52,002 instructions and demonstrations in English. It was generated by OpenAI's `text-davinci-003` engine using the data generation pipeline from the [Self-Instruct framework](https://github.com/yizhongw/self-instruct) with some modifications. For further details, refer to [Alpaca's Data Card](https://huggingface.co/datasets/tatsu-lab/alpaca).
 
 Dolly is a 13.1 MB dataset of 15,011 instruction-following records in American English. It was generated by thousands of Databricks employees, who were requested to provide reference texts copied from Wikipedia for specific categories. To learn more, consult [Dolly’s Data Card](https://huggingface.co/datasets/databricks/databricks-dolly-15k).
 
-After combining both translations, the dataset was augmented to a total of 80k examples.
-
-## Training Procedure
-
-For detailed information about the model architecture and compute infrastructure, please refer to the Technical Specifications section.
-
-### Preprocessing
-
-To prepare the training data, both the Alpaca and Dolly datasets, originally in English, were translated into Spanish using …
-
-The data was tokenized using LINCE-ZERO’s tokenizer, which is based on the Falcon-[7B](https://huggingface.co/tiiuae/falcon-7b)/[40B](https://huggingface.co/tiiuae/falcon-40b) tokenizer.
-
-### Training Hyperparameters
+After combining both translations, the dataset was augmented to reach a total of 80k examples.
 
-More information needed
-
-### Speeds, Sizes, Times
-
-More information needed (throughput, start/end time, checkpoint size if relevant, etc.)
 
 # ✅ Evaluation
 
-
-
-### Testing Data
-
-The model has been tested on a X% of the augmented combination of Alpaca (24.2 MB) and Dolly (13.1 MB) translated into Spanish.
-
-### Metrics
-
-Since LINCE-ZERO is an instruction model, the metrics used to evaluate it are:
-
-- X: <value>
+This section is a work in progress (WIP).
 
 ### Results
 
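Since the diff above keeps only a single Training Data section describing the combined Alpaca and Dolly corpus, here is a minimal sketch of how such a corpus can be assembled with the Hugging Face `datasets` library. It uses the public English releases of both datasets; the Spanish translation and the augmentation to 80k examples mentioned in the README are not public, so this is an illustrative outline under those assumptions, not LINCE-ZERO's actual pipeline.

```python
# Illustrative only: load the public English Alpaca and Dolly releases and
# concatenate them into one instruction-tuning corpus. The Spanish
# translation and augmentation steps used for LINCE-ZERO are not reproduced here.
from datasets import load_dataset, concatenate_datasets

alpaca = load_dataset("tatsu-lab/alpaca", split="train")                 # 52,002 rows
dolly = load_dataset("databricks/databricks-dolly-15k", split="train")  # 15,011 rows

# Map both schemas onto a shared (instruction, input, output) layout.
dolly = dolly.rename_columns({"context": "input", "response": "output"})
dolly = dolly.remove_columns(
    [c for c in dolly.column_names if c not in ("instruction", "input", "output")]
)
alpaca = alpaca.remove_columns(
    [c for c in alpaca.column_names if c not in ("instruction", "input", "output")]
)

combined = concatenate_datasets([alpaca, dolly])
print(len(combined))  # 67,013 examples before any augmentation
```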
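The reordered table of contents also promotes the "How to Get Started with the Model" section. In that spirit, a hedged usage sketch with the standard `transformers` text-generation API, assuming the checkpoint is published on the Hugging Face Hub as `clibrain/lince-zero` (the repo id is an assumption, not stated in this diff):

```python
# Hedged sketch: the model id and generation settings are assumptions, not
# taken from this diff. Falcon-based checkpoints may need trust_remote_code=True
# on older transformers versions; recent releases support Falcon natively.
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "clibrain/lince-zero"  # assumed Hub repo id for LINCE-ZERO
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

generate = pipeline("text-generation", model=model, tokenizer=tokenizer)
prompt = "Instrucción: Explica en una frase qué es LINCE-ZERO.\nRespuesta:"
print(generate(prompt, max_new_tokens=64)[0]["generated_text"])
```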