data update
README.md
CHANGED
@@ -300,12 +300,14 @@ print(output)
 
 <!-- TO DO: To be completed once the paper is ready; we may change the title to Supervised Finetuning -->
 ## Training Data
-
-
-*
-*
-
-*
+Granite Language Instruct models are trained on a collection of publicly available datasets with non-restrictive licenses, as well as an IBM collection of synthetic datasets. We annotated and filtered these datasets to include only high-quality instances from each in our final mixture. The selection covers the following domains:
+
+* English datasets: [Open-Platypus](https://huggingface.co/datasets/garage-bAInd/Open-Platypus), [WebInstructSub](https://huggingface.co/datasets/TIGER-Lab/WebInstructSub), [OASST-OctoPack](https://huggingface.co/datasets/bigcode/oasst-octopack), [Daring-Anteater](https://huggingface.co/datasets/nvidia/Daring-Anteater), [SoftAge-Multiturn](https://huggingface.co/datasets/SoftAge-AI/multi-turn_dataset), [Glaive-RAG-v1](https://huggingface.co/datasets/glaiveai/RAG-v1), [EvolKit-20k](https://huggingface.co/datasets/arcee-ai/EvolKit-20k), and [Magpie-Phi3-Pro-300K-Filtered](https://huggingface.co/datasets/Magpie-Align/Magpie-Phi3-Pro-300K-Filtered).
+* Multilingual datasets: [Aya Dataset](https://huggingface.co/datasets/CohereForAI/aya_dataset) and IBM synthetic datasets (e.g., Blue Multilingual, Daring Anteater Translated).
+* Code datasets: [Glaive Code Assistant V3](https://huggingface.co/datasets/glaiveai/glaive-code-assistant-v3), [SQL Create Context Instruction](https://huggingface.co/datasets/bugdaryan/sql-create-context-instruction), and [Self-OSS-Instruct-SC2](https://huggingface.co/datasets/bigcode/self-oss-instruct-sc2-exec-filter-50k), plus single- and multi-turn IBM synthetic datasets, including a set generated via the evol-instruct method.
+* Math: [MetaMathQA](https://huggingface.co/datasets/meta-math/MetaMathQA), [StackMathQA](https://huggingface.co/datasets/math-ai/StackMathQA), and [MathInstruct](https://huggingface.co/datasets/TIGER-Lab/MathInstruct).
+* Tools: [xlam-function-calling](https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k), [Glaive Function Calling V2](https://huggingface.co/datasets/glaiveai/glaive-function-calling-v2), [Hermes Function Calling V1](https://huggingface.co/datasets/NousResearch/hermes-function-calling-v1), and IBM synthetic API data.
+* Safety: [SimpleSafetyTests](https://huggingface.co/datasets/Bertievidgen/SimpleSafetyTests), [HarmBench Behaviors](https://github.com/centerforaisafety/HarmBench/blob/main/data/behavior_datasets/harmbench_behaviors_text_all.csv), [Strong Reject](https://github.com/alexandrasouly/strongreject/blob/main/strongreject_dataset/strongreject_dataset.csv), [AdvBench](https://huggingface.co/datasets/walledai/AdvBench), [MistralGuard](https://huggingface.co/datasets/natolambert/xstest-v2-copy), [Do-Not-Answer](https://huggingface.co/datasets/LibrAI/do-not-answer), and IBM synthetic data for safety.
 
 ## Infrastructure
 We train the Granite Language models using IBM's supercomputing cluster, Blue Vela, which is outfitted with NVIDIA H100 GPUs. This cluster provides a scalable and efficient infrastructure for training our models over thousands of GPUs.