amezasor committed on
Commit
f3aadbd
1 Parent(s): 00749d7

instruction data update

  1. README.md +8 -6
README.md CHANGED
@@ -299,12 +299,14 @@ print(output)
 
  <!-- TO DO: To be completed once the paper is ready; we may change the title to Supervised Finetuning -->
  ## Training Data
- This model is trained on a mix of open-source and proprietary datasets.
- <!-- ### Instruction Datasets
- * Language Instruction Datasets: We include high-quality datasets such as [TO DO: List of datasets]
- * Synthetic Instruction Datasets: [TO DO: paragraph about synthetic data]
- ### Processing
- * [TO DO: Data annotation with MagPie pipeline: quality, duplicates] -->
 
  <!-- CHECK: removed Vela, only talk about blue-vela-->
  ## Infrastructure
 
 
  <!-- TO DO: To be completed once the paper is ready; we may change the title to Supervised Finetuning -->
  ## Training Data
+ Granite Language Instruct models are trained on a collection of publicly available datasets with non-restrictive licenses, as well as an IBM collection of synthetic datasets. We annotated and filtered these datasets to include only high-quality instances from each of them in our final mixture. The selected datasets cover the following domains:
+
+ * English datasets: [Open-Platypus](https://huggingface.co/datasets/garage-bAInd/Open-Platypus), [WebInstructSub](https://huggingface.co/datasets/TIGER-Lab/WebInstructSub), [OASST-OctoPack](https://huggingface.co/datasets/bigcode/oasst-octopack), [Daring-Anteater](https://huggingface.co/datasets/nvidia/Daring-Anteater), [SoftAge-Multiturn](https://huggingface.co/datasets/SoftAge-AI/multi-turn_dataset), [Glaive-RAG-v1](https://huggingface.co/datasets/glaiveai/RAG-v1), [EvolKit-20k](https://huggingface.co/datasets/arcee-ai/EvolKit-20k), and [Magpie-Phi3-Pro-300K-Filtered](https://huggingface.co/datasets/Magpie-Align/Magpie-Phi3-Pro-300K-Filtered).
+ * Multilingual datasets: [Aya Dataset](https://huggingface.co/datasets/CohereForAI/aya_dataset) and IBM Synthetic datasets (e.g., Blue Multilingual, Daring Anteater Translated).
+ * Code datasets: [Glaive Code Assistant V3](https://huggingface.co/datasets/glaiveai/glaive-code-assistant-v3), [SQL Create Context Instruction](https://huggingface.co/datasets/bugdaryan/sql-create-context-instruction), and [Self-OSS-Instruct-SC2](https://huggingface.co/datasets/bigcode/self-oss-instruct-sc2-exec-filter-50k), plus IBM Synthetic datasets (e.g., Blue SC Instruct, Blue CodeGenPlus, Blue OCP Multiturn), including various synthetic data generated via the evol-instruct method (e.g., Blue SC Instruct, Blue OCP Multiturn, Blue Self-OSS-Instruct-SC2, Blue Glaive Code Assistant V3).
+ * Math: [MetaMathQA](https://huggingface.co/datasets/meta-math/MetaMathQA), [StackMathQA](https://huggingface.co/datasets/math-ai/StackMathQA), and [MathInstruct](https://huggingface.co/datasets/TIGER-Lab/MathInstruct).
+ * Tools: [xlam-function-calling](https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k), [Glaive Function Calling V2](https://huggingface.co/datasets/glaiveai/glaive-function-calling-v2), [Hermes Function Calling V1](https://huggingface.co/datasets/NousResearch/hermes-function-calling-v1), and IBM Synthetic API data.
+ * Safety: [SimpleSafetyTests](https://huggingface.co/datasets/Bertievidgen/SimpleSafetyTests), [HarmBench Behaviors](https://github.com/centerforaisafety/HarmBench/blob/main/data/behavior_datasets/harmbench_behaviors_text_all.csv), [Strong Reject](https://github.com/alexandrasouly/strongreject/blob/main/strongreject_dataset/strongreject_dataset.csv), [AdvBench](https://huggingface.co/datasets/walledai/AdvBench), [MistralGuard](https://huggingface.co/datasets/natolambert/xstest-v2-copy), [Do-Not-Answer](https://huggingface.co/datasets/LibrAI/do-not-answer), and IBM Synthetic data for safety.
 
  <!-- CHECK: removed Vela, only talk about blue-vela-->
  ## Infrastructure