---
title: README
emoji: 🐨
colorFrom: pink
colorTo: indigo
sdk: static
pinned: false
---

# The Stack v2 Training Data

This organization contains the full datasets used to train StarCoder2:

- `the-stack-v2-train-full`: the training data covering 600+ programming languages, used to train StarCoder2-15B, with files concatenated per repository
- `the-stack-v2-train-full-files`: same as `the-stack-v2-train-full` but without repository concatenation, which makes filtering by file or license easier
- `the-stack-v2-train-smol`: the training data covering 17 programming languages, used to train StarCoder2-3B and StarCoder2-7B, with files concatenated per repository
- `the-stack-v2-train-smol-files`: same as `the-stack-v2-train-smol` but without repository concatenation, which makes filtering by file or license easier
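
The four variants differ along two axes: language coverage (full or smol) and layout (concatenated per repository or one row per file). The sketch below shows one possible way to pick a name from those two choices. The helper `pick_dataset`, its parameters, and the `org` placeholder are hypothetical and not part of any official API; only the four dataset names come from the list above.

```python
from typing import Optional

# Dataset names exactly as listed in this README, keyed by
# (language subset, per-file layout).
VARIANTS = {
    ("full", False): "the-stack-v2-train-full",
    ("full", True): "the-stack-v2-train-full-files",
    ("smol", False): "the-stack-v2-train-smol",
    ("smol", True): "the-stack-v2-train-smol-files",
}


def pick_dataset(subset: str, per_file: bool, org: Optional[str] = None) -> str:
    """Return the dataset name for a language subset and layout.

    subset:   "full" (600+ languages) or "smol" (17 languages).
    per_file: True for the "-files" variants without repository concatenation.
    org:      hypothetical organization prefix; this README does not state it,
              so it is left to the caller.
    """
    key = (subset, per_file)
    if key not in VARIANTS:
        raise ValueError(f"unknown subset {subset!r}; expected 'full' or 'smol'")
    name = VARIANTS[key]
    return f"{org}/{name}" if org else name


def stream_dataset(repo_id: str):
    """Open a dataset lazily with the Hugging Face `datasets` library.

    Assumptions: `datasets` is installed, you have access to the repository,
    and the data is exposed under a "train" split.
    """
    from datasets import load_dataset  # imported lazily; optional dependency

    return load_dataset(repo_id, split="train", streaming=True)
```

For example, `pick_dataset("smol", per_file=True)` selects the per-file variant of the 17-language data, which is the layout the list above recommends when filtering by file or license.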

See the [tech report](https://arxiv.org/pdf/2402.19173) for all the details on the dataset.