Update README.md
README.md
CHANGED
@@ -1 +1,79 @@
---
license: apache-2.0
datasets:
- HuggingFaceTB/finemath
language:
- en
base_model:
- meta-llama/Llama-3.2-3B
---

# Model Card

## Model summary

This model is part of the 📐 [FineMath](https://huggingface.co/datasets/HuggingFaceTB/finemath) ablations, in which we continue pretraining the [Llama-3.2-3B](https://huggingface.co/meta-llama/Llama-3.2-3B) base model on different math datasets for 60B tokens.
This model has 3.21B parameters and a context length of 4096 tokens. It was trained on **160B tokens** using a mix of 40% [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu), 30% FineMath-4+, and 30% InfiWebMath-4+; the latter two come from the 📐 [FineMath](https://huggingface.co/datasets/HuggingFaceTB/finemath) dataset.
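
As a rough illustration of what the mix above means in tokens, the sketch below splits the 160B-token budget by the stated percentages. It assumes the percentages apply to token counts, which the card does not state explicitly; `tokens_per_source` is a hypothetical helper written for this example, not part of any library.

```python
# Data mix as stated in the model summary, in whole percentages.
# Assumption: the percentages are shares of the training tokens.
TOTAL_TOKENS = 160_000_000_000
MIX_PERCENT = {
    "FineWeb-Edu": 40,
    "FineMath-4+": 30,
    "InfiWebMath-4+": 30,
}


def tokens_per_source(total_tokens, mix_percent):
    """Return the approximate token budget of each dataset in the mix."""
    if sum(mix_percent.values()) != 100:
        raise ValueError("mix percentages must sum to 100")
    # Integer arithmetic keeps the result exact for whole percentages.
    return {name: total_tokens * pct // 100 for name, pct in mix_percent.items()}


budget = tokens_per_source(TOTAL_TOKENS, MIX_PERCENT)
for name, tokens in budget.items():
    print(f"{name}: {tokens // 1_000_000_000}B tokens")
```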

- **License**: Apache-2.0
- **Languages**: English

## Use

### Intended use

This model was trained on English math data and is not instruction-tuned, so it is intended for English text completion with a focus on math.
Note that the primary intended use of this model is comparison with other models trained under the same conditions. It is not necessarily the best outcome achievable with the given dataset.

### Generation

```python
# pip install -q transformers torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "HuggingFaceTB/finemath-ablation-4plus-160B"
device = "cuda"  # use "cpu" to run without a GPU

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)

# Encode the prompt, generate a continuation, and decode it back to text
inputs = tokenizer.encode("Machine Learning is", return_tensors="pt").to(device)
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))
```

## Intermediate checkpoints

We are releasing intermediate checkpoints for this model every 10,000 training steps (10B tokens), each in a separate branch. Branch names follow the token count, for example `10B`.

You can load a specific model revision with `transformers` using the `revision` argument:
```python
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/finemath-ablation-4plus-160B", revision="10B")
```
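
To pick a checkpoint by training step, it can help to compute the branch name rather than type it by hand. The sketch below assumes every branch follows the `<N>B` pattern in 10B-token increments up to 160B; only `10B` is confirmed by this card, so the generated names should be checked against the actual branch list before use. Both helpers are hypothetical and written for this example.

```python
STEPS_PER_CHECKPOINT = 10_000   # one checkpoint every 10,000 training steps
TOKENS_PER_CHECKPOINT_B = 10    # which corresponds to 10B tokens
TOTAL_TOKENS_B = 160            # total training tokens, in billions


def expected_revisions(total_b=TOTAL_TOKENS_B, interval_b=TOKENS_PER_CHECKPOINT_B):
    """List the branch names expected under the assumed `<N>B` pattern."""
    return [f"{n}B" for n in range(interval_b, total_b + 1, interval_b)]


def revision_for_step(step):
    """Map a training step to its assumed branch name, e.g. 10000 -> '10B'."""
    if step <= 0 or step % STEPS_PER_CHECKPOINT != 0:
        raise ValueError("checkpoints are only assumed at multiples of 10,000 steps")
    return f"{step // STEPS_PER_CHECKPOINT * TOKENS_PER_CHECKPOINT_B}B"


print(expected_revisions())
print(revision_for_step(30_000))  # '30B' under the assumed pattern
```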
You can list all the revisions of the model with the following code:
```python
from huggingface_hub import list_repo_refs

out = list_repo_refs("HuggingFaceTB/finemath-ablation-4plus-160B")
print([b.name for b in out.branches])
```
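
The branch names come back as plain strings, and sorting them alphabetically would place `100B` before `20B`. A minimal sketch of ordering them by token count follows. It assumes checkpoint branches match the `<N>B` pattern and skips anything else, such as `main`; the input list is a made-up example, not the actual output for this repository.

```python
import re


def token_count_b(branch_name):
    """Return the token count in billions for names like '10B', else None."""
    match = re.fullmatch(r"(\d+)B", branch_name)
    return int(match.group(1)) if match else None


def sort_checkpoint_branches(names):
    """Keep only `<N>B` branches and order them by token count."""
    checkpoints = [n for n in names if token_count_b(n) is not None]
    return sorted(checkpoints, key=token_count_b)


# Hypothetical branch names, standing in for [b.name for b in out.branches]
names = ["100B", "10B", "main", "20B"]
print(sort_checkpoint_branches(names))  # ['10B', '20B', '100B']
```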

## Training
### Model
- **Architecture**: Llama3
- **Pretraining steps**: 160k
- **Pretraining tokens**: 160B
- **Precision**: bfloat16
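
The step and token counts can be related through the checkpoint interval given earlier (10,000 steps per 10B tokens). The sketch below derives the implied tokens per step and converts token budgets to steps. It assumes a constant number of tokens per step and treats the 10B figure as exact, although it may well be rounded; the sequences-per-step value further assumes every sequence fills the 4096-token context.

```python
CHECKPOINT_STEPS = 10_000            # steps between checkpoints
CHECKPOINT_TOKENS = 10_000_000_000   # tokens between checkpoints (possibly rounded)
CONTEXT_LENGTH = 4096                # tokens per sequence

# Implied batch size in tokens, assuming it stays constant during training.
tokens_per_step = CHECKPOINT_TOKENS // CHECKPOINT_STEPS
sequences_per_step = tokens_per_step / CONTEXT_LENGTH


def steps_for_tokens(total_tokens):
    """Number of training steps needed to consume `total_tokens`."""
    return total_tokens // tokens_per_step


print(tokens_per_step)                    # 1000000
print(round(sequences_per_step))          # 244
print(steps_for_tokens(60_000_000_000))   # 60000
print(steps_for_tokens(160_000_000_000))  # 160000
```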

### Hardware
- **GPUs**: 64 H100

### Software
- [nanotron](https://github.com/huggingface/nanotron/) for training
- [datatrove](https://github.com/huggingface/datatrove) for tokenization
- [lighteval](https://github.com/huggingface/lighteval) for evaluation

## Evaluation
We used the SmolLM2 setup to evaluate all our ablation models with `lighteval`. You can find the details here: https://github.com/huggingface/smollm/tree/main/evaluation#smollm2-base-models

## Limitations
This model was predominantly trained on English math data, potentially limiting its performance in other languages. Furthermore, the model's behavior is influenced by the quality and diversity of its training data, which may include biases and harmful content.