Simingh committed (verified) · commit fc98136 · parent: 1a8f295

Update README.md

Files changed (1): README.md (+39 -3)
---
license: mit
datasets:
- OpenCoder-LLM/fineweb-code-corpus
- OpenCoder-LLM/fineweb-math-corpus
- OpenCoder-LLM/RefineCode-code-corpus-meta
- OpenCoder-LLM/opc-annealing-corpus
language:
- en
- zh
---

<div align="center">
  <img src="https://github.com/OpenCoder-llm/opencoder-llm.github.io/blob/main/static/images/opencoder_icon.jpg?raw=true" width="50%" alt="OpenCoder-Icon" />
</div>

<p align="center">
  <!-- <a href="https://arxiv.org/pdf/2411.04905"><b>Paper Link</b>👁️</a> -->
  🏠 <a href="https://opencoder-llm.github.io/">Home Page</a>&nbsp;&nbsp; |
  &nbsp;&nbsp; 🤗 <a href="https://huggingface.co/collections/infly/opencoder-672cec44bbb86c39910fb55e">Model</a>&nbsp;&nbsp; |
  &nbsp;&nbsp; 📊 <a href="https://huggingface.co/collections/OpenCoder-LLM/opencoder-datasets-672e6db6a0fed24bd69ef1c2">Dataset</a>&nbsp;&nbsp; |
  &nbsp;&nbsp; 📄 <a href="https://arxiv.org/abs/2411.04905">Paper</a>&nbsp;&nbsp;
</p>

## 1. Introduction

**OpenCoder** is an open and reproducible code LLM family that includes 1.5B and 8B base and chat models, supporting both English and Chinese. Pretrained from scratch on 2.5 trillion tokens composed of 90% raw code and 10% code-related web data, and supervised fine-tuned on over 4.5M high-quality SFT examples, OpenCoder reaches the performance of top-tier code LLMs.

This repository contains all the intermediate checkpoints of OpenCoder-1.5B-Base, saved in different branches. For the final model, please refer to 🤗 [OpenCoder-1.5B-Base](https://huggingface.co/infly/OpenCoder-1.5B-Base).
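
For quick reference, here is a minimal sketch of loading the final base model with the standard `transformers` API; the prompt and generation settings are illustrative, not the authors' recommended configuration:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Final released base model linked above.
model_id = "infly/OpenCoder-1.5B-Base"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # assumption: bf16 to reduce memory; fp32 also works
    trust_remote_code=True,
)

# Base (non-chat) models are prompted with plain code completion.
inputs = tokenizer("def fibonacci(n):", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```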

## 2. Branches Overview

- `pretrain_iter_0001000` - `pretrain_iter_0484865`: intermediate checkpoints during the pretraining stage.

- `anneal_iter_0001000` - `anneal_iter_0023841`: intermediate checkpoints during the annealing stage.

We use `pretrain_iter_0373000` as the starting point for the annealing stage, and `anneal_iter_0023000` as the final base model.
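
Because each checkpoint lives in its own branch, a specific one can be pulled with the `revision` argument of `from_pretrained`. A minimal sketch follows; the repo id below is a hypothetical placeholder for this repository's actual id:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical placeholder: substitute this repository's actual id.
repo_id = "OpenCoder-LLM/OpenCoder-1.5B-Base-intermediate"
# Branch name of the desired checkpoint, e.g. the final annealed one.
branch = "anneal_iter_0023000"

tokenizer = AutoTokenizer.from_pretrained(repo_id, revision=branch)
model = AutoModelForCausalLM.from_pretrained(repo_id, revision=branch)
```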