Commit 3741dbd, committed by qnguyen3
1 Parent(s): d0d4129

Update README.md

Files changed (1)
  1. README.md +50 -1
README.md CHANGED
@@ -1,3 +1,52 @@
  ---
  license: apache-2.0
- ---
+ widget:
+ - text: My name is El Microondas the Wise, and
+   example_title: El Microondas
+ - text: Kennesaw State University is a public
+   example_title: Kennesaw State University
+ - text: Bungie Studios is an American video game developer. They are most famous for
+     developing the award winning Halo series of video games. They also made Destiny.
+     The studio was founded
+   example_title: Bungie
+ - text: The Mona Lisa is a world-renowned painting created by
+   example_title: Mona Lisa
+ - text: The Harry Potter series, written by J.K. Rowling, begins with the book titled
+   example_title: Harry Potter Series
+ - text: 'Question: I have cities, but no houses. I have mountains, but no trees. I
+     have water, but no fish. What am I?
+
+     Answer:'
+   example_title: Riddle
+ - text: The process of photosynthesis involves the conversion of
+   example_title: Photosynthesis
+ - text: Jane went to the store to buy some groceries. She picked up apples, oranges,
+     and a loaf of bread. When she got home, she realized she forgot
+   example_title: Story Continuation
+ - text: 'Problem 2: If a train leaves Station A at 9:00 AM and travels at 60 mph,
+     and another train leaves Station B at 10:00 AM and travels at 80 mph, when will
+     they meet if the distance between the stations is 300 miles?
+
+     To determine'
+   example_title: Math Problem
+ - text: In the context of computer programming, an algorithm is
+   example_title: Algorithm Definition
+ ---
+ # Mixsmol-4x400M-v0.1 by Ontocord
+ This is the first checkpoint (Epoch 1) of Mixsmol-4x400M-v0.1.
+ Note that this is an experiment in data mixing. Therefore, we only trained the model on 50B tokens (95% English and 5% Vietnamese) to test the following:
+ - Reasoning capabilities through pretraining on high-quality synthetic textbook data
+ - Crosslingual understanding through machine translation and multilingual, multi-task pretraining
+
+ After verifying our hypothesis with this run, we will schedule a second run with more data and compute so that the model can reach its full capability.
+
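As a rough illustration of the token budget stated above, the 95% / 5% language split of 50B tokens can be worked out as follows. This is only an arithmetic sketch: the helper name `tokens_per_language` is hypothetical, and the README does not say how the split was enforced during training.

```python
# Token budget taken from the README text: 50B tokens, 95% English, 5% Vietnamese.
TOTAL_TOKENS = 50_000_000_000
LANGUAGE_SHARE_PERCENT = {"english": 95, "vietnamese": 5}


def tokens_per_language(total, share_percent):
    """Split a token budget by integer percentage shares (hypothetical helper)."""
    if sum(share_percent.values()) != 100:
        raise ValueError("percentage shares must sum to 100")
    # Integer arithmetic avoids floating-point rounding on large counts.
    return {lang: total * pct // 100 for lang, pct in share_percent.items()}


budget = tokens_per_language(TOTAL_TOKENS, LANGUAGE_SHARE_PERCENT)
for lang, n_tokens in budget.items():
    print(f"{lang}: {n_tokens / 1e9:.1f}B tokens")
```

Under these numbers the run sees about 47.5B English tokens and 2.5B Vietnamese tokens.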
+ ## Data
+ - Synthetic Textbooks: 8M samples
+ - RefinedWeb: 1M samples
+ - RedPajama-v2: 500K samples
+ - MathPile: Everything
+ - ThePile: MiniPile Subset
+ - GoodWiki
+ - The Stack Smol XL
+ - The Vault: train_small split
+ - Instruction Pretraining: 250K samples
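
The data list above can be written down as a small table to see how the explicitly counted sources relate to each other. The sketch below is an assumption-laden illustration, not the authors' mixing recipe: the dictionary layout is hypothetical, sources without a stated sample count are marked `None`, and the percentages are shares of counted samples only (samples differ in length, so they are not token shares or sampling weights).

```python
# Sample counts copied from the Data list. None marks sources for which
# the README gives no sample count (for example "MathPile: Everything").
DATA_SOURCES = {
    "Synthetic Textbooks": 8_000_000,
    "RefinedWeb": 1_000_000,
    "RedPajama-v2": 500_000,
    "MathPile": None,
    "ThePile (MiniPile subset)": None,
    "GoodWiki": None,
    "The Stack Smol XL": None,
    "The Vault (train_small split)": None,
    "Instruction Pretraining": 250_000,
}

# Separate the sources with a stated count from those without one.
known = {name: n for name, n in DATA_SOURCES.items() if n is not None}
unspecified = [name for name, n in DATA_SOURCES.items() if n is None]
known_total = sum(known.values())

for name, count in known.items():
    share = count / known_total
    print(f"{name}: {count:,} samples ({share:.1%} of counted samples)")
print(f"Counted total: {known_total:,} samples")
print("No sample count given:", ", ".join(unspecified))
```

Among the four sources with stated counts, the synthetic textbooks dominate by sample count, which is consistent with the stated goal of testing reasoning through synthetic textbook pretraining.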