license: apache-2.0
datasets:
  - cerebras/SlimPajama-627B
  - bigcode/starcoderdata
language:
  - en

TinyLlama-1.1B

The TinyLlama project aims to pretrain a 1.1B-parameter Llama model on 3 trillion tokens. With proper optimization, this can be achieved in "just" 90 days using 16 A100-40G GPUs 🚀🚀. Training started on 2023-09-01.
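To make the 90-day target concrete, the following back-of-the-envelope sketch computes the sustained throughput the plan implies. It uses only the figures stated above and assumes uninterrupted training with no restarts or downtime, which is an idealization rather than a measured result. The function name `required_throughput` is made up for this illustration.

```python
# Throughput implied by the stated plan: 3T tokens, 90 days, 16 GPUs.
# Assumption: training runs continuously with no downtime.
TOTAL_TOKENS = 3e12   # 3 trillion tokens (from the text)
TRAINING_DAYS = 90    # planned duration (from the text)
NUM_GPUS = 16         # A100-40G GPUs (from the text)

SECONDS_PER_DAY = 24 * 60 * 60


def required_throughput(total_tokens, days, num_gpus):
    """Return (cluster tokens/s, per-GPU tokens/s) needed to finish on time."""
    cluster = total_tokens / (days * SECONDS_PER_DAY)
    return cluster, cluster / num_gpus


cluster_tps, per_gpu_tps = required_throughput(TOTAL_TOKENS, TRAINING_DAYS, NUM_GPUS)
print(f"cluster: {cluster_tps:,.0f} tokens/s")
print(f"per GPU: {per_gpu_tps:,.0f} tokens/s")
```

Under these assumptions, each GPU has to sustain roughly 24k tokens per second for the full 90 days.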

We adopted exactly the same architecture and tokenizer as Llama 2, so TinyLlama can be used as a drop-in replacement in many open-source projects built on Llama. With only 1.1B parameters, it is also compact enough for applications with a restricted computation and memory footprint.

Release Schedule

We will roll out intermediate checkpoints according to the schedule below. Some baseline models are included for comparison.

| Date | HF Checkpoint | Tokens | Step | HellaSwag Acc_norm |
|---|---|---|---|---|
| Baseline | StableLM-Alpha-3B | 800B | -- | 38.31 |
| Baseline | Pythia-1B-intermediate-step-50k-105b | 105B | 50k | 42.04 |
| Baseline | Pythia-1B | 300B | 143k | 47.16 |
| 2023-09-04 | TinyLlama-1.1B-intermediate-step-50k-105b | 105B | 50k | 43.50 |
| 2023-09-16 | -- | 500B | -- | -- |
| 2023-10-01 | -- | 1T | -- | -- |
| 2023-10-16 | -- | 1.5T | -- | -- |
| 2023-10-31 | -- | 2T | -- | -- |
| 2023-11-15 | -- | 2.5T | -- | -- |
| 2023-12-01 | -- | 3T | -- | -- |

TinyLlama has progressed well so far 🎉🎉: after 105B tokens it reaches 43.50 on HellaSwag (Acc_norm), ahead of the Pythia-1B checkpoint at the same token count (42.04).

Meanwhile, you can track the live cross-entropy loss here.
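The Tokens and Step columns in the schedule are linked by the batch size given under Training Details (2048 * 1024 tokens per step). The sketch below shows that conversion. It reads 2048 as the sequence length and 1024 as the number of sequences per batch, which is an interpretation of the table. The helper names are hypothetical, and the step counts it prints for future milestones are estimates, not figures published by the authors.

```python
# Convert between optimizer steps and tokens seen.
# Assumption: every step consumes one full batch of 2048 * 1024 tokens.
SEQUENCE_LENGTH = 2048
SEQUENCES_PER_BATCH = 1024
TOKENS_PER_STEP = SEQUENCE_LENGTH * SEQUENCES_PER_BATCH  # 2,097,152


def tokens_at_step(step):
    """Tokens consumed after `step` optimizer steps."""
    return step * TOKENS_PER_STEP


def step_for_tokens(tokens):
    """Smallest step count that has consumed at least `tokens` tokens."""
    return -(-tokens // TOKENS_PER_STEP)  # ceiling division on integers


# The released checkpoint: 50k steps is about 105B tokens.
print(f"50k steps   -> {tokens_at_step(50_000) / 1e9:.1f}B tokens")
# The full run: 1430k steps is about 3T tokens.
print(f"1430k steps -> {tokens_at_step(1_430_000) / 1e12:.3f}T tokens")

# Estimated steps for the remaining milestones (not published values).
for label, tokens in [("500B", 500 * 10**9), ("1T", 10**12), ("2T", 2 * 10**12)]:
    print(f"{label:>4} tokens -> about step {step_for_tokens(tokens):,}")
```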

Training Details

Below are some details of our training setup:

| Setting | Description |
|---|---|
| Parameters | 1.1B |
| Attention Variant | Grouped Query Attention |
| Model Size | Layers: 22, Heads: 32, Query Groups: 4, Embedding Size: 2048, Intermediate Size (SwiGLU): 5632 |
| Sequence Length | 2048 |
| Batch Size | 2 million tokens (2048 * 1024) |
| Learning Rate | 4e-4 |
| Learning Rate Schedule | Cosine with 2000 warmup steps |
| Training Data | SlimPajama & Starcoderdata |
| Data Preprocessing | Excluded GitHub subset of SlimPajama; sampled all code from Starcoderdata |
| Combined Dataset Size | 1 trillion tokens |
| Total Tokens During Training | 3 trillion (3 epochs / 1430k steps) |
| Natural Language to Code Ratio | 7:3 |
| Hardware | 16 A100-40G GPUs |
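As a sanity check on the "1.1B" figure, the parameter count can be reconstructed from the Model Size row. The sketch below is an estimate under stated assumptions, not the authors' accounting: a vocabulary of 32,000 tokens (the Llama 2 tokenizer size, which the table does not state), no bias terms, one weight vector per normalization layer, and an output projection that is not tied to the input embedding. The function name `count_parameters` is made up for this illustration.

```python
# Estimate the parameter count from the architecture in the table above.
N_LAYERS = 22
N_HEADS = 32
N_KV_GROUPS = 4      # "Query Groups" in grouped query attention
D_MODEL = 2048       # embedding size
D_FF = 5632          # SwiGLU intermediate size
VOCAB_SIZE = 32_000  # assumption: Llama 2 tokenizer vocabulary, not in the table


def count_parameters(tied_embeddings=False):
    """Count weights, assuming no biases and weight-only normalization layers."""
    head_dim = D_MODEL // N_HEADS      # 64
    kv_dim = N_KV_GROUPS * head_dim    # 256: keys/values are shared per group

    # Query and output projections are full size; key and value are reduced.
    attention = 2 * D_MODEL * D_MODEL + 2 * D_MODEL * kv_dim
    # SwiGLU uses three matrices: gate, up, and down projections.
    mlp = 3 * D_MODEL * D_FF
    # Two normalization layers per block (before attention and before the MLP).
    norms = 2 * D_MODEL
    per_layer = attention + mlp + norms

    embeddings = VOCAB_SIZE * D_MODEL
    output_head = 0 if tied_embeddings else VOCAB_SIZE * D_MODEL
    final_norm = D_MODEL
    return N_LAYERS * per_layer + embeddings + output_head + final_norm


total = count_parameters()
print(f"{total:,} parameters (~{total / 1e9:.2f}B)")
```

Under these assumptions the total comes to 1,100,048,384, consistent with the 1.1B figure. Grouped query attention is what keeps the key and value projections at 2048 x 256 instead of 2048 x 2048.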