---
title: README
emoji: 📈
colorFrom: red
colorTo: yellow
sdk: static
pinned: false
---

# Pico: A Lightweight Framework for Studying Learning Dynamics

Pico is a lightweight research framework that demystifies how language models learn. Built with simplicity in mind, it provides an efficient way to train and study models of different sizes. Visit our [website](https://www.picolm.io/) for more information.

Pico consists of two key components:
1. **Pre-trained Model Suite** (hosted here on HuggingFace)
2. **Training Framework** (available on [GitHub](https://github.com/rdiehlmartinez/pico))

This HuggingFace organization hosts our pre-trained models and datasets, while the GitHub repository provides the infrastructure to train your own model suites from scratch.

## 🤗 HuggingFace Resources (You Are Here)

> 🚧 **Coming Soon!** Our complete suite of pre-trained models (1M to 1B parameters) is currently being trained and will be released here in January 2025. Watch this space or star our [GitHub repository](https://github.com/rdiehlmartinez/pico) for updates!

### Pre-trained Model Suite (Releasing January 2025)
Our complete suite of models from 1M to 1B parameters:
- **pico-tiny** (1M parameters) 
- **pico-small** (10M parameters)
- **pico-medium** (100M parameters)
- **pico-large** (500M parameters)
- **pico-xl** (1B parameters)

Each model includes:
- Complete training checkpoints
- Saved activations and gradients
- Pre-computed evaluation perplexity scores

### Available Datasets
1. **[pretokenized-dolma](https://huggingface.co/datasets/pico-lm/pretokenized-dolma)**
   - 420B tokens of pre-processed text
   - Cleaned and shuffled DOLMA corpus

2. **[pretokenized-dolma-tiny](https://huggingface.co/datasets/pico-lm/pretokenized-dolma-tiny)**
   - Smaller version for quick experiments

3. **[pretokenized-eval-batch](https://huggingface.co/datasets/pico-lm/pretokenized-eval-batch)**
   - A batch of evaluation data for generating model activations
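These datasets can be streamed with the 🤗 `datasets` library so you don't have to download the full corpus up front. A minimal sketch, assuming a `train` split and standard pre-tokenized columns (check each dataset card for the exact split and column names):

```python
from datasets import load_dataset

# Stream the pre-tokenized corpus instead of downloading all 420B tokens.
# The "train" split name is an assumption; see the dataset card for details.
dataset = load_dataset(
    "pico-lm/pretokenized-dolma",
    split="train",
    streaming=True,
)

# Peek at one example; the exact column names (e.g. "input_ids") may differ.
example = next(iter(dataset))
print(example.keys())
```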

## 🔧 GitHub Training Framework

Want to train your own suite of models? Visit our [GitHub repository](https://github.com/rdiehlmartinez/pico) to:
- Train models with custom architectures
- Experiment with different training regimes
- Modify checkpoint saving behavior
- Implement custom evaluation metrics

The training framework makes it easy to:
1. Train multiple models of different sizes
2. Ensure consistent training across all models
3. Save rich checkpoint data for learning dynamics analysis
4. Compare learning dynamics across scales

## 🛠️ Using the Resources

### Using Pre-trained Models (HuggingFace)
```python
from transformers import AutoModelForCausalLM

# Load our pre-trained model
model = AutoModelForCausalLM.from_pretrained("pico-lm/pico-small")

# Load a specific training checkpoint by revision
model = AutoModelForCausalLM.from_pretrained(
    "pico-lm/pico-small",
    revision="step-xyz"
)
```
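Each checkpoint is published as a separate revision of the model repository, so you can enumerate what is available before pinning one (the `step-xyz` revision above is just a placeholder). A small sketch using `huggingface_hub`:

```python
from huggingface_hub import list_repo_refs

# List the checkpoint revisions (branches and tags) published for a model repo.
# The exact revision naming scheme is repository-specific.
refs = list_repo_refs("pico-lm/pico-small")
for branch in refs.branches:
    print(branch.name)
for tag in refs.tags:
    print(tag.name)
```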

### Training Your Own Suite (GitHub)
```bash
# Clone the repository
git clone https://github.com/rdiehlmartinez/pico.git && cd pico
source setup.sh

# Configure your model suite
# Edit configs/train.yaml to specify model sizes and training parameters

# Train your suite
python train.py --config configs/train.yaml
```

## 📊 Model Details

### Architecture
All models (both our pre-trained suite and any models you train yourself) use:
- LLAMA-style transformer
- RMSNorm for normalization
- RoPE positional embeddings
- Multi-head attention with KV-cache
- SwiGLU activation function
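To make the SwiGLU choice concrete, here is an illustrative PyTorch sketch of a SwiGLU feed-forward block. The layer names and hidden dimension are assumptions for illustration, not Pico's actual implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """Illustrative SwiGLU feed-forward block (not Pico's exact code)."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_hidden, bias=False)  # gating projection
        self.w_up = nn.Linear(d_model, d_hidden, bias=False)    # value projection
        self.w_down = nn.Linear(d_hidden, d_model, bias=False)  # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU: SiLU(x @ W_gate) * (x @ W_up), then project back to d_model
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```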

### Training Configuration
Standard configuration (customizable via the GitHub training framework):
- Batch size: 1024
- Learning rate: 1e-3
- Weight decay: 0.1
- Gradient clipping: 1.0
- Mixed precision training
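As a rough sketch of how these hyperparameters fit together in a PyTorch training step (the AdamW optimizer, CUDA device, and the `model`/`dataloader` objects are assumptions here; `configs/train.yaml` in the GitHub repository is the authoritative source):

```python
import torch

# Assumes `model` and `dataloader` are already constructed.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.1)
scaler = torch.cuda.amp.GradScaler()  # mixed precision

for batch in dataloader:  # effective batch size of 1024 sequences
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda"):
        loss = model(**batch).loss
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip at 1.0
    scaler.step(optimizer)
    scaler.update()
```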

## 🔬 Research Applications

Pico is designed for researchers studying:
- Learning dynamics across model scales
- Mechanistic interpretability
- Architecture and training effects
- Emergent model behaviors

Whether using our pre-trained models or training your own suite, Pico provides the tools needed for in-depth learning dynamics research.

## 🤝 Contributing

Contributions are welcome on both platforms:
- **HuggingFace**: Model weights, datasets, and evaluation results
- **GitHub**: Training framework improvements, analysis tools, and documentation

## 📫 Contact

- GitHub: [rdiehlmartinez/pico](https://github.com/rdiehlmartinez/pico)
- Author: [Richard Diehl Martinez](https://richarddiehlmartinez.com)

## 🔍 Citation

```bibtex
@software{pico2024,
    author = {Diehl Martinez, Richard},
    title = {Pico: Framework for Training Tiny Language Models},
    year = {2024},
}
```