Pico Train is a lightweight framework for training language models—from tiny-scale (~1M parameters) to mid-scale (~1B parameters)—with built-in rich checkpointing that captures activations, gradients, and model states, enabling detailed learning dynamics research.
Our suite of pre-trained models is publicly available on our Hugging Face organization, and the dedicated companion library pico-analyze supports deeper checkpoint analysis.
For a detailed run-through, check out the full tutorial on our website at picolm.io.
Pico Decoder: LLAMA-style Transformer Architecture
- RMSNorm, RoPE, multi-head self-attention with KV-cache, and SwiGLU activations
- Currently supports the pico-decoder model, with future expansions planned (pico-diffusion, pico-statespace, etc.)
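For orientation, here is a minimal PyTorch sketch of two of these building blocks, RMSNorm and SwiGLU. It illustrates the idea only and is not the implementation in src/model/pico_decoder.py.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square norm: rescales by 1/RMS(x), with no mean subtraction."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return self.weight * x * rms

class SwiGLU(nn.Module):
    """Gated feed-forward block: silu(W1 x) * (W3 x), projected back down by W2."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))
```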
Comprehensive Checkpoints
- Saves model states, optimizer states, and training metadata
- Enriched with activation and gradient snapshots for interpretability
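Conceptually, a checkpoint bundles these pieces together. The sketch below is illustrative only; the actual layout and file names are defined in src/checkpointing.

```python
import torch

def save_checkpoint(path: str, model, optimizer, step: int, metrics: dict) -> None:
    """Illustrative sketch: bundle model/optimizer state plus training metadata."""
    torch.save(
        {
            "step": step,
            "model_state_dict": model.state_dict(),
            "optimizer_state_dict": optimizer.state_dict(),
            "metrics": metrics,  # e.g. {"perplexity": ...}
        },
        path,
    )
```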
Focused Scale Range
- Optimized to train models from 1M to 1B parameters, where learning dynamics research is most viable
Clean, Pre-tokenized Data
- Uses a pre-tokenized, pre-shuffled version of Dolma that we make available on Hugging Face
- Facilitates training models using identical data for consistency and comparability
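Streaming that data with the datasets library might look like the sketch below. The dataset id is an assumption; check our Hugging Face organization for the exact name and column layout.

```python
from datasets import load_dataset

# Assumed dataset id -- verify the exact name on the pico-lm Hugging Face org.
dataset = load_dataset("pico-lm/pretokenized-dolma", split="train", streaming=True)

for example in dataset.take(1):
    print(example.keys())  # pre-tokenized rows, e.g. a field of token ids
```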
Research Ready
- Minimal, well-documented code suitable for forking and tailoring
- Logs essential metrics (e.g. perplexity) throughout training
- Works seamlessly with pico-analyze for advanced post-training interpretation
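For reference, perplexity is just the exponential of the mean token-level cross-entropy (in nats), e.g.:

```python
import math
import torch
import torch.nn.functional as F

def perplexity(logits: torch.Tensor, targets: torch.Tensor) -> float:
    """Perplexity = exp(mean token-level cross-entropy)."""
    nll = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
    return math.exp(nll.item())
```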
All models in the Pico suite (both pre-trained and user-trained):
- Employ identical architectures and optimizer settings
- Share the same data order and tokens
- Automatically log rich checkpoint data (including activations, gradients)
- Facilitate direct cross-scale comparisons
This uniformity means you can isolate model size as the primary variable, giving you clearer insights into how model capacity affects learning.
- Pre-trained Models (1M–1B parameters), publicly hosted on Hugging Face
- Pre-tokenized Datasets for straightforward streaming-based training
- Extensive Checkpoints logging activation and gradient snapshots
- Evaluation Metrics (perplexity and more) tracked at each checkpoint
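Loading one of the pre-trained models from the Hub might look like the sketch below. The repository id is illustrative, and whether trust_remote_code is required depends on how the architecture is registered; see the model cards on our Hugging Face organization.

```python
from transformers import AutoModelForCausalLM

# Illustrative repo id -- check the pico-lm Hugging Face org for the exact names.
model = AutoModelForCausalLM.from_pretrained(
    "pico-lm/pico-decoder-tiny", trust_remote_code=True
)
print(sum(p.numel() for p in model.parameters()), "parameters")
```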
Pico-Decoder Model
- LLAMA-style auto-regressive transformer
- RMSNorm
- RoPE (Rotary Positional Embeddings)
- Multi-head attention with KV-cache
- SwiGLU activation
Future plans include additional architectures like pico-diffusion and pico-statespace.
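As a refresher on the RoPE component, the sketch below rotates pairs of query/key dimensions by position-dependent angles. It uses the interleaved-pair convention, which may differ from the convention used in src/model/pico_decoder.py.

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotary positional embeddings for x of shape (batch, seq, heads, head_dim)."""
    _, seq_len, _, head_dim = x.shape
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))
    angles = torch.outer(torch.arange(seq_len, dtype=torch.float32), inv_freq)  # (seq, d/2)
    cos = angles.cos()[None, :, None, :]  # broadcast over batch and heads
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., 0::2], x[..., 1::2]   # interleaved pairs of dimensions
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin  # rotate each pair by its angle
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```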
Training & Checkpointing
- Automatic storage of model and optimizer states
- Periodic hooks for saving learning dynamics (activations, gradients)
- Optional logging to Weights & Biases
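The idea behind these learning-dynamics snapshots can be sketched with plain PyTorch hooks; the names below are illustrative, and the actual logic lives in src/checkpointing.

```python
import torch

captured = {"activations": {}, "gradients": {}}

def attach_activation_hook(module: torch.nn.Module, name: str):
    """Capture a module's output on each forward pass (assumes a tensor output)."""
    def hook(_module, _inputs, output):
        captured["activations"][name] = output.detach().cpu()
    return module.register_forward_hook(hook)

def snapshot_gradients(model: torch.nn.Module) -> None:
    """Copy parameter gradients after loss.backward() has been called."""
    for name, param in model.named_parameters():
        if param.grad is not None:
            captured["gradients"][name] = param.grad.detach().cpu().clone()
```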
Config-Driven Setup
- Specify architecture, optimizer, dataset, and logging settings in YAML
- Straightforward to extend or modify
Clone the Repository
git clone https://github.com/pico-lm/pico-train
cd pico-train
Configure Environment
Create a .env file at the root with your Hugging Face and Weights & Biases tokens:
export HF_TOKEN=your_huggingface_token
export WANDB_API_KEY=your_wandb_key
Install Dependencies
source setup.sh
This script checks your environment, installs necessary tools, and sets up a Poetry virtual environment.
Train Your Model Suite
- Edit (or create) a config file (e.g., configs/demo.yaml) to specify your architecture and training preferences.
- Then run:
poetry run train --config_path configs/demo.yaml
- This launches training, automatically checkpointing states and saving learning dynamics data.
Explore Checkpoints
- By default, checkpoints are stored under runs/YOUR_RUN_NAME/checkpoints/.
- Each checkpoint contains:
- Model state (PyTorch + Hugging Face formats)
- Optimizer state
- Gradients and activations for interpretability
- Evaluation logs (e.g. perplexity) and metrics
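Saved states can be inspected with plain PyTorch. The path and file name below are illustrative; look inside your own run directory for the exact layout.

```python
import torch

# Illustrative path -- inspect runs/YOUR_RUN_NAME/checkpoints/ for the real layout.
ckpt = torch.load("runs/demo/checkpoints/step_1000/checkpoint.pt", map_location="cpu")
print(list(ckpt.keys()))  # e.g. model/optimizer states plus saved metadata
```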
src/model/pico_decoder.py
- Core LLAMA-style decoder implementation (attention, RMSNorm, RoPE, etc.)
src/training/trainer.py
- Main training loop
- Manages distributed and multi-node settings
- Collects/logs metrics
- Orchestrates checkpoint saving
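Schematically, the trainer orchestrates something like the simplified single-process loop below; the real implementation also handles distributed settings, evaluation, and richer logging.

```python
import torch

def train_loop(model, dataloader, optimizer, *, log_every=100, checkpoint_every=1000,
               log_metrics=print, save_checkpoint=lambda *args: None):
    """Simplified sketch of the orchestration, not the actual trainer code."""
    loss_fn = torch.nn.CrossEntropyLoss()
    for step, batch in enumerate(dataloader):
        logits = model(batch["input_ids"])                    # (batch, seq, vocab)
        loss = loss_fn(logits.flatten(0, 1), batch["labels"].flatten())
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        if step % log_every == 0:
            log_metrics({"step": step, "loss": loss.item()})  # e.g. to Weights & Biases
        if step % checkpoint_every == 0:
            save_checkpoint(step, model, optimizer)           # plus activation/gradient snapshots
```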
src/checkpointing
- Logic for saving model states, gradients, activations
- Tools for uploading checkpoints to Hugging Face
src/config
- Flexible Dataclass-based config system (model and training hyperparameters, checkpointing, logging)
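An illustrative shape of such a config is sketched below; the field names are assumptions, and the real definitions live in src/config and configs/demo.yaml.

```python
from dataclasses import dataclass, field

@dataclass
class ModelConfig:
    d_model: int = 96        # illustrative defaults, not the repository's
    n_layers: int = 12
    n_heads: int = 12

@dataclass
class TrainingConfig:
    learning_rate: float = 3e-4
    max_steps: int = 200_000

@dataclass
class RunConfig:
    model: ModelConfig = field(default_factory=ModelConfig)
    training: TrainingConfig = field(default_factory=TrainingConfig)
```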
configs/demo.yaml
- Example config with default values for quick experimentation
For deeper checkpoint analysis—comparing gradients, tracking representation shifts, measuring sparsity—use our companion repository pico-analyze. It automatically processes pico-train checkpoints and applies advanced metrics like CKA, PWCCA, Gini, Hoyer, and more to reveal how your models learn over time.
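As one example of these metrics, linear CKA compares two activation matrices of shape (n_samples, n_features). The snippet below is a standalone reference implementation, not pico-analyze's own code.

```python
import torch

def linear_cka(x: torch.Tensor, y: torch.Tensor) -> float:
    """Linear CKA between activation matrices x and y of shape (n_samples, n_features)."""
    x = x - x.mean(dim=0, keepdim=True)   # center each feature
    y = y - y.mean(dim=0, keepdim=True)
    cross = (y.T @ x).norm(p="fro") ** 2  # ||Y^T X||_F^2
    return (cross / ((x.T @ x).norm(p="fro") * (y.T @ y).norm(p="fro"))).item()
```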
Pico is open-source under the Apache License 2.0.
If you use Pico in your research, please cite:
@software{pico2025,
author = {Diehl Martinez, Richard},
title = {Pico: A Lightweight Framework for Studying Language Model Learning Dynamics},
year = {2025},
url = {https://github.com/pico-lm}
}
Happy Training! For more information and tutorials, visit our website at picolm.io.