Pico Train is a lightweight framework for training language models—from tiny-scale (~1M parameters) to mid-scale (~1B parameters)—with built-in rich checkpointing that captures activations, gradients, and model states, enabling detailed learning dynamics research.
Our suite of pre-trained models is publicly available on our Hugging Face organization, and the dedicated companion library pico-analyze supports deeper checkpoint analysis.
For a detailed run-through, check out the full tutorial on our website at picolm.io.
Pico Decoder: LLAMA-style Transformer Architecture
- RMSNorm, RoPE, multi-head self-attention with KV-cache, and SwiGLU activations
- Currently supports the pico-decoder model, with future expansions planned (pico-diffusion, pico-statespace, etc.)
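For orientation, here is a minimal PyTorch sketch of two of these building blocks, RMSNorm and SwiGLU. It illustrates the idea only and is not the implementation in src/model/pico_decoder.py.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square norm: rescales by 1/RMS(x), with no mean subtraction."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return self.weight * x * rms

class SwiGLU(nn.Module):
    """Gated feed-forward block: silu(W1 x) * (W3 x), projected back down by W2."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))
```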
Comprehensive Checkpoints
- Saves model states, optimizer states, and training metadata
- Enriched with activation and gradient snapshots for interpretability
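Conceptually, a checkpoint bundles these pieces together. The sketch below is illustrative only; the actual layout and file names are defined in src/checkpointing.

```python
import torch

def save_checkpoint(path: str, model, optimizer, step: int, metrics: dict) -> None:
    """Illustrative sketch: bundle model/optimizer state plus training metadata."""
    torch.save(
        {
            "step": step,
            "model_state_dict": model.state_dict(),
            "optimizer_state_dict": optimizer.state_dict(),
            "metrics": metrics,  # e.g. {"perplexity": ...}
        },
        path,
    )
```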
Focused Scale Range
- Optimized to train models from 1M to 1B parameters, where learning dynamics research is most viable
Clean, Pre-tokenized Data
- Uses a pre-tokenized, pre-shuffled version of Dolma that we make available on Hugging Face
- Facilitates training models using identical data for consistency and comparability
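Streaming that data with the datasets library might look like the sketch below. The dataset id is an assumption; check our Hugging Face organization for the exact name and column layout.

```python
from datasets import load_dataset

# Assumed dataset id -- verify the exact name on the pico-lm Hugging Face org.
dataset = load_dataset("pico-lm/pretokenized-dolma", split="train", streaming=True)

for example in dataset.take(1):
    print(example.keys())  # pre-tokenized rows, e.g. a field of token ids
```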
Research Ready
- Minimal, well-documented code suitable for forking and tailoring
- Logs essential metrics (e.g. perplexity) throughout training
- Works seamlessly with pico-analyze for advanced post-training interpretation
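For reference, perplexity is just the exponential of the mean token-level cross-entropy (in nats), e.g.:

```python
import math
import torch
import torch.nn.functional as F

def perplexity(logits: torch.Tensor, targets: torch.Tensor) -> float:
    """Perplexity = exp(mean token-level cross-entropy)."""
    nll = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
    return math.exp(nll.item())
```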
All models in the Pico suite (both pre-trained and user-trained):
- Employ identical architectures and optimizer settings
- Share the same data order and tokens
- Automatically log rich checkpoint data (including activations, gradients)
- Facilitate direct cross-scale comparisons
This uniformity means you can isolate model size as the primary variable, giving you clearer insights into how model capacity affects learning.
- Pre-trained Models (1M–1B parameters), publicly hosted on Hugging Face
- Pre-tokenized Datasets for straightforward streaming-based training
- Extensive Checkpoints logging activation and gradient snapshots
- Evaluation Metrics (perplexity and more) tracked at each checkpoint
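Loading one of the pre-trained models from the Hub might look like the sketch below. The repository id is illustrative, and whether trust_remote_code is required depends on how the architecture is registered; see the model cards on our Hugging Face organization.

```python
from transformers import AutoModelForCausalLM

# Illustrative repo id -- check the pico-lm Hugging Face org for the exact names.
model = AutoModelForCausalLM.from_pretrained(
    "pico-lm/pico-decoder-tiny", trust_remote_code=True
)
print(sum(p.numel() for p in model.parameters()), "parameters")
```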
Pico-Decoder Model
- LLAMA-style auto-regressive transformer
- RMSNorm
- RoPE (Rotary Positional Embeddings)
- Multi-head attention with KV-cache
- SwiGLU activation
Future plans include additional architectures like pico-diffusion and pico-statespace.
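As a refresher on the RoPE component, the sketch below rotates pairs of query/key dimensions by position-dependent angles. It uses the interleaved-pair convention, which may differ from the convention used in src/model/pico_decoder.py.

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotary positional embeddings for x of shape (batch, seq, heads, head_dim)."""
    _, seq_len, _, head_dim = x.shape
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))
    angles = torch.outer(torch.arange(seq_len, dtype=torch.float32), inv_freq)  # (seq, d/2)
    cos = angles.cos()[None, :, None, :]  # broadcast over batch and heads
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., 0::2], x[..., 1::2]   # interleaved pairs of dimensions
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin  # rotate each pair by its angle
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```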
Training & Checkpointing
- Automatic storage of model and optimizer states
- Periodic hooks for saving learning dynamics (activations, gradients)
- Optional logging to Weights & Biases
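The idea behind these learning-dynamics snapshots can be sketched with plain PyTorch hooks; the names below are illustrative, and the actual logic lives in src/checkpointing.

```python
import torch

captured = {"activations": {}, "gradients": {}}

def attach_activation_hook(module: torch.nn.Module, name: str):
    """Capture a module's output on each forward pass (assumes a tensor output)."""
    def hook(_module, _inputs, output):
        captured["activations"][name] = output.detach().cpu()
    return module.register_forward_hook(hook)

def snapshot_gradients(model: torch.nn.Module) -> None:
    """Copy parameter gradients after loss.backward() has been called."""
    for name, param in model.named_parameters():
        if param.grad is not None:
            captured["gradients"][name] = param.grad.detach().cpu().clone()
```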
Config-Driven Setup
- Specify architecture, optimizer, dataset, and logging settings in YAML
- Straightforward to extend or modify
Clone the Repository
git clone https://github.com/pico-lm/pico-train
cd pico-train
Configure Environment
Create a .env file at the root with your Hugging Face and Weights & Biases tokens:
export HF_TOKEN=your_huggingface_token
export WANDB_API_KEY=your_wandb_key
Install Dependencies
source setup.sh
This script checks your environment, installs necessary tools, and sets up a Poetry virtual environment.
Train Your Model Suite
- Edit (or create) a config file (e.g., configs/demo.yaml) to specify your architecture and training preferences.
- Then run:
poetry run train --config_path configs/demo.yaml
- This launches training, automatically checkpointing states and saving learning dynamics data.
Explore Checkpoints
- By default, checkpoints are stored under runs/YOUR_RUN_NAME/checkpoints/.
- Each checkpoint contains:
- Model state (PyTorch + Hugging Face formats)
- Optimizer state
- Gradients and activations for interpretability
- Evaluation logs (e.g. perplexity) and metrics
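Saved states can be inspected with plain PyTorch. The path and file name below are illustrative; look inside your own run directory for the exact layout.

```python
import torch

# Illustrative path -- inspect runs/YOUR_RUN_NAME/checkpoints/ for the real layout.
ckpt = torch.load("runs/demo/checkpoints/step_1000/checkpoint.pt", map_location="cpu")
print(list(ckpt.keys()))  # e.g. model/optimizer states plus saved metadata
```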
src/model/pico_decoder.py
- Core LLAMA-style decoder implementation (attention, RMSNorm, RoPE, etc.)
src/training/trainer.py
- Main training loop
- Manages distributed and multi-node settings
- Collects/logs metrics
- Orchestrates checkpoint saving
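Schematically, the trainer orchestrates something like the simplified single-process loop below; the real implementation also handles distributed settings, evaluation, and richer logging.

```python
import torch

def train_loop(model, dataloader, optimizer, *, log_every=100, checkpoint_every=1000,
               log_metrics=print, save_checkpoint=lambda *args: None):
    """Simplified sketch of the orchestration, not the actual trainer code."""
    loss_fn = torch.nn.CrossEntropyLoss()
    for step, batch in enumerate(dataloader):
        logits = model(batch["input_ids"])                    # (batch, seq, vocab)
        loss = loss_fn(logits.flatten(0, 1), batch["labels"].flatten())
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        if step % log_every == 0:
            log_metrics({"step": step, "loss": loss.item()})  # e.g. to Weights & Biases
        if step % checkpoint_every == 0:
            save_checkpoint(step, model, optimizer)           # plus activation/gradient snapshots
```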
src/checkpointing
- Logic for saving model states, gradients, activations
- Tools for uploading checkpoints to Hugging Face
src/config
- Flexible Dataclass-based config system (model and training hyperparameters, checkpointing, logging)
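An illustrative shape of such a config is sketched below; the field names are assumptions, and the real definitions live in src/config and configs/demo.yaml.

```python
from dataclasses import dataclass, field

@dataclass
class ModelConfig:
    d_model: int = 96        # illustrative defaults, not the repository's
    n_layers: int = 12
    n_heads: int = 12

@dataclass
class TrainingConfig:
    learning_rate: float = 3e-4
    max_steps: int = 200_000

@dataclass
class RunConfig:
    model: ModelConfig = field(default_factory=ModelConfig)
    training: TrainingConfig = field(default_factory=TrainingConfig)
```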
configs/demo.yaml
- Example config with default values for quick experimentation
For deeper checkpoint analysis—comparing gradients, tracking representation shifts, measuring sparsity—use our companion repository pico-analyze. It automatically processes pico-train checkpoints and applies advanced metrics like CKA, PWCCA, Gini, Hoyer, and more to reveal how your models learn over time.
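As one example of these metrics, linear CKA compares two activation matrices of shape (n_samples, n_features). The snippet below is a standalone reference implementation, not pico-analyze's own code.

```python
import torch

def linear_cka(x: torch.Tensor, y: torch.Tensor) -> float:
    """Linear CKA between activation matrices x and y of shape (n_samples, n_features)."""
    x = x - x.mean(dim=0, keepdim=True)   # center each feature
    y = y - y.mean(dim=0, keepdim=True)
    cross = (y.T @ x).norm(p="fro") ** 2  # ||Y^T X||_F^2
    return (cross / ((x.T @ x).norm(p="fro") * (y.T @ y).norm(p="fro"))).item()
```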
Pico is open-source under the Apache License 2.0.
If you use Pico in your research, please cite:
@software{pico2025,
author = {Diehl Martinez, Richard},
title = {Pico: A Lightweight Framework for Studying Language Model Learning Dynamics},
year = {2025},
url = {https://github.com/pico-lm}
}
Happy Training! For more information and tutorials, visit our website at picolm.io.