Delphi was the home of a temple to Phoebus Apollo, which famously had the inscription, 'Know Thyself.' This library lets language models know themselves through automated interpretability.
This library provides utilities for generating and scoring text explanations of sparse autoencoder (SAE) and transcoder features. The explainer and scorer models can be run locally or accessed using API calls via OpenRouter.
The branch used for the article *Automatically Interpreting Millions of Features in Large Language Models* is the legacy branch `article_version`; that branch contains the scripts to reproduce our experiments. Note that we are still actively improving the codebase and that the newest version on the main branch may require slightly different usage.
Install this library as a local editable installation by running the following command from the `delphi` directory:

```bash
pip install -e .
```
To run the default pipeline from the command line, use the following command:
```bash
python -m delphi meta-llama/Meta-Llama-3-8B EleutherAI/sae-llama-3-8b-32x \
  --explainer_model 'hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4' \
  --dataset_repo 'EleutherAI/fineweb-edu-dedup-10b' \
  --dataset_split 'train[:1%]' \
  --n_tokens 10_000_000 \
  --max_latents 100 \
  --hookpoints layers.5 \
  --filter_bos \
  --name llama-3-8B
```
This command will:
- Cache activations for the first 10 million tokens of EleutherAI/fineweb-edu-dedup-10b.
- Generate explanations for the first 100 latents of layer 5 using the specified explainer model.
- Score the explanations using fuzzing and detection scorers.
- Log summary metrics including per-scorer F1 scores and confusion matrices, and produce histograms of the scorer classification accuracies.
The pipeline is highly configurable and can also be called programmatically (see the end-to-end test for an example).
To use other scorer types, instantiate a custom pipeline.
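For orientation, here is a rough sketch of what wiring a custom pipeline might look like. This is a sketch only: the `Pipeline` and `process_wrapper` helpers live in `delphi.pipeline`, but the explainer and scorer class names, their constructor arguments, and the `client` object below are assumptions, so check the package (and the end-to-end test) for the exact imports and signatures.

```python
import asyncio

from delphi.pipeline import Pipeline, process_wrapper

# Assumed class names: consult delphi.explainers and delphi.scorers for
# the explainers and scorers shipped with your version of the library.
from delphi.explainers import DefaultExplainer
from delphi.scorers import DetectionScorer

explainer = DefaultExplainer(client)  # `client` wraps your local or API model
scorer = DetectionScorer(client)

# A pipeline chains a feature loader (built in the sections below) through
# explanation and scoring stages; process_wrapper attaches optional
# pre-/post-processing hooks around each stage.
pipeline = Pipeline(
    loader,
    process_wrapper(explainer),
    process_wrapper(scorer),
)
asyncio.run(pipeline.run())
```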
The first step in generating explanations is to cache sparse model activations. To do so, load your sparse models into the base model, load the tokens you want to cache the activations from, create a `FeatureCache` object, and run it. We recommend caching over at least 10M tokens.
```python
from datasets import load_dataset
from sparsify.data import chunk_and_tokenize

from delphi.features import FeatureCache

data = load_dataset("EleutherAI/rpj-v2-sample", split="train[:1%]")
tokens = chunk_and_tokenize(
    data, tokenizer, max_seq_len=256, text_key="raw_content"
)["input_ids"]

# `model` is the base model with the sparse models loaded into it, and
# `submodule_dict` maps hookpoints to their sparse modules.
cache = FeatureCache(
    model,
    submodule_dict,
    batch_size=8,
)
cache.run(n_tokens=10_000_000, tokens=tokens)
```
Caching saves `.safetensors` files containing dictionaries with `activations` and `locations` keys.
```python
cache.save_splits(
    n_splits=5,
    save_dir="raw_latents",
)
```
Safetensors are split into shards over the width of the autoencoder.
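To sanity-check a cached shard, you can load it directly with the `safetensors` library. The shard path below is an assumption for illustration; the actual filenames in `save_dir` depend on the hookpoint and the range of latents covered by each split.

```python
from safetensors.numpy import load_file

# Hypothetical shard path; check save_dir for the names actually written.
shard = load_file("raw_latents/layers.5/0_26213.safetensors")
print(shard["activations"].shape)  # nonzero activation values
print(shard["locations"].shape)    # indices recording where each value fired
```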
The `.features` module provides utilities for reconstructing and sampling various statistics for sparse features. You need to specify the width of the autoencoder, the minimum number of examples for a feature to be included, the maximum number of examples to include, and the number of splits the features were divided into.
```python
from delphi.features import FeatureLoader, FeatureDataset
from delphi.config import FeatureConfig

feature_cfg = FeatureConfig(
    width=131072,        # Width of the autoencoder
    min_examples=200,    # Minimum number of examples for a feature to be included
    max_examples=10000,  # Maximum number of examples to include per feature
    n_splits=5,          # Number of splits the features were divided into
)

dataset = FeatureDataset(
    raw_dir="feature_folder",
    modules=[".model.layer.0"],  # This is a list of the different caches to load from
    cfg=feature_cfg,
)
```
The feature dataset constructs lazily loaded buffers that read activations into memory when called as an iterator object. You can iterate through the dataset using the `FeatureLoader` object, which takes the feature dataset, a constructor, and a sampler.
```python
loader = FeatureLoader(
    dataset=dataset,
    constructor=constructor,
    sampler=sampler,
)
```
We provide a simple sampler and constructor that take arguments from the `ExperimentConfig` object. The constructor builds the context windows from the cached activations and tokens, and the sampler divides these contexts into a training set used to generate explanations and a testing set used to evaluate them.
```python
from functools import partial

from delphi.features.constructors import default_constructor
from delphi.features.samplers import sample
from delphi.config import ExperimentConfig

experiment_cfg = ExperimentConfig(
    n_examples_train=40,     # Number of examples shown to the explainer model
    n_examples_test=100,     # Number of examples shown to the scorer models
    n_quantiles=10,          # Number of quantiles to divide the data into
    example_ctx_len=32,      # Length of each example
    n_non_activating=100,    # Number of non-activating examples shown to the scorer model
    train_type="quantiles",  # Type of sampler to use for training
    test_type="even",        # Type of sampler to use for testing
)

constructor = partial(
    default_constructor,
    tokens=dataset.tokens,
    n_not_active=experiment_cfg.n_non_activating,
    ctx_len=experiment_cfg.example_ctx_len,
    max_examples=feature_cfg.max_examples,
)
sampler = partial(sample, cfg=experiment_cfg)
```
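With the constructor and sampler in place, the loader can be iterated directly. As a rough sketch, each item is a feature record carrying the sampled examples for one feature; the attribute names below are assumptions, so inspect a record to see the actual fields.

```python
# Hypothetical field names: print a record to confirm its actual attributes.
for record in loader:
    print(record.feature, len(record.train), len(record.test))
    break
```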