Delphi was the home of a temple to Phoebus Apollo, which famously had the inscription, 'Know Thyself.' This library lets language models know themselves through automated interpretability.
This library provides utilities for generating and scoring text explanations of sparse autoencoder (SAE) and transcoder features. The explainer and scorer models can be run locally or accessed using API calls via OpenRouter.
The branch used for the article *Automatically Interpreting Millions of Features in Large Language Models* is the legacy branch `article_version`; that branch contains the scripts to reproduce our experiments. Note that we are still actively improving the codebase and that the newest version on the main branch may require slightly different usage.
Install this library as a local editable installation by running the following command from the `delphi` directory:

```bash
pip install -e .
```
To run the default pipeline from the command line, use the following command:
```bash
python -m delphi meta-llama/Meta-Llama-3-8B EleutherAI/sae-llama-3-8b-32x \
  --explainer_model 'hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4' \
  --dataset_repo 'EleutherAI/fineweb-edu-dedup-10b' \
  --dataset_split 'train[:1%]' \
  --n_tokens 10_000_000 \
  --max_latents 100 \
  --hookpoints layers.5 \
  --filter_bos \
  --name llama-3-8B
```
This command will:
- Cache activations for the first 10 million tokens of EleutherAI/fineweb-edu-dedup-10b.
- Generate explanations for the first 100 latents of layer 5 using the specified explainer model.
- Score the explanations using fuzzing and detection scorers.
- Log summary metrics including per-scorer F1 scores and confusion matrices, and produce histograms of the scorer classification accuracies.
The pipeline is highly configurable and can also be called programmatically (see the end-to-end test for an example).
To use other scorer types, instantiate a custom pipeline.
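For orientation, here is a rough sketch of what wiring a custom pipeline might look like. This is a sketch only: the `Pipeline` and `process_wrapper` helpers live in `delphi.pipeline`, but the explainer and scorer class names, their constructor arguments, and the `client` object below are assumptions, so check the package (and the end-to-end test) for the exact imports and signatures.

```python
import asyncio

from delphi.pipeline import Pipeline, process_wrapper

# Assumed class names: consult delphi.explainers and delphi.scorers for
# the explainers and scorers shipped with your version of the library.
from delphi.explainers import DefaultExplainer
from delphi.scorers import DetectionScorer

explainer = DefaultExplainer(client)  # `client` wraps your local or API model
scorer = DetectionScorer(client)

# A pipeline chains a feature loader (built in the sections below) through
# explanation and scoring stages; process_wrapper attaches optional
# pre-/post-processing hooks around each stage.
pipeline = Pipeline(
    loader,
    process_wrapper(explainer),
    process_wrapper(scorer),
)
asyncio.run(pipeline.run())
```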
The first step in generating explanations is to cache sparse model activations. To do so, load your sparse models into the base model, load the tokens you want to cache the activations from, create a `FeatureCache` object, and run it. We recommend caching over at least 10M tokens.
```python
from datasets import load_dataset
from sparsify.data import chunk_and_tokenize

from delphi.features import FeatureCache

data = load_dataset("EleutherAI/rpj-v2-sample", split="train[:1%]")
tokens = chunk_and_tokenize(
    data, tokenizer, max_seq_len=256, text_key="raw_content"
)["input_ids"]

# `model` is the base model with the sparse models loaded into it, and
# `submodule_dict` maps hookpoints to their sparse modules.
cache = FeatureCache(
    model,
    submodule_dict,
    batch_size=8,
)
cache.run(n_tokens=10_000_000, tokens=tokens)
```
Caching saves `.safetensors` files containing dictionaries with `activations` and `locations` keys.
```python
cache.save_splits(
    n_splits=5,
    save_dir="raw_latents",
)
```
Safetensors are split into shards over the width of the autoencoder.
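To sanity-check a cached shard, you can load it directly with the `safetensors` library. The shard path below is an assumption for illustration; the actual filenames in `save_dir` depend on the hookpoint and the range of latents covered by each split.

```python
from safetensors.numpy import load_file

# Hypothetical shard path; check save_dir for the names actually written.
shard = load_file("raw_latents/layers.5/0_26213.safetensors")
print(shard["activations"].shape)  # nonzero activation values
print(shard["locations"].shape)    # indices recording where each value fired
```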
The `.features` module provides utilities for reconstructing and sampling various statistics for sparse features. You need to specify the width of the autoencoder, the minimum number of examples for a feature to be included, the maximum number of examples to include, and the number of splits the features were divided into.
```python
from delphi.features import FeatureLoader, FeatureDataset
from delphi.config import FeatureConfig

feature_cfg = FeatureConfig(
    width=131072,        # Width of the autoencoder
    min_examples=200,    # Minimum number of examples for a feature to be included
    max_examples=10000,  # Maximum number of examples to include per feature
    n_splits=5,          # Number of splits the features were divided into
)

dataset = FeatureDataset(
    raw_dir="feature_folder",
    modules=[".model.layer.0"],  # This is a list of the different caches to load from
    cfg=feature_cfg,
)
```
The feature dataset constructs lazily loaded buffers that read activations into memory when called as an iterator object. You can iterate through the dataset using the `FeatureLoader` object, which takes the feature dataset, a constructor, and a sampler.
```python
loader = FeatureLoader(
    dataset=dataset,
    constructor=constructor,
    sampler=sampler,
)
```
We provide a simple sampler and constructor that take arguments from the `ExperimentConfig` object. The constructor builds the context windows from the cached activations and tokens, and the sampler divides these contexts into a training set used to generate explanations and a testing set used to evaluate them.
```python
from functools import partial

from delphi.features.constructors import default_constructor
from delphi.features.samplers import sample
from delphi.config import ExperimentConfig

experiment_cfg = ExperimentConfig(
    n_examples_train=40,     # Number of examples shown to the explainer model
    n_examples_test=100,     # Number of examples shown to the scorer models
    n_quantiles=10,          # Number of quantiles to divide the data into
    example_ctx_len=32,      # Length of each example
    n_non_activating=100,    # Number of non-activating examples shown to the scorer model
    train_type="quantiles",  # Type of sampler to use for training
    test_type="even",        # Type of sampler to use for testing
)

constructor = partial(
    default_constructor,
    tokens=dataset.tokens,
    n_not_active=experiment_cfg.n_non_activating,
    ctx_len=experiment_cfg.example_ctx_len,
    max_examples=feature_cfg.max_examples,
)
sampler = partial(sample, cfg=experiment_cfg)
```
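With the constructor and sampler in place, the loader can be iterated directly. As a rough sketch, each item is a feature record carrying the sampled examples for one feature; the attribute names below are assumptions, so inspect a record to see the actual fields.

```python
# Hypothetical field names: print a record to confirm its actual attributes.
for record in loader:
    print(record.feature, len(record.train), len(record.test))
    break
```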