Computer Science > Computation and Language

arXiv:2311.04354 (cs)

[Submitted on 7 Nov 2023 (v1), last revised 12 Feb 2025 (this version, v3)]

Title: Uncovering Intermediate Variables in Transformers using Circuit Probing

Authors: Michael A. Lepori, Thomas Serre, Ellie Pavlick

Abstract: Neural network models have achieved high performance on a wide variety of complex tasks, but the algorithms that they implement are notoriously difficult to interpret. It is often necessary to hypothesize intermediate variables involved in a network's computation in order to understand these algorithms. For example, does a language model depend on particular syntactic properties when generating a sentence? Yet, existing analysis tools make it difficult to test hypotheses of this type. We propose a new analysis technique - circuit probing - that automatically uncovers low-level circuits that compute hypothesized intermediate variables. This enables causal analysis through targeted ablation at the level of model parameters. We apply this method to models trained on simple arithmetic tasks, demonstrating its effectiveness at (1) deciphering the algorithms that models have learned, (2) revealing modular structure within a model, and (3) tracking the development of circuits over training. Across these three experiments we demonstrate that circuit probing combines and extends the capabilities of existing methods, providing one unified approach for a variety of analyses. Finally, we demonstrate circuit probing on a real-world use case: uncovering circuits that are responsible for subject-verb agreement and reflexive anaphora in GPT2-Small and Medium.

Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2311.04354 [cs.CL]
	(or arXiv:2311.04354v3 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2311.04354

Submission history

From: Michael Lepori Jr.
[v1] Tue, 7 Nov 2023 21:27:17 UTC (844 KB)
[v2] Fri, 17 Nov 2023 15:15:17 UTC (844 KB)
[v3] Wed, 12 Feb 2025 18:24:34 UTC (915 KB)
