Citing NDIF

If you use NNsight or NDIF resources in your research, please cite the following:

Citation

Jaden Fiotto-Kaufman, Alexander R. Loftus, Eric Todd, Jannik Brinkmann, Caden Juang, Koyena Pal, Can Rager, Aaron Mueller, Samuel Marks, Arnab Sen Sharma, Francesca Lucchetti, Michael Ripa, Adam Belfki, Nikhil Prakash, Sumeet Multani, Carla Brodley, Arjun Guha, Jonathan Bell, Byron Wallace, and David Bau. "NNsight and NDIF: Democratizing Access to Foundation Model Internals," 2024. arXiv preprint arXiv:2407.14561. Available at https://arxiv.org/abs/2407.14561.

BibTeX

@article{fiotto2024nnsight,
  title={{NNsight} and {NDIF}: Democratizing Access to Foundation Model Internals},
  author={Fiotto-Kaufman, Jaden and Loftus, Alexander R and Todd, Eric and Brinkmann, Jannik and Juang, Caden and Pal, Koyena and Rager, Can and Mueller, Aaron and Marks, Samuel and Sharma, Arnab Sen and Lucchetti, Francesca and Ripa, Michael and Belfki, Adam and Prakash, Nikhil and Multani, Sumeet and Brodley, Carla and Guha, Arjun and Bell, Jonathan and Wallace, Byron and Bau, David},
  journal={arXiv preprint arXiv:2407.14561},
  year={2024}
}

In addition, when you publish work using NNsight or NDIF resources, we'd love for you to email us directly at info@ndif.us and tell us about your work. This helps us track our impact and supports our continued efforts to provide open-source resources for reproducible and transparent research on large-scale AI systems.

Research Using NDIF

Maheep Chaudhary, Atticus Geiger. Evaluating Open-Source Sparse Autoencoders on Disentangling Factual Knowledge in GPT-2 Small. arXiv 2024.
Evaluates the utility of high-dimensional sparse autoencoders (SAEs) for causal analysis in mechanistic interpretability, using the RAVEL benchmark on GPT-2 small. Compares four SAEs to neurons as a baseline and linear features learned via distributed alignment search (DAS) as a skyline. Findings indicate that SAEs struggle to match the neuron baseline and fall significantly short of the DAS skyline in distinguishing between knowledge of a city's country and continent.

Arnab Sen Sharma, David Atkinson, David Bau. Locating and Editing Factual Associations in Mamba. COLM 2024.
Investigates factual recall mechanisms in the Mamba state space model, comparing it to autoregressive transformer models. Finds that key components responsible for factual recall are localized in middle layers and at specific token positions, mirroring patterns seen in transformers. Demonstrates that rank-one model editing can insert facts at particular locations and adapts attention-knockout techniques to analyze information flow. Despite architectural differences, the study concludes that Mamba and transformer models share significant similarities in factual recall processes.

Matteo Bortoletto, Constantin Ruhdorfer, Lei Shi, Andreas Bulling. Benchmarking Mental State Representations in Language Models. arXiv 2024.
Conducts a benchmark study on the internal representation of mental states in language models, analyzing different model sizes, fine-tuning strategies, and prompt designs. Finds that the quality of belief representations improves with model size and fine-tuning but is sensitive to prompt variations. Extends previous activation editing experiments, showing that reasoning performance can be improved by steering model activations without training probes. First to investigate the impact of prompt variations on probing performance in Theory of Mind tasks.

Sheridan Feucht, David Atkinson, Byron Wallace, David Bau. Token Erasure as a Footprint of Implicit Vocabulary Items in LLMs. arXiv 2024.
Investigates how LLMs transform arbitrary groups of tokens into higher-level representations, focusing on multi-token words and named entities. Identifies a pronounced "erasure" effect where information about previous tokens is quickly forgotten in early layers. Proposes a method to probe the implicit vocabulary of LLMs by analyzing token representation changes across layers, providing results for Llama-2-7b and Llama-3-8B. This study represents the first effort to explore the implicit vocabulary of LLMs.

Clément Dumas, Veniamin Veselovsky, Giovanni Monea, Robert West, Chris Wendler. How do Llamas process multilingual text? A latent exploration through activation patching. ICML 2024 Workshop on Mechanistic Interpretability.
Analyzes Llama-2's forward pass during word translation tasks to explore whether it develops language-agnostic concept representations. Shows that language encoding occurs earlier than concept encoding and that activation patching can independently alter either the concept or the language. Demonstrates that averaging latents across languages does not hinder translation performance, providing evidence for universal concept representation in multilingual models.

Erik Jenner, Shreyas Kapur, Vasil Georgiev, Cameron Allen, Scott Emmons, Stuart Russell. Evidence of Learned Look-Ahead in a Chess-Playing Neural Network. arXiv 2024.
Presents evidence of learned look-ahead in the policy network of Leela Chess Zero, showing that it internally represents future optimal moves that are critical in certain board states. Demonstrates this through analyses of activations and attention heads and with a probing model, providing a basis for understanding learned algorithmic capabilities in neural networks.

Wentao Zhu, Zhining Zhang, Yizhou Wang. Language Models Represent Beliefs of Self and Others. ICML 2024.
Investigates the presence of Theory of Mind (ToM) abilities in large language models, identifying internal representations of self and others' beliefs through neural activations. Shows that manipulating these representations significantly alters ToM performance, highlighting their importance in social reasoning. Extends findings to various social reasoning tasks involving causal inference.

Research Referencing NDIF

Aaron Mueller, Jannik Brinkmann, Millicent Li, Samuel Marks, Koyena Pal, Nikhil Prakash, Can Rager, Aruna Sankaranarayanan, Arnab Sen Sharma, Jiuding Sun, Eric Todd, David Bau, Yonatan Belinkov. The Quest for the Right Mediator: A History, Survey, and Theoretical Grounding of Causal Interpretability. arXiv 2024.

Daniel D. Johnson. Penzai + Treescope: A Toolkit for Interpreting, Visualizing, and Editing Models As Data. arXiv 2024.

Florian Dietz, Sophie Fellenz, Dietrich Klakow, Marius Kloft. Comgra: A Tool for Analyzing and Debugging Neural Networks. arXiv 2024.

Stephen Casper, Carson Ezell, Charlotte Siegmann, Noam Kolt, Taylor Lynn Curtis, Benjamin Bucknall, Andreas Haupt, Kevin Wei, Jérémy Scheurer, Marius Hobbhahn, Lee Sharkey, Satyapriya Krishna, Marvin Von Hagen, Silas Alberti, Alan Chan, Qinyi Sun, Michael Gerovitch, David Bau, Max Tegmark, David Krueger, Dylan Hadfield-Menell. Black-Box Access is Insufficient for Rigorous AI Audits. ACM FAccT 2024.

Javier Ferrando, Gabriele Sarti, Arianna Bisazza, Marta R. Costa-jussà. A Primer on the Inner Workings of Transformer-based Language Models. arXiv 2024.

Aaditya K. Singh, Ted Moskovitz, Felix Hill, Stephanie C.Y. Chan, Andrew M. Saxe. What needs to go right for an induction head? A mechanistic study of in-context learning circuits and their formation. arXiv 2024.

Zhengxuan Wu, Atticus Geiger, Aryaman Arora, Jing Huang, Zheng Wang, Noah D. Goodman, Christopher D. Manning, Christopher Potts. Pyvene: A Library for Understanding and Improving PyTorch Models via Interventions. arXiv 2024.