Title: Neurons, Directions, and Hulls: Rethinking the Geometry of Concepts in Large Language Models
Abstract: How is a “concept” represented in a large language model, geometrically speaking? Answers to this question should determine how we investigate what language models know and how they learn. In this talk, I will begin by surveying common units of analysis in interpretability—neurons, attention heads, and learned directions—and asking which of these constitutes the right mediator for causal analysis. I will then summarize two case studies empirically demonstrating how learned directions (a popular mediator in recent work) have enabled a more precise and predictive science of language model behavior. First, sparse features learned by autoencoders enable more selective and robust interventions on model behavior. Second, these features allow us to track how grammatical concepts emerge and consolidate throughout training. However, when we rigorously evaluate feature representations in multi-concept settings, they fail independence and sparsity tests: steering one concept frequently affects unrelated concepts, and single directions prove insufficient for capturing even simple concepts. Are directions, then, really the right unit of analysis? I will conclude by summarizing recent arguments for an alternative geometric picture—concepts as convex hulls spanned by archetypal exemplars—and discuss implications for the next generation of interpretability research.
Bio: Aaron Mueller is an assistant professor of Computer Science and, by courtesy, of Data Science at Boston University. His research centers on developing language modeling methods and evaluations inspired by causal and linguistic principles, and on applying these to precisely control and improve the generalization of computational models of language. He completed a Ph.D. at Johns Hopkins University and was a Zuckerman postdoctoral fellow at Northeastern University and the Technion. His work has been published in ML and NLP venues (such as ICML, ACL, and EMNLP) and has won awards from TMLR and ACL. He is a recurring organizer of the BlackboxNLP and BabyLM workshops, and his work has recently been featured in IEEE Spectrum (2024) and MIT Technology Review (2025).