{"title":"Evaluating Open-Source Sparse Autoencoders on Disentangling Factual Knowledge in GPT-2 Small","authors":"Maheep Chaudhary, Atticus Geiger","doi":"arxiv-2409.04478","DOIUrl":null,"url":null,"abstract":"A popular new method in mechanistic interpretability is to train\nhigh-dimensional sparse autoencoders (SAEs) on neuron activations and use SAE\nfeatures as the atomic units of analysis. However, the body of evidence on\nwhether SAE feature spaces are useful for causal analysis is underdeveloped. In\nthis work, we use the RAVEL benchmark to evaluate whether SAEs trained on\nhidden representations of GPT-2 small have sets of features that separately\nmediate knowledge of which country a city is in and which continent it is in.\nWe evaluate four open-source SAEs for GPT-2 small against each other, with\nneurons serving as a baseline, and linear features learned via distributed\nalignment search (DAS) serving as a skyline. For each, we learn a binary mask\nto select features that will be patched to change the country of a city without\nchanging the continent, or vice versa. Our results show that SAEs struggle to\nreach the neuron baseline, and none come close to the DAS skyline. We release\ncode here: https://github.com/MaheepChaudhary/SAE-Ravel","PeriodicalId":501347,"journal":{"name":"arXiv - CS - Neural and Evolutionary Computing","volume":"51 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Neural and Evolutionary Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.04478","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
A popular new method in mechanistic interpretability is to train high-dimensional sparse autoencoders (SAEs) on neuron activations and use SAE features as the atomic units of analysis. However, the body of evidence on whether SAE feature spaces are useful for causal analysis is underdeveloped. In this work, we use the RAVEL benchmark to evaluate whether SAEs trained on hidden representations of GPT-2 small have sets of features that separately mediate knowledge of which country a city is in and which continent it is in. We evaluate four open-source SAEs for GPT-2 small against each other, with neurons serving as a baseline, and linear features learned via distributed alignment search (DAS) serving as a skyline. For each, we learn a binary mask to select features that will be patched to change the country of a city without changing the continent, or vice versa. Our results show that SAEs struggle to reach the neuron baseline, and none come close to the DAS skyline. We release code here: https://github.com/MaheepChaudhary/SAE-Ravel
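To make the masking step concrete, here is a minimal sketch of a learnable, approximately binary mask over SAE feature activations that patches selected features from a "source" prompt into a "base" prompt. This is an illustration only, not the authors' released implementation: the class name, the sigmoid-with-temperature relaxation, and all parameters are assumptions for exposition.

```python
import torch

class MaskedFeaturePatch(torch.nn.Module):
    """Hypothetical sketch of a binary-mask interchange intervention.

    A learned mask selects which SAE features are copied from a source
    activation (e.g., a prompt about a city in a different country) into
    a base activation; unselected features are left unchanged.
    """

    def __init__(self, num_features: int, temperature: float = 1e-2):
        super().__init__()
        # Continuous logits relaxed to (0, 1) via a low-temperature
        # sigmoid, so the mask is approximately binary after training.
        self.mask_logits = torch.nn.Parameter(torch.zeros(num_features))
        self.temperature = temperature

    def forward(self, base_feats: torch.Tensor,
                source_feats: torch.Tensor) -> torch.Tensor:
        mask = torch.sigmoid(self.mask_logits / self.temperature)
        # Masked features come from the source; the rest stay as in base.
        return mask * source_feats + (1 - mask) * base_feats
```

In the setup the abstract describes, the patched feature vector would be decoded back into the model's hidden representation, and one plausible training objective would reward the model for reporting the source city's country while leaving its continent prediction unchanged (or vice versa), with some pressure toward a sparse mask.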