Steering Hallucinations in Language Models: Extending the Linear Representation Hypothesis with Sparse Autoencoders and Transcoders
This repository holds code for extracting the top SAE features associated with hallucination on the LAMBADA dataset, a benchmark of cloze-style ('fill-in-the-blank') questions. These features are then steered to increase or decrease hallucination in the model's responses to LAMBADA prompts (a minimal illustrative sketch of this kind of feature steering appears at the end of this README). To test whether the findings generalize to other task sets, the steered model is also run on two further datasets, WritingPrompts and PIQA. Transcoder circuits are additionally used to aid feature interpretation, since the SAE-based analysis suggests that hallucination-causing features may capture not the semantic intricacies of the LLM's representation space but rather its underlying structure.

The notebooks are organized as follows:

- SAE_Hallucination_Interpretability.ipynb -- the main SAE-based steering and analysis.
- Transcoders_Circuit_Visualisation.ipynb -- transcoder circuits and attribution-graph-based analysis.
- Steered_Hallucination_Output.ipynb -- analysis of how steering influences the human-readable outputs and structural metadata for input LAMBADA prompts.

See the included paper for an in-depth overview and discussion of this project and its results.
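For readers unfamiliar with SAE-feature steering, the snippet below is a minimal, hypothetical sketch of the general idea, not the repository's exact implementation (see SAE_Hallucination_Interpretability.ipynb for that). It assumes a GPT-2-style Hugging Face model; the layer index, steering coefficient, and the feature direction (in practice a row of a trained SAE's decoder matrix, faked here with a random unit vector) are placeholders.

```python
# Hypothetical sketch: steer a chosen SAE feature by adding a scaled copy of its
# decoder direction to one layer's residual stream via a forward hook, then generate.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"   # assumption: any causal LM with an accessible residual stream
LAYER_IDX = 8         # assumption: layer the SAE was trained on
STEER_COEFF = 5.0     # positive -> amplify the feature, negative -> suppress it

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

# Placeholder for the hallucination-related feature's decoder direction.
# In practice this comes from the trained SAE's W_dec, normalized to unit length.
d_model = model.config.hidden_size
feature_direction = torch.randn(d_model)
feature_direction = feature_direction / feature_direction.norm()

def steering_hook(module, inputs, output):
    # HF GPT-2 blocks return a tuple whose first element is the hidden states.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + STEER_COEFF * feature_direction.to(hidden.dtype)
    if isinstance(output, tuple):
        return (hidden,) + output[1:]
    return hidden

handle = model.transformer.h[LAYER_IDX].register_forward_hook(steering_hook)

prompt = "She opened the door and could not believe what she"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(tokenizer.decode(out[0], skip_special_tokens=True))

handle.remove()  # remove the hook to restore the unsteered model
```

Sweeping STEER_COEFF over positive and negative values is one way to check whether amplifying or suppressing a feature changes how often the model hallucinates a completion.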