Steering Hallucinations in Language Models: Extending the Linear Representation Hypothesis with Sparse Autoencoders and Transcoders
This repository holds code for extracting the top SAE features associated with hallucination on the LAMBADA dataset, a benchmark of cloze-style ('fill-in-the-blank') questions. These features are then steered to increase or decrease hallucination in the model's responses to LAMBADA prompts (a minimal illustrative sketch of this kind of feature steering appears at the end of this README). To test whether the findings generalize to other task sets, the steered model is also run on two further datasets, WritingPrompts and PIQA. Transcoder circuits are additionally used to aid feature interpretation, since the SAE-based analysis suggests that hallucination-causing features may capture not the semantic intricacies of the LLM's representation space but rather its underlying structure.

The notebooks are organized as follows:

- SAE_Hallucination_Interpretability.ipynb -- the main SAE-based steering and analysis.
- Transcoders_Circuit_Visualisation.ipynb -- transcoder circuits and attribution-graph-based analysis.
- Steered_Hallucination_Output.ipynb -- analysis of how steering influences the human-readable outputs and structural metadata for input LAMBADA prompts.

See the included paper for an in-depth overview and discussion of this project and its results.
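For readers unfamiliar with SAE-feature steering, the snippet below is a minimal, hypothetical sketch of the general idea, not the repository's exact implementation (see SAE_Hallucination_Interpretability.ipynb for that). It assumes a GPT-2-style Hugging Face model; the layer index, steering coefficient, and the feature direction (in practice a row of a trained SAE's decoder matrix, faked here with a random unit vector) are placeholders.

```python
# Hypothetical sketch: steer a chosen SAE feature by adding a scaled copy of its
# decoder direction to one layer's residual stream via a forward hook, then generate.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"   # assumption: any causal LM with an accessible residual stream
LAYER_IDX = 8         # assumption: layer the SAE was trained on
STEER_COEFF = 5.0     # positive -> amplify the feature, negative -> suppress it

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

# Placeholder for the hallucination-related feature's decoder direction.
# In practice this comes from the trained SAE's W_dec, normalized to unit length.
d_model = model.config.hidden_size
feature_direction = torch.randn(d_model)
feature_direction = feature_direction / feature_direction.norm()

def steering_hook(module, inputs, output):
    # HF GPT-2 blocks return a tuple whose first element is the hidden states.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + STEER_COEFF * feature_direction.to(hidden.dtype)
    if isinstance(output, tuple):
        return (hidden,) + output[1:]
    return hidden

handle = model.transformer.h[LAYER_IDX].register_forward_hook(steering_hook)

prompt = "She opened the door and could not believe what she"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(tokenizer.decode(out[0], skip_special_tokens=True))

handle.remove()  # remove the hook to restore the unsteered model
```

Sweeping STEER_COEFF over positive and negative values is one way to check whether amplifying or suppressing a feature changes how often the model hallucinates a completion.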