Trustworthy-ML-Lab

21 repositories

ThinkEdit
Public
An effective weight-editing method for mitigating overly short reasoning in LLMs, and a mechanistic study uncovering how reasoning length is encoded in the model’s representation space.
deep-learning interpretable-machine-learning large-language-models generative-ai mechanistic-interpretability reasoning-language-models
Python
•1•10•0•0•Updated May 6, 2025
Linear-Explanations
Public
[ICML 24] A novel automated neuron explanation framework that can accurately describe polysemantic concepts in deep neural networks
computer-vision deep-learning interpretable-machine-learning mechanistic-interpretability
Jupyter Notebook
•0•11•0•0•Updated May 2, 2025
posthoc-generative-cbm
Public
[CVPR 2025] Concept Bottleneck Autoencoder (CB-AE): efficiently transforms any pretrained (black-box) image generative model into an interpretable generative concept bottleneck model (CBM) with minimal concept supervision, while preserving image quality
computer-vision deep-learning interpretable-deep-learning concept-bottleneck-models interpretability-and-explainability generative-ai mechanistic-interpretability
Jupyter Notebook
•1•7•1•0•Updated Apr 16, 2025
effective_skill_unlearning
Public
[NAACL 25] Two novel, lightweight, and training-free skill unlearning methods for LLMs
natural-language-processing deep-learning interpretability large-language-model
Python
•0•3•0•0•Updated Mar 27, 2025
RAT_MisD
Public
Boosting misclassification detection ability by radius-aware training (RAT)
deep-learning misclassification-detection
Python
•0•0•0•0•Updated Mar 21, 2025
Describe-and-Dissect
Public
[TMLR 25] An automated method for explaining complex neuron behaviors in deep vision models using large language models
deep-neural-networks computer-vision deep-learning explainable-ai interpretable-machine-learning large-language-models generative-ai mechanistic-interpretability
Jupyter Notebook
•1•10•1•0•Updated Feb 20, 2025
CB-LLMs
Public
[ICLR 25] A novel framework for building intrinsically interpretable LLMs with human-understandable concepts to ensure safety, reliability, transparency, and trustworthiness.
natural-language-processing deep-learning interpretable-deep-learning explainable-ai large-language-models mechanistic-interpretability
Python
•2•12•0•0•Updated Feb 13, 2025
Concept-Bottleneck-LLM
Public
Python
•0•4•0•0•Updated Feb 1, 2025
provable-efficient-dataset-distill-KRR
Public
Python
•
Apache License 2.0
•0•1•0•0•Updated Dec 10, 2024
VLG-CBM
Public
[NeurIPS 24] A new training and evaluation framework for learning interpretable deep vision models and benchmarking different interpretable concept-bottleneck-models (CBMs)
deep-neural-networks computer-vision deep-learning explainable-ai interpretable-machine-learning concept-bottleneck-models large-language-models
Jupyter Notebook
•0•16•1•0•Updated Dec 7, 2024
Interpretability-Guided-Defense
Public
[ECCV 24] A new and low-cost test-time defense for DNNs based on neuron-level-interpretability methods
computer-vision deep-learning interpretability robustness adversarial-machine-learning adversarial-examples
Python
•0•4•0•0•Updated Oct 1, 2024
Audio_Network_Dissection
Public
[ICML 24] AND: the first framework to provide automatic natural language explanations for deep acoustic networks
deep-neural-networks deep-learning interpretable-machine-learning mechanistic-interpretability
Jupyter Notebook
•0•2•0•0•Updated Sep 29, 2024
DSC-210-NLA-FA22
Public
Jupyter Notebook
•0•0•0•0•Updated Sep 23, 2024
concept-driven-continual-learning
Public
official code repo
Jupyter Notebook
•1•0•0•0•Updated Sep 10, 2024
Robust_HighUtil_Smoothed_DRL
Public
[ICML 24] S-DQN and S-PPO: Robust smoothed deep RL agents without sacrificing performance
deep-learning deep-reinforcement-learning robustness adversarial-machine-learning robust-machine-learning robust-learning randomized-smoothing
Python
•0•5•0•0•Updated Jun 27, 2024
NN-LPK
Public
Python
•0•2•0•0•Updated Jun 14, 2024
Provably-Robust-Conformal-Prediction
Public
[ICLR 24] This work proposes RSCP+ to provide a robustness guarantee in evaluation, and two novel methods, PTT and RCT, to robustify conformal prediction with improved efficiency through post-hoc transformation and training.
deep-neural-networks deep-learning robustness adversarial-machine-learning robust-machine-learning
Python
•1•4•0•0•Updated Apr 3, 2024
Label-free-CBM
Public
[ICLR 23] A new framework to transform any neural network into an interpretable concept-bottleneck-model (CBM) without needing labeled concept data
deep-neural-networks computer-vision deep-learning interpretability interpretable-deep-learning
Jupyter Notebook
•20•98•1•0•Updated Mar 31, 2024
Efficient-LLM-automated-interpretability
Public
[NeurIPS'23 ATTRIB] An efficient framework to generate neuron explanations for LLMs
deep-learning interpretability explainable-ai large-language-models mechanistic-interpretability
Python
•1•5•0•0•Updated Dec 23, 2023
CLIP-dissect
Public
[ICLR 23 spotlight] An automatic and efficient tool to describe functionalities of individual neurons in DNNs
deep-neural-networks computer-vision deep-learning interpretable-deep-learning explainable-ai interpretable-machine-learning mechanistic-interpretability
Jupyter Notebook
•15•49•0•0•Updated Nov 6, 2023
corrupting_neuron_explanations
Public
[ICCV 23] Evaluating robustness of neuron explanation methods
deep-neural-networks computer-vision deep-learning robustness interpretable-machine-learning robust-machine-learning mechanistic-interpretability
Jupyter Notebook
•
MIT License
•1•4•0•0•Updated Sep 30, 2023