phd thesis
Gradient-based Methods for Deep Model Interpretability
École Polytechnique Fédérale de Lausanne (EPFL)
2021 · EPFL thesis distinction award (top 8%, EE dept.)
For more information, please see my Google Scholar page. Representative papers are highlighted below.
conference papers
Evaluating Adversarial Robustness of Concept Representations in Sparse Autoencoders
A.J. Li,
S. Srinivas,
U. Bhalla,
H. Lakkaraju
pdf
· summary
Sparse autoencoder (SAE) concepts lack robustness: minimal input perturbations can substantially distort SAE concept representations without affecting the underlying model's behavior, raising concerns about their reliability.
Certifying LLM Safety against Adversarial Prompting
A. Kumar,
C. Agarwal,
S. Srinivas,
A. Li,
S. Feizi,
H. Lakkaraju
pdf
· code
· summary
We present a simple method to detect adversarial attacks on LLMs by systematically deleting tokens until the underlying string is labelled harmful.
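A minimal sketch of this erase-and-check idea, assuming a black-box safety classifier is_harmful (a hypothetical name) and erasure of trailing tokens; the paper also considers other erasure modes:
```python
def certify_prompt(tokens, is_harmful, max_erase=20):
    """Flag `tokens` as harmful if the full prompt, or any prefix
    obtained by erasing up to `max_erase` trailing tokens, is
    labelled harmful by the safety classifier `is_harmful`."""
    if is_harmful(tokens):
        return True
    for k in range(1, min(max_erase, len(tokens) - 1) + 1):
        # Erase the last k tokens and re-check the remaining prefix;
        # this catches adversarial suffixes appended to a harmful prompt.
        if is_harmful(tokens[:-k]):
            return True
    return False
```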
Which Models have Perceptually-Aligned Gradients? An Explanation via Off-Manifold Robustness
S. Srinivas*,
S. Bordt*,
H. Lakkaraju
pdf
· code
· summary
Previous work finds gradients of robust models to be "perceptually aligned". We explain this phenomenon by observing that robust models in practice are not robust in all directions; they are mostly robust only outside the data manifold. This off-manifold robustness forces their gradients to align with the data manifold, which makes them perceptually aligned.
Spotlight presentation (Top 3%)
Discriminative Feature Attributions: A Bridge between Post Hoc Explainability and Inherent Interpretability
U. Bhalla*,
S. Srinivas*,
H. Lakkaraju
pdf
· code
· summary
Given a pre-trained model, adapt it to be robust to the perturbations introduced by feature attribution methods. Doing so results in models that recover ground-truth attributions!
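A hypothetical fragment of such an adaptation step, assuming random masking as the attribution-style perturbation (the function name and loss form are illustrative, not the paper's):
```python
import torch
import torch.nn.functional as F

def robustify_step(model, x, opt, mask_frac=0.5):
    # Illustrative sketch: encourage predictions on randomly masked
    # inputs to match predictions on the clean input, so that
    # attribution-style perturbations leave model behavior unchanged.
    with torch.no_grad():
        target = F.softmax(model(x), dim=-1)
    mask = (torch.rand_like(x) > mask_frac).float()
    log_pred = F.log_softmax(model(x * mask), dim=-1)
    loss = F.kl_div(log_pred, target, reduction="batchmean")
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```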
On Minimizing the Impact of Dataset Shifts on Actionable Explanations
A. Meyer*,
D. Ley*,
S. Srinivas,
H. Lakkaraju
pdf
· summary
How can we train classifiers so that they are unaffected by small shifts in the dataset? We show theoretically and experimentally that weight decay, model curvature, and robustness are all important factors that help minimize the impact of such dataset shifts.
Oral presentation (Top 5%)
Which Explanation Should I Choose? A Function Approximation Perspective to Characterizing Post hoc Explanations
T. Han,
S. Srinivas,
H. Lakkaraju
pdf
· code
· workshop
· summary
Several popular post-hoc explanations, such as LIME, SHAP, and gradient-based explanations, can be viewed as performing local function approximation (LFA; see the sketch below). Treating LFA as a framework for explanations enables us to make useful statements, such as a no-free-lunch theorem, and to identify which explanations to use.
Best paper award at ICML "Interpretable ML for Healthcare" workshop, 2022
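To make the LFA view concrete, here is a minimal generic sketch (not the paper's code) that fits a local linear surrogate to a black-box function f under Gaussian perturbations; different perturbation distributions and loss functions recover different explanation methods under this view:
```python
import numpy as np

def local_linear_explanation(f, x, sigma=0.1, n_samples=500, seed=0):
    """Fit y ~ w @ (x' - x) + b on perturbations x' ~ N(x, sigma^2 I);
    the weights w serve as the local explanation of f at x."""
    rng = np.random.default_rng(seed)
    X = x + sigma * rng.standard_normal((n_samples, x.size))
    y = np.array([f(xi) for xi in X])
    A = np.hstack([X - x, np.ones((n_samples, 1))])  # features + intercept
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[:-1]  # w approximates the local behavior of f at x
```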
Efficiently Training Low-Curvature Neural Networks
S. Srinivas*,
K. Matoba*,
H. Lakkaraju,
F. Fleuret
pdf
· slides
· poster
· code
· summary
We train low-curvature neural networks that are "as linear as possible" by (1) replacing ReLU with a variant of softplus, (2) applying spectral normalization to linear layers, and (3) (optionally) using gradient-norm regularization, minimizing the curvature and spectral norm of each layer independently. This approach rivals adversarial training without ever training on adversarial examples.
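A rough PyTorch sketch of ingredients (1) and (2); the exact activation and normalization details in the paper may differ:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CenteredSoftplus(nn.Module):
    """Softplus variant: `beta` trades curvature against fidelity to
    ReLU (large beta -> ReLU-like, small beta -> lower curvature);
    centering keeps the activation zero at zero."""
    def __init__(self, beta=10.0):
        super().__init__()
        self.beta = beta
    def forward(self, x):
        return F.softplus(x, beta=self.beta) - F.softplus(
            torch.zeros_like(x), beta=self.beta)

def low_curvature_mlp(d_in, d_hidden, d_out):
    # Spectral normalization bounds each linear layer's spectral norm,
    # controlling the network's Lipschitz constant layer by layer.
    return nn.Sequential(
        nn.utils.spectral_norm(nn.Linear(d_in, d_hidden)),
        CenteredSoftplus(),
        nn.utils.spectral_norm(nn.Linear(d_hidden, d_out)),
    )
```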
Data-Efficient Structured Pruning via Submodular Optimization
M. El-Halabi,
S. Srinivas,
S. Lacoste-Julien
pdf
· summary
Pruning neurons in neural networks can be cast as a submodular optimization problem, enabling principled algorithms with rigorous theoretical guarantees that perform well when pruning with only a small number of data points.
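As an illustration of the submodular view (a generic greedy sketch, not the paper's algorithm): when the set function scoring kept neurons is monotone submodular, greedy selection inherits the classical (1 - 1/e) approximation guarantee.
```python
import numpy as np

def greedy_select_neurons(quality, n_neurons, k):
    """`quality(S)` scores a set S of kept neurons, e.g. how well they
    reconstruct the layer's output on a handful of data points. If
    `quality` is monotone submodular, greedy selection achieves a
    (1 - 1/e)-approximation to the best size-k set."""
    kept = []
    for _ in range(k):
        candidates = [j for j in range(n_neurons) if j not in kept]
        gains = [quality(kept + [j]) - quality(kept) for j in candidates]
        kept.append(candidates[int(np.argmax(gains))])
    return kept
```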
Rethinking the Role of Gradient-based Attribution Methods for Model Interpretability
S. Srinivas,
F. Fleuret
pdf
· slides
· poster
· code
· workshop
· summary
Commonly used input-gradient saliency maps for explaining discriminative neural nets capture information about an implicit density model, rather than that of the underlying discriminative model they are intended to explain.
Oral presentation (Top 1%)
Full-Gradient Representation for Neural Network Visualization
S. Srinivas,
F. Fleuret
pdf
· poster
· code
· summary
Compute saliency information from all intermediate layers in neural networks, rather than just from the input, as is commonly done (see the sketch below). This provably satisfies two desirable properties (sensitivity and completeness) which typical saliency maps cannot.
190+ GitHub stars · 200+ citations
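A bare-bones sketch of the full-gradient decomposition; the released code aggregates these components with additional post-processing to produce the saliency map:
```python
import torch

def full_gradient_parts(model, x):
    """Collect the input-gradient term and per-bias gradient terms.
    For ReLU networks these parts together satisfy completeness:
    they account for the network output exactly."""
    x = x.clone().requires_grad_(True)
    out = model(x).sum()
    biases = [p for name, p in model.named_parameters() if "bias" in name]
    grads = torch.autograd.grad(out, [x] + biases, allow_unused=True)
    input_part = grads[0] * x                        # input-gradient x input
    bias_parts = [g * b for g, b in zip(grads[1:], biases) if g is not None]
    return input_part, bias_parts
```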
Knowledge Transfer with Jacobian Matching
S. Srinivas,
F. Fleuret
pdf
· slides
· poster
· workshop
· summary
Perform sample-efficient distillation by requiring that the student model mimic the input-gradients of the teacher model (see the sketch below). This is equivalent (in expectation) to performing classical distillation with data augmentation via additive input noise.
Best paper award at NeurIPS "Learning with Limited Data" workshop, 2017
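A minimal sketch of a matching loss in this spirit (the weighting `alpha` and the squared-error form are illustrative):
```python
import torch
import torch.nn.functional as F

def jacobian_matching_loss(student, teacher, x, alpha=1.0):
    """Match the student's outputs and input-gradients to the teacher's.
    `create_graph=True` lets the gradient-matching term itself be
    backpropagated when training the student."""
    x = x.clone().requires_grad_(True)
    s_out, t_out = student(x), teacher(x)
    loss_out = F.mse_loss(s_out, t_out.detach())
    g_s = torch.autograd.grad(s_out.sum(), x, create_graph=True)[0]
    g_t = torch.autograd.grad(t_out.sum(), x)[0]
    return loss_out + alpha * F.mse_loss(g_s, g_t.detach())
```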
Learning Neural Network Architectures using Backpropagation
S. Srinivas,
R.V. Babu
pdf
· poster
· summary
Automatically prune unimportant neurons during neural network training by introducing a multiplicative binary gating variable for each neuron and encouraging the gate variables to be as sparse as possible via regularization.
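A relaxed, illustrative version of the gating idea (the paper uses binary gates; sigmoid gates with an additive penalty stand in for them here):
```python
import torch
import torch.nn as nn

class GatedLinear(nn.Module):
    """Linear layer whose output neurons carry multiplicative gates.
    Penalizing the gate values pushes them toward zero, switching
    unimportant neurons off during training."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.linear = nn.Linear(d_in, d_out)
        self.gate_logits = nn.Parameter(torch.zeros(d_out))
    def forward(self, x):
        return self.linear(x) * torch.sigmoid(self.gate_logits)
    def gate_penalty(self):
        # Added to the training loss, weighted by a sparsity coefficient.
        return torch.sigmoid(self.gate_logits).sum()
```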
A Taxonomy of Deep Convolutional Neural Nets for Computer Vision
S. Srinivas,
R. Sarvadevabhatla,
K.R. Mopuri,
N. Prabhu,
S.S. Kruthiventi,
R.V. Babu
pdf
· summary
A recipe-style survey of pre-2015 deep neural networks as applied to computer vision.
300+ citations · Top 25% of all research outputs scored on Altmetric
workshop papers / tech reports
Towards Unifying Interpretability and Control: Evaluation via Intervention
U. Bhalla,
S. Srinivas,
A. Ghandeharioun,
H. Lakkaraju
pdf
· summary
Popular mechanistic interpretability methods such as sparse autoencoders underperform simpler alternatives such as prompting and the logit lens on the task of controlling model outputs, raising questions about the faithfulness of such methods.
Generalized Group Data Attribution
D. Ley,
S. Srinivas,
S. Zhang,
G. Rusak,
H. Lakkaraju
pdf
· summary
Data attribution methods, such as influence functions, can be made drastically more efficient (10-50x) by attributing to groups rather than individual data points. This can be used for fast dataset pruning and noisy label identification.
All Roads Lead to Rome? Exploring Representational Similarities Between Latent Spaces of Generative Image Models
C. Badrinath,
U. Bhalla,
A. Oesterling,
S. Srinivas,
H. Lakkaraju
pdf
· summary
We find that many generative image models recover approximately similar representations.
Published at the Workshop on Geometry-grounded Representation Learning and Generative Modeling (GRaM)
Word-Level Explanations for Analyzing Bias in Text-to-Image Models
A. Lin,
L.M. Paes,
S.H. Tanneru,
S. Srinivas,
H. Lakkaraju
pdf
· summary
For text-to-image models, we identify which input words contribute to bias in the output images. For example, we find that the word "doctor" in the input leads to an over-representation of males in the output.
Published at the Workshop on Challenges in Deploying Generative AI
Consistent Explanations in the Face of Model Indeterminacy via Ensembling
D. Ley,
L. Tang,
M. Nazari,
H. Lin,
S. Srinivas,
H. Lakkaraju
pdf
· summary
Feature attributions computed over model ensembles are fairly consistent. We find strategies that enable efficient construction of such ensembles.
Published at the Workshop on Interpretable Machine Learning for Healthcare (IMLH)
Cyclical Pruning for Sparse Neural Networks
S. Srinivas,
A. Kuzmin,
M. Nagel,
M. van Baalen,
A. Skliar,
T. Blankevoort
pdf
· slides
· summary
Algorithms for training sparse neural networks should look more like projected gradient descent / iterative hard thresholding, which alternates between sparsification (i.e., a projection step) and densification (i.e., a gradient step), as opposed to common pruning approaches, which do not perform densification (see the sketch below).
Oral presentation at the Workshop on Efficient Computer Vision for Deep Learning (ECV)
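A schematic projection step in this spirit, meant to be called periodically inside an ordinary training loop (schedule details, layer selection, and the sparsity level are placeholders):
```python
import torch

def prune_project(model, sparsity=0.9):
    """Projection step: hard-threshold each weight matrix to keep only
    the largest-magnitude entries. Between projections, plain gradient
    steps let previously pruned weights regrow (densification)."""
    with torch.no_grad():
        for p in model.parameters():
            if p.dim() > 1:                       # skip biases
                k = max(1, int(sparsity * p.numel()))
                thresh = p.abs().flatten().kthvalue(k).values
                p.mul_((p.abs() > thresh).float())
```
In a cyclical schedule, the target sparsity would additionally be ramped up and relaxed over training rather than held fixed.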
Estimating Confidence for Deep Neural Networks through Density Modelling
A. Subramanya,
S. Srinivas,
R.V. Babu
pdf
· slides
· summary
Model the density of intermediate features in a neural network using a high-dimensional Gaussian distribution. If features for a test point fall outside the "typical set" for such a Gaussian, then declare that test point to be out-of-distribution.
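A small sketch of the density-modelling step with a single Gaussian over intermediate features (the decision threshold on the score is a placeholder):
```python
import numpy as np

def fit_feature_gaussian(feats):
    """Fit a Gaussian to in-distribution intermediate features (n x d);
    a small ridge keeps the covariance invertible."""
    mu = feats.mean(axis=0)
    cov = np.cov(feats, rowvar=False) + 1e-6 * np.eye(feats.shape[1])
    return mu, np.linalg.inv(cov)

def ood_score(mu, precision, z):
    # Squared Mahalanobis distance: points far from the Gaussian's
    # typical set get large scores and are declared out-of-distribution.
    d = z - mu
    return float(d @ precision @ d)
```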
Training Sparse Neural Networks
S. Srinivas,
A. Subramanya,
R.V. Babu
pdf
· slides
· summary
Encourage weight sparsity in neural networks by introducing multiplicative binary gating variables along with each weight, and regularizing gates to be sparse.
200+ citations · Oral presentation at Embedded Vision Workshop
Generalized Dropout
S. Srinivas,
R.V. Babu
pdf
· summary
A generalized version of dropout where dropout probabilities are automatically tuned during training. This is done by introducing a multiplicative Bernoulli gating variable for each neuron in the network, and penalizing the Bernoulli probabilities using a beta-distribution prior.
Compensating for Large In-plane Rotations in Natural Images
L. Boominathan,
S. Srinivas,
R.V. Babu
pdf
· poster
· summary
Correct for large in-plane rotation in images by (1) detecting the presence of rotation using a CNN, and (2) correcting it iteratively using Bayesian optimization.