Research themes
My research focuses on understanding deep learning from an empirical and scientific perspective, aiming to derive actionable insights that can improve its practical application. Major themes include:
model interpretability via representation analysis
Deep learning works by transforming inputs into latent representations. Can we understand the information encoded in these representations?
- Bhalla*, Oesterling*, Srinivas, Calmon & Lakkaraju (2024): Representations of vision-language models can be decomposed into sparse linear combinations of semantic concept vectors, enabling dataset analysis and model editing
- Bhalla, Srinivas, Ghandeharioun & Lakkaraju (2024): We unify four representation-analysis methods under a common “encoder-decoder” framework and evaluate them with control-based metrics
- Li, Srinivas, Bhalla & Lakkaraju (2026): Concept representations in sparse autoencoders are fragile — minimal input perturbations can substantially distort SAE-based interpretations without affecting the underlying model
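To make the sparse-decomposition idea concrete, here is a minimal numpy sketch that recovers sparse concept coefficients for a single representation via ISTA (iterative soft-thresholding). The dictionary, names, and hyperparameters are illustrative, not the actual method from the paper:

```python
import numpy as np

def sparse_concept_decomposition(z, concepts, lam=0.05, lr=0.01, steps=500):
    """Approximate z as a sparse linear combination of concept vectors.

    Solves min_c 0.5 * ||z - concepts.T @ c||^2 + lam * ||c||_1 via ISTA.
    `concepts` has shape (n_concepts, dim); returns coefficients c.
    """
    c = np.zeros(concepts.shape[0])
    for _ in range(steps):
        residual = concepts.T @ c - z                 # (dim,)
        grad = concepts @ residual                    # gradient of quadratic term
        c = c - lr * grad                             # gradient step
        c = np.sign(c) * np.maximum(np.abs(c) - lr * lam, 0.0)  # soft-threshold
    return c

# Toy check: a representation built from 2 of 5 random "concept" vectors
rng = np.random.default_rng(0)
concepts = rng.normal(size=(5, 16))
z = 2.0 * concepts[1] + 1.5 * concepts[3]
c = sparse_concept_decomposition(z, concepts)
```

Given a dictionary of labelled concept vectors, the nonzero coefficients indicate which concepts a representation encodes — the property that enables dataset analysis and model editing.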
data-centric machine learning
Model behaviour depends critically on training data. Can we identify datapoints and subsets that influence key model properties?
- Ley, Srinivas, Zhang, Rusak & Lakkaraju (2024): Generalized group data attribution extends arbitrary attribution algorithms to groups of training points, yielding ~50x speedups
- Bordt, Srinivas, Boreiko & von Luxburg (2024): Benchmark contamination in LLM pre-training does not automatically inflate scores — data seen early is often simply forgotten by end of training
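The group-attribution idea can be sketched with a TracIn-style gradient-similarity score: instead of scoring every training point against a test gradient, sum gradients within each group and score the group once — the source of the speedup when groups are far fewer than points. A minimal numpy sketch (all names illustrative, not the paper's implementation):

```python
import numpy as np

def group_attribution(train_grads, group_ids, test_grad):
    """Gradient-similarity attribution at the group level.

    By linearity, the dot product of a group's summed gradient with the
    test gradient equals the sum of per-point scores, but costs one
    score per group instead of one per training point.
    """
    scores = {}
    for g in np.unique(group_ids):
        group_grad = train_grads[group_ids == g].sum(axis=0)
        scores[g] = float(group_grad @ test_grad)
    return scores

rng = np.random.default_rng(1)
train_grads = rng.normal(size=(100, 8))         # per-point loss gradients
group_ids = np.repeat(np.arange(10), 10)        # 10 groups of 10 points
test_grad = rng.normal(size=8)
scores = group_attribution(train_grads, group_ids, test_grad)
```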
model interpretability via feature attribution
How can we identify which input features a model relies on, and verify that attribution methods are faithful to model behaviour?
- Srinivas & Fleuret (2019): ReLU network outputs decompose exactly as a sum of layerwise gradients, yielding a principled saliency method
- Srinivas & Fleuret (2021): Popular saliency methods can be unfaithful — producing attributions independent of the model
- Han, Srinivas & Lakkaraju (2022): Eight popular attribution methods can be unified as local linear approximations of non-linear models
- Srinivas*, Bordt* & Lakkaraju (2023) and Bhalla*, Srinivas* & Lakkaraju (2023): Model robustness is a key factor in producing faithful attributions; we formalize ground-truth attributions via a signal-distractor decomposition
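The exact decomposition from Srinivas & Fleuret (2019) is easy to verify on a tiny network: because ReLU is piecewise linear with no intercept, the output equals the input-gradient term plus the bias-gradient terms. A minimal numpy sketch on a two-layer net (an illustration of the identity, not the paper's full saliency method):

```python
import numpy as np

def relu_net(x, W1, b1, W2, b2):
    """Two-layer ReLU network with scalar output: W2 @ relu(W1 @ x + b1) + b2."""
    return W2 @ np.maximum(W1 @ x + b1, 0.0) + b2

def exact_decomposition(x, W1, b1, W2, b2):
    """Reconstruct f(x) as (df/dx) . x + sum over biases of (df/db) * b.

    The ReLU mask fixes a local linear region, in which the identity
    holds exactly.
    """
    mask = (W1 @ x + b1 > 0).astype(float)    # local ReLU activation pattern
    dfdx = W2 @ (mask[:, None] * W1)          # input gradient df/dx
    dfdb1 = W2 * mask                         # bias gradients df/db1
    return dfdx @ x + dfdb1 @ b1 + b2         # df/db2 = 1

rng = np.random.default_rng(2)
W1, b1 = rng.normal(size=(6, 4)), rng.normal(size=6)
W2, b2 = rng.normal(size=6), rng.normal()
x = rng.normal(size=4)
```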
alternate notions of model robustness
Training adversarially robust models is expensive. Are there alternate robustness definitions that are both meaningful and easier to train for?
- Srinivas*, Matoba*, Lakkaraju & Fleuret (2022): Smoothness (small Hessian) as a robustness proxy — competitive with adversarial training while maintaining accuracy and stable gradients
- Han*, Srinivas* & Lakkaraju (2024): Average-case robustness as an alternative to worst-case adversarial robustness, with efficient estimation methods
- Srinivas*, Bordt* & Lakkaraju (2023): Off-manifold robustness — robustness only to perturbations of irrelevant “distractor” features, characteristic of Bayes-optimal predictors
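Average-case robustness admits a simple Monte-Carlo estimate: the probability that the prediction survives random input noise. A toy sketch with a linear classifier (the paper develops far more efficient analytical estimators; all names here are illustrative):

```python
import numpy as np

def average_case_robustness(predict, x, sigma=0.5, n_samples=2000, rng=None):
    """Estimate P[prediction unchanged] under Gaussian input noise.

    This is an average-case notion, in contrast to worst-case
    adversarial robustness; `predict` returns a class label.
    """
    rng = rng or np.random.default_rng()
    base = predict(x)
    noise = rng.normal(scale=sigma, size=(n_samples,) + x.shape)
    same = sum(predict(x + n) == base for n in noise)
    return same / n_samples

# Toy classifier: sign of the first coordinate
predict = lambda x: int(x[0] > 0)
x = np.array([1.0, 0.0])
p = average_case_robustness(predict, x, rng=np.random.default_rng(3))
```

For this example the true value is the Gaussian tail probability that the noise pushes the first coordinate below zero, so the estimate should be close to 0.98.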
computational efficiency of deep models
How can we eliminate redundant neurons or weights in a pre-trained model while preserving performance?
- Srinivas, Kuzmin, Nagel, van Baalen, Skliar & Blankevoort (2022): Cyclical sparsity schedules that alternate between pruning and unpruning yield state-of-the-art sparse networks
- Srinivas & Babu (2015): Duplicate neurons in pre-trained models can be identified and removed without accuracy loss via a simple “surgery” step
- Srinivas & Babu (2016) and Srinivas, Subramanya & Babu (2017): Trainable binary gates on neurons/weights enable automatic pruning during training
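The “surgery” idea from Srinivas & Babu (2015) can be sketched directly: two hidden neurons with identical incoming weights and biases compute identical activations, so one can be removed after folding its outgoing weights into the survivor's. A minimal numpy sketch covering only the exact-duplicate case (the paper handles approximate duplicates):

```python
import numpy as np

def remove_duplicate_neurons(W1, b1, W2, tol=1e-6):
    """Prune hidden neurons whose incoming weights duplicate an earlier one.

    Adding the duplicate's outgoing weights to the survivor's leaves the
    network function unchanged. Returns the pruned (W1, b1, W2).
    """
    keep, merged_into = [], {}
    for i in range(W1.shape[0]):
        dup = next((j for j in keep
                    if np.allclose(W1[i], W1[j], atol=tol)
                    and abs(b1[i] - b1[j]) < tol), None)
        if dup is None:
            keep.append(i)
        else:
            merged_into.setdefault(dup, []).append(i)
    W2 = W2.copy()
    for j, dups in merged_into.items():
        W2[:, j] += W2[:, dups].sum(axis=1)   # fold outgoing weights in
    return W1[keep], b1[keep], W2[:, keep]

rng = np.random.default_rng(4)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W1[2], b1[2] = W1[0], b1[0]                   # neuron 2 duplicates neuron 0
W2 = rng.normal(size=(2, 4))
W1p, b1p, W2p = remove_duplicate_neurons(W1, b1, W2)
x = rng.normal(size=3)
```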
Some other projects I have worked on:
- gradient-based knowledge distillation: Srinivas & Fleuret (2018): Knowledge distilled from teacher to student by matching gradients, improving sample efficiency
- certified robustness to LLM attacks: Kumar, Agarwal, Srinivas, Li, Feizi & Lakkaraju (2024): Certified defense against a class of adversarial LLM attacks via an “erase-and-check” procedure
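The erase-and-check idea can be sketched in a few lines: flag a prompt as harmful if a safety filter flags any variant obtained by erasing up to d tokens. A harmful prompt with at most d adversarial tokens inserted is then guaranteed to be caught, assuming the filter catches the clean harmful prompt. A toy sketch with a stand-in filter (the paper uses a learned safety classifier and several erasure modes):

```python
from itertools import combinations

def erase_and_check(tokens, is_harmful, max_erase=2):
    """Flag `tokens` as harmful if any variant with up to `max_erase`
    tokens erased is flagged by the filter `is_harmful`.

    Erasing the inserted adversarial tokens recovers the clean harmful
    prompt, so detection is certified against such insertions.
    """
    n = len(tokens)
    for k in range(max_erase + 1):
        for idx in combinations(range(n), k):
            variant = [t for i, t in enumerate(tokens) if i not in idx]
            if is_harmful(variant):
                return True
    return False

# Stand-in filter: only recognises the clean harmful prompt ["bad"]
is_harmful = lambda toks: toks == ["bad"]
```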