Research themes
My research focuses on understanding deep learning from an empirical and scientific perspective, aiming to derive actionable insights that can improve its practical application. Major themes include:
model interpretability via representation analysis
Deep learning works by transforming inputs into latent representations. Can we understand the information encoded in these representations?
- Bhalla*, Oesterling*, Srinivas, Calmon & Lakkaraju (2024): Representations of vision-language models can be decomposed into sparse linear combinations of semantic concept vectors, enabling dataset analysis and model editing
- Bhalla, Srinivas, Ghandeharioun & Lakkaraju (2024): We unify four representation-analysis methods under a common “encoder-decoder” framework and evaluate them with control-based metrics
- Li, Srinivas, Bhalla & Lakkaraju (2026): Concept representations in sparse autoencoders are fragile — minimal input perturbations can substantially distort SAE-based interpretations without affecting the underlying model
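To make the sparse-decomposition idea concrete, here is a minimal numpy sketch that recovers sparse concept coefficients for a single representation via ISTA (iterative soft-thresholding). The dictionary, names, and hyperparameters are illustrative, not the actual method from the paper:

```python
import numpy as np

def sparse_concept_decomposition(z, concepts, lam=0.05, lr=0.01, steps=500):
    """Approximate z as a sparse linear combination of concept vectors.

    Solves min_c 0.5 * ||z - concepts.T @ c||^2 + lam * ||c||_1 via ISTA.
    `concepts` has shape (n_concepts, dim); returns coefficients c.
    """
    c = np.zeros(concepts.shape[0])
    for _ in range(steps):
        residual = concepts.T @ c - z                 # (dim,)
        grad = concepts @ residual                    # gradient of quadratic term
        c = c - lr * grad                             # gradient step
        c = np.sign(c) * np.maximum(np.abs(c) - lr * lam, 0.0)  # soft-threshold
    return c

# Toy check: a representation built from 2 of 5 random "concept" vectors
rng = np.random.default_rng(0)
concepts = rng.normal(size=(5, 16))
z = 2.0 * concepts[1] + 1.5 * concepts[3]
c = sparse_concept_decomposition(z, concepts)
```

Given a dictionary of labelled concept vectors, the nonzero coefficients indicate which concepts a representation encodes — the property that enables dataset analysis and model editing.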
data-centric machine learning
Model behaviour depends critically on training data. Can we identify datapoints and subsets that influence key model properties?
- Ley, Srinivas, Zhang, Rusak & Lakkaraju (2024): Generalized group data attribution extends arbitrary attribution algorithms to groups of training points, yielding ~50x speedups
- Bordt, Srinivas, Boreiko & von Luxburg (2024): Benchmark contamination in LLM pre-training does not automatically inflate scores — data seen early is often simply forgotten by end of training
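The group-attribution idea can be sketched with a TracIn-style gradient-similarity score: instead of scoring every training point against a test gradient, sum gradients within each group and score the group once — the source of the speedup when groups are far fewer than points. A minimal numpy sketch (all names illustrative, not the paper's implementation):

```python
import numpy as np

def group_attribution(train_grads, group_ids, test_grad):
    """Gradient-similarity attribution at the group level.

    By linearity, the dot product of a group's summed gradient with the
    test gradient equals the sum of per-point scores, but costs one
    score per group instead of one per training point.
    """
    scores = {}
    for g in np.unique(group_ids):
        group_grad = train_grads[group_ids == g].sum(axis=0)
        scores[g] = float(group_grad @ test_grad)
    return scores

rng = np.random.default_rng(1)
train_grads = rng.normal(size=(100, 8))         # per-point loss gradients
group_ids = np.repeat(np.arange(10), 10)        # 10 groups of 10 points
test_grad = rng.normal(size=8)
scores = group_attribution(train_grads, group_ids, test_grad)
```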
model interpretability via feature attribution
How can we identify which input features a model relies on, and verify that attribution methods are faithful to model behaviour?
- Srinivas & Fleuret (2019): ReLU network outputs decompose exactly as a sum of layerwise gradients, yielding a principled saliency method
- Srinivas & Fleuret (2021): Popular saliency methods can be unfaithful — producing attributions independent of the model
- Han, Srinivas & Lakkaraju (2022): Eight popular attribution methods can be unified as local linear approximations of non-linear models
- Srinivas*, Bordt* & Lakkaraju (2023) and Bhalla*, Srinivas* & Lakkaraju (2023): Model robustness is a key factor in producing faithful attributions; we formalize ground-truth attributions via a signal-distractor decomposition
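The exact decomposition from Srinivas & Fleuret (2019) is easy to verify on a tiny network: because ReLU is piecewise linear with no intercept, the output equals the input-gradient term plus the bias-gradient terms. A minimal numpy sketch on a two-layer net (an illustration of the identity, not the paper's full saliency method):

```python
import numpy as np

def relu_net(x, W1, b1, W2, b2):
    """Two-layer ReLU network with scalar output: W2 @ relu(W1 @ x + b1) + b2."""
    return W2 @ np.maximum(W1 @ x + b1, 0.0) + b2

def exact_decomposition(x, W1, b1, W2, b2):
    """Reconstruct f(x) as (df/dx) . x + sum over biases of (df/db) * b.

    The ReLU mask fixes a local linear region, in which the identity
    holds exactly.
    """
    mask = (W1 @ x + b1 > 0).astype(float)    # local ReLU activation pattern
    dfdx = W2 @ (mask[:, None] * W1)          # input gradient df/dx
    dfdb1 = W2 * mask                         # bias gradients df/db1
    return dfdx @ x + dfdb1 @ b1 + b2         # df/db2 = 1

rng = np.random.default_rng(2)
W1, b1 = rng.normal(size=(6, 4)), rng.normal(size=6)
W2, b2 = rng.normal(size=6), rng.normal()
x = rng.normal(size=4)
```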
alternate notions of model robustness
Training adversarially robust models is expensive. Are there alternate robustness definitions that are both meaningful and easier to train for?
- Srinivas*, Matoba*, Lakkaraju & Fleuret (2022): Smoothness (small Hessian) as a robustness proxy — competitive with adversarial training while maintaining accuracy and stable gradients
- Han*, Srinivas* & Lakkaraju (2024): Average-case robustness as an alternative to worst-case adversarial robustness, with efficient estimation methods
- Srinivas*, Bordt* & Lakkaraju (2023): Off-manifold robustness — robustness only to perturbations of irrelevant “distractor” features, characteristic of Bayes-optimal predictors
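Average-case robustness admits a simple Monte-Carlo estimate: the probability that the prediction survives random input noise. A toy sketch with a linear classifier (the paper develops far more efficient analytical estimators; all names here are illustrative):

```python
import numpy as np

def average_case_robustness(predict, x, sigma=0.5, n_samples=2000, rng=None):
    """Estimate P[prediction unchanged] under Gaussian input noise.

    This is an average-case notion, in contrast to worst-case
    adversarial robustness; `predict` returns a class label.
    """
    rng = rng or np.random.default_rng()
    base = predict(x)
    noise = rng.normal(scale=sigma, size=(n_samples,) + x.shape)
    same = sum(predict(x + n) == base for n in noise)
    return same / n_samples

# Toy classifier: sign of the first coordinate
predict = lambda x: int(x[0] > 0)
x = np.array([1.0, 0.0])
p = average_case_robustness(predict, x, rng=np.random.default_rng(3))
```

For this example the true value is the Gaussian tail probability that the noise pushes the first coordinate below zero, so the estimate should be close to 0.98.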
computational efficiency of deep models
How can we eliminate redundant neurons or weights in a pre-trained model while preserving performance?
- Srinivas, Kuzmin, Nagel, van Baalen, Skliar & Blankevoort (2022): Cyclical sparsity schedules that alternate between pruning and unpruning yield state-of-the-art sparse networks
- Srinivas & Babu (2015): Duplicate neurons in pre-trained models can be identified and removed without accuracy loss via a simple “surgery” step
- Srinivas & Babu (2016) and Srinivas, Subramanya & Babu (2017): Trainable binary gates on neurons/weights enable automatic pruning during training
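The “surgery” idea from Srinivas & Babu (2015) can be sketched directly: two hidden neurons with identical incoming weights and biases compute identical activations, so one can be removed after folding its outgoing weights into the survivor's. A minimal numpy sketch covering only the exact-duplicate case (the paper handles approximate duplicates):

```python
import numpy as np

def remove_duplicate_neurons(W1, b1, W2, tol=1e-6):
    """Prune hidden neurons whose incoming weights duplicate an earlier one.

    Adding the duplicate's outgoing weights to the survivor's leaves the
    network function unchanged. Returns the pruned (W1, b1, W2).
    """
    keep, merged_into = [], {}
    for i in range(W1.shape[0]):
        dup = next((j for j in keep
                    if np.allclose(W1[i], W1[j], atol=tol)
                    and abs(b1[i] - b1[j]) < tol), None)
        if dup is None:
            keep.append(i)
        else:
            merged_into.setdefault(dup, []).append(i)
    W2 = W2.copy()
    for j, dups in merged_into.items():
        W2[:, j] += W2[:, dups].sum(axis=1)   # fold outgoing weights in
    return W1[keep], b1[keep], W2[:, keep]

rng = np.random.default_rng(4)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W1[2], b1[2] = W1[0], b1[0]                   # neuron 2 duplicates neuron 0
W2 = rng.normal(size=(2, 4))
W1p, b1p, W2p = remove_duplicate_neurons(W1, b1, W2)
x = rng.normal(size=3)
```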
Some other projects I have worked on:
- gradient-based knowledge distillation: Srinivas & Fleuret (2018): Knowledge distilled from teacher to student by matching gradients, improving sample efficiency
- certified robustness to LLM attacks: Kumar, Agarwal, Srinivas, Li, Feizi & Lakkaraju (2024): Certified defense against a class of adversarial LLM attacks via an “erase-and-check” procedure
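The erase-and-check idea can be sketched in a few lines: flag a prompt as harmful if a safety filter flags any variant obtained by erasing up to d tokens. A harmful prompt with at most d adversarial tokens inserted is then guaranteed to be caught, assuming the filter catches the clean harmful prompt. A toy sketch with a stand-in filter (the paper uses a learned safety classifier and several erasure modes):

```python
from itertools import combinations

def erase_and_check(tokens, is_harmful, max_erase=2):
    """Flag `tokens` as harmful if any variant with up to `max_erase`
    tokens erased is flagged by the filter `is_harmful`.

    Erasing the inserted adversarial tokens recovers the clean harmful
    prompt, so detection is certified against such insertions.
    """
    n = len(tokens)
    for k in range(max_erase + 1):
        for idx in combinations(range(n), k):
            variant = [t for i, t in enumerate(tokens) if i not in idx]
            if is_harmful(variant):
                return True
    return False

# Stand-in filter: only recognises the clean harmful prompt ["bad"]
is_harmful = lambda toks: toks == ["bad"]
```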