π f-INE: Influence Estimation using Hypothesis Testing
Published in ICML, 2025 workshop on DataWorld, 2025
Recommended citation: https://openreview.net/pdf?id=26zj5O7dph
π₯ Co-authors: Sai Praneeth Karimireddy, Shashwat Sourav, Prathosh A.P.
Understanding the influence of individual or groups of training examples on a modelβs predictions lies at the core of data valuation, attribution, debugging, and privacy. Yet, due to the inherent randomness and complexity of modern ML training pipelines, reliably measuring this influence remains elusive. In fact, we show that existing influence estimation methods fail to account for this, leading to highly inconsistent results - small changes in data ordering can result in drastically different influence estimates. Instead, we introduce f-influence - a new definition of influence grounded in hypothesis testing that explicitly captures training-time randomness. We define the influence of a data subset as the statistical ease of distinguishing between the outputs of models trained with and without that subset. We prove that f-influence satisfies desirable properties such as compositionality and asymptotic normality, analogous to central limit theorems. Leveraging these properties, we design a new algorithm f- INfluence Estimation (f-INE) algorithm that efficiently estimates the influence of training data in a single training run. Our approach is a theoretically sound, highly scalable, and empirically robust alternative to prior methods, enabling consistent influence estimation β© Full Paper