Data Contribution Estimation for Machine Learning

NeurIPS 2023 Tutorial

Abstract:

Tasks enabled by data contribution estimation (DCE) aid model improvement through data improvement. While benchmark DCE evaluation tasks show application across many ML domains, DCE has limited visibility in other research domains that stand to benefit from its use cases. We propose a tutorial on data contribution estimation for machine learning to address this. The tutorial will provide an overview of DCE for machine learning and natural language processing. Following this tutorial, attendees will have gained an understanding of 1) broadly, what questions data contribution estimation aims to answer; 2) the theory and methods that are widely in use within the DCE community and can be applied to a broad range of domains; and 3) DCE from the perspectives of large language models and privacy.

Speakers:

Stephanie Schoch is a PhD candidate at the University of Virginia, where she is a member of the Information and Language Processing lab. Her research interests broadly span the assessment and application of data quality and contribution estimation in natural language processing and machine learning. Her recent work focuses on improving Shapley approximation performance through improved value function selection and improving Shapley approximation efficiency through novel sampling strategies.

Ruoxi Jia is an assistant professor in the Bradley Department of Electrical and Computer Engineering at Virginia Tech. Jia’s research interests broadly span machine learning, security, privacy, and cyber-physical systems. Jia’s recent work focuses on data-centric and trustworthy machine learning. Jia has been investigating data valuation for machine learning since 2019, and her work covers fundamental theoretical, algorithmic, and empirical aspects of data valuation. Her work on data valuation has been adopted in various use cases, such as improving dataset quality, designing incentives, and debugging ML pipelines.

Yangfeng Ji is the William Wulf Assistant Professor in the Department of Computer Science at the University of Virginia, where he leads the Information and Language Processing lab. His research interests include building machine learning models for text understanding and generation. His work on entity-driven story generation won an Outstanding Paper Award at NAACL 2018. He is a co-author of an EMNLP 2020 tutorial on text generation and a NAACL 2022 tutorial on contrastive learning.

Panelists:

Coming Soon