Sujay Sanghavi, University of Texas at Austin


Talk Title: “Ignoring Causality to Improve Ranking”

Abstract: The stereotypical ML problem setup involves training and test datasets that contain the same set of features with which to make predictions. In ranking problems specifically, however, these datasets are derived from past logs of user behavior, and these logs naturally contain acausal information. In a web search setting, for example, logs contain the position of the item a user clicked on; in an e-commerce setting, logs record which other items a user clicked on in addition to the item she ended up purchasing. Position, or "was clicked but not purchased," is clearly not a usable feature at prediction time, when a new ranked list has to be produced. In this work we show that such acausal features, while not directly usable, can be meaningfully leveraged via distillation: first train a "teacher" model that uses these acausal features, then use it to train a "student" model that does not. We first show empirically that this process yields a better final ranker on three moderate-scale open-source ranking datasets and a proprietary large-scale Amazon dataset. We then develop a simple analytical model to rigorously examine and explain when acausal features can help, and when they cannot.
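
Below is a minimal sketch of the teacher-student setup the abstract describes, using synthetic data and a pointwise relevance objective. The network shapes, the distillation weight `alpha`, and the particular feature split are illustrative assumptions, not the authors' implementation; the point is only the structure: the teacher sees acausal log-only features, the student does not, and the student learns partly from the teacher's soft predictions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic data: x_std are features available at prediction time;
# x_acausal stand in for log-only signals (e.g., position,
# "was clicked but not purchased"). All shapes are illustrative.
n, d_std, d_ac = 2048, 16, 4
x_std = torch.randn(n, d_std)
x_acausal = torch.randn(n, d_ac)
# Relevance labels depend on both feature groups.
true_logits = x_std @ torch.randn(d_std, 1) + x_acausal @ torch.randn(d_ac, 1)
y = torch.bernoulli(torch.sigmoid(true_logits))

def mlp(d_in):
    return nn.Sequential(nn.Linear(d_in, 32), nn.ReLU(), nn.Linear(32, 1))

bce = nn.BCEWithLogitsLoss()

# Step 1: the teacher trains on standard + acausal features.
teacher = mlp(d_std + d_ac)
opt = torch.optim.Adam(teacher.parameters(), lr=1e-2)
for _ in range(200):
    opt.zero_grad()
    loss = bce(teacher(torch.cat([x_std, x_acausal], dim=1)), y)
    loss.backward()
    opt.step()

# Step 2: the student sees only standard features and fits a blend of
# the true labels and the teacher's soft predictions (distillation).
with torch.no_grad():
    soft = torch.sigmoid(teacher(torch.cat([x_std, x_acausal], dim=1)))

student = mlp(d_std)
opt = torch.optim.Adam(student.parameters(), lr=1e-2)
alpha = 0.5  # assumed distillation weight, tuned in practice
for _ in range(200):
    opt.zero_grad()
    out = student(x_std)
    loss = alpha * bce(out, soft) + (1 - alpha) * bce(out, y)
    loss.backward()
    opt.step()

# At serving time only the student is used: student(x_std) scores
# candidate items, so no acausal feature is needed to produce a ranking.
```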