The tutorial discusses data-management issues that arise in machine learning pipelines deployed in production. Informed by our own experience with such large-scale pipelines, we focus on issues related to understanding, validating, cleaning, and enriching training data. The goal of the tutorial is to highlight these issues, draw connections to prior work in the database literature, and outline the open research questions that are not addressed by prior art. I will also present a high-level overview of Google's production machine learning platform and discuss our approach to some of the challenges.

Dr. Sudeep Roy

Google, USA