Kaiyuan Hu, Jiongli Zhu, Boris Glavic, Babak Salimi.
Machine learning models are increasingly employed in high-stakes decision making in domains such as personalized medicine and policing. As data quality issues are prevalent and recovering a ground truth clean dataset is often impossible or prohibitively expensive, heuristic cleaning techniques are employed in practice to clean training data. The net result is models whose predictions fundamentally cannot be trusted, as we do not know how much the model's predictions differ from those of a model trained on the unknown ground truth clean data. We present Zorro, a principled framework for modeling the uncertainty in model parameters and predictions that arises from the multiplicity of datasets that could feasibly be the ground truth clean version of a dirty training/test dataset. Under the hood, Zorro employs a novel framework for training and prediction with linear models over uncertain data. Given training and test datasets that are subject to data quality issues, we compute a sound over-approximation of all possible models, i.e., the set of models generated by training a model on each possible clean version of the dataset, and then over-approximate all possible predictions based on these models. Using Zorro, we can certify the robustness of models, i.e., to what degree the model parameters are impacted by data quality issues, and the robustness of individual and aggregated predictions.
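To make the notion of "all possible models" and "all possible predictions" concrete, the sketch below brute-forces the semantics on a toy example: it trains one linear model per candidate clean version of the data and takes elementwise min/max over their predictions. This is purely illustrative and is not Zorro's algorithm, which computes a sound over-approximation without enumerating repairs; all function names and the synthetic data here are our own assumptions, not part of the system's API.

```python
# Illustrative sketch only (NOT Zorro's algorithm): enumerate a small set of
# candidate "clean" versions of an uncertain dataset, train a linear model on
# each, and bound the resulting predictions. Zorro instead soundly
# over-approximates this set without enumeration.
import numpy as np
from sklearn.linear_model import LinearRegression

def possible_models(candidate_datasets):
    """Train one linear model per possible clean version of the data."""
    return [LinearRegression().fit(X, y) for X, y in candidate_datasets]

def prediction_bounds(models, X_test):
    """Lower/upper bounds over all possible predictions per test point."""
    preds = np.stack([m.predict(X_test) for m in models])  # (n_models, n_test)
    return preds.min(axis=0), preds.max(axis=0)

# Toy example: two hypothetical repairs of one suspected dirty label.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)
y_alt = y.copy()
y_alt[0] += 5.0  # alternative repair of the uncertain cell
lo, hi = prediction_bounds(possible_models([(X, y), (X, y_alt)]),
                           rng.normal(size=(5, 3)))
print(np.round(hi - lo, 3))  # interval width = prediction uncertainty
```

A narrow interval (hi - lo) for a test point means its prediction is robust to the data quality issue; enumeration is exponential in the number of uncertain cells, which is why a sound over-approximation is needed in practice.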
Demo Video