R. Stuart Geiger, Staff Researcher

Garbage In, Garbage Out? Do Machine Learning Research Papers Report Where Training Data Comes From?

Applications for Spring 2019 are now closed for this project.

Many machine learning classifiers are trained on data labeled by humans. For example, classifiers for distinguishing between spam and non-spam e-mails are typically based on data from users who have flagged messages as spam or not. Many machine learning projects for new use cases recruit teams of humans to label data for a particular purpose, often using crowdsourcing platforms like Amazon Mechanical Turk. In this project, we are investigating to what extent published machine learning application papers give specific details about how humans labeled such training data. This information is crucial for building trustworthy and high-quality classifiers, but it is often not reported. We will examine a large number of cutting-edge machine learning research papers published in various fields, and for each paper, we will record answers to questions like: Does the paper report how many human labelers were involved, what their qualifications were, whether they independently checked each other’s work, how often they agreed or disagreed, and how they dealt with disagreements? Much of machine learning focuses on what to do once you have labeled training data, but this project tackles the equally important question of whether such data is reliable in the first place. The goal of this research is to understand to what extent cutting-edge machine learning research follows best practices in reporting such data, culminating in a published scientific paper that we expect to have a significant impact across many fields.

Students will be reading and annotating the methods sections of machine learning papers, both in teams and independently. Students will gain familiarity with machine learning in an accessible and applied context. There are roles for those interested only in annotating papers, as well as for those also interested in doing basic statistical analyses and visualizations of the results -- asking questions such as whether machine learning papers in certain disciplines report these details more often than papers in other disciplines. The time commitment ranges from 5-10 hours/week.
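To give a sense of what the analysis side of the project might involve, here is a minimal sketch in Python, assuming a hypothetical table of annotations. The column names, disciplines, and values below are illustrative examples, not the project's actual coding scheme or data.

    # Hypothetical sketch: each row is one annotated paper, recording its
    # discipline and whether its methods section reports certain labeling details.
    import pandas as pd

    annotations = pd.DataFrame([
        {"discipline": "NLP",      "reports_num_labelers": True,  "reports_agreement": True},
        {"discipline": "NLP",      "reports_num_labelers": False, "reports_agreement": False},
        {"discipline": "Medicine", "reports_num_labelers": True,  "reports_agreement": False},
        {"discipline": "Medicine", "reports_num_labelers": True,  "reports_agreement": True},
    ])

    # Share of papers in each discipline that report each detail.
    reporting_rates = annotations.groupby("discipline")[
        ["reports_num_labelers", "reports_agreement"]
    ].mean()
    print(reporting_rates)

Analyses like this could then feed simple visualizations comparing reporting rates across disciplines.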

Qualifications: This research project is a good fit for students who are interested in machine learning, but it does not require knowledge of programming or of the mathematical aspects of machine learning. Roles are available both for those with and without prior experience in machine learning. Students from various class levels and majors are encouraged to apply, including but not limited to EECS, data science, engineering, math, statistics, the social sciences, and the humanities.

Weekly Hours: 6-8 hrs

Related website: http://stuartgeiger.com
Related website: http://bids.berkeley.edu