R. Stuart Geiger, Staff Researcher

Closed (1) Garbage In, Garbage Out? Do Machine Learning Research Papers Report Where Training Data Comes From?

Closed. This professor is continuing with Spring 2022 apprentices on this project; no new apprentices needed for Fall 2022.

Many machine learning classifiers are trained on data labeled by humans. For example, classifiers that distinguish spam from non-spam e-mails are typically based on data from users who have flagged messages as spam or not. Many machine learning projects for new use cases recruit teams of humans to label data for a particular purpose, often using crowdsourcing platforms like Amazon Mechanical Turk. In this meta-research project, we are investigating to what extent published machine learning application papers give specific details about how humans labeled such training data. This information is crucial for building trustworthy and high-quality classifiers, but it is often not reported. We will examine a large number of cutting-edge machine learning research papers published in various fields, and for each paper, we will record answers to questions like: Does the paper report how many human labelers were involved, what their qualifications were, whether they independently checked each other’s work, how often they agreed or disagreed, and how they dealt with disagreements? Much of machine learning focuses on what to do once you have labeled training data, but this project tackles the equally important issue of whether such data is reliable in the first place. The goal of this research is to understand to what extent cutting-edge machine learning research follows best practices in reporting such data, culminating in a published scientific paper that we expect to make a significant impact across many fields.

Students will read and annotate the methods sections of machine learning papers, both in teams and independently. Students will gain familiarity with machine learning in an accessible and applied context, as well as with issues in managing and cleaning data. Note that we will not be developing our own machine learning systems, so students seeking hands-on experience with programming or advanced statistics would likely find a better fit with other projects. The time commitment is 5-10 hours per week, and a weekly meeting of 1-1.5 hours is required.

Qualifications: This research project is a good fit for students who are interested in machine learning, but it does not require knowledge of programming or of the mathematical aspects of machine learning. Roles are available for students both with and without prior experience in machine learning. Students from various class levels and majors are encouraged to apply, including but not limited to EECS, data science, engineering, math, statistics, the social sciences, and the humanities.

Weekly Hours: 6-8 hrs

Related website: http://stuartgeiger.com
Related website: http://bids.berkeley.edu