David Bamman, Professor

Closed (1) Machine learning for the computational humanities

Applications for Spring 2018 are now closed for this project.

There are several research opportunities available for undergraduates in my group for Spring 2018. All are projects in the "computational humanities"; each is designed around a team, ideally composed of at least one person with strong technical skills and one domain expert (e.g., someone majoring in English, comparative literature, or film and media studies). Technical roles require strong programming skills and good performance in CS 189 (machine learning) or INFO 159 (natural language processing).

1. Charting the history of typography in printed books. You will develop methods to recognize page-level design features (e.g., font, kerning, image layout) in books and build models to predict the influence of specific books, designers, and publishers on subsequent design.

2. Natural language processing for literature. Many state-of-the-art methods in natural language processing are optimized for contemporary newswire; in this track, you will develop methods for improving NLP for literary texts and also pioneer new NLP tasks suited to this domain (such as distinguishing "narration" from "description" passages or identifying which scenes in a novel are coreferent with each other).

3. Parsing indices in printed books. An index in a printed book acts as a map to the important people, places, and concepts in the book; the human act of creating an index is a form of organization placed on the book. You will develop methods to accurately OCR and parse the page structure of indices, align them with the text they reference within the book, and model the changing style of that organizational practice over time.

Past URAP research has appeared at EMNLP 2016 and EMNLP 2017; when applying, mention the specific research project(s) you're interested in. The strongest applications will have done some legwork on the research problem (such as a basic literature review).


All projects will involve reading research literature, creating annotated data for training and evaluation, and building models using techniques from machine learning, natural language processing, and computer vision. Participation in biweekly one-hour group meetings to discuss progress and questions is required.

Projected outcomes:

Learn about different areas of NLP, machine learning, and the digital humanities; gain hands-on experience with creating a dataset (a fundamental step in data science).

Qualifications: Technical roles: strong programming skills and good performance in CS 189 (machine learning) and/or INFO 159 (natural language processing). Domain roles: major in the specific research area (e.g., English, film and media studies); upper division standing and strong interest in the application of empirical methods/data science.

Weekly Hours: 9-12 hrs

Related website: http://people.ischool.berkeley.edu/~dbamman/