David Bamman, Professor

Closed (1) Machine learning for the computational humanities

Applications for Fall 2017 are now closed for this project.

There are several research opportunities available for undergraduates in my group for the 2017-2018 year. All are projects in the "computational humanities"; each is designed around a team, ideally composed of at least one person with strong technical skills and one domain expert (e.g., majoring in English, comparative literature, or film and media studies). Technical roles require strong programming skills and good performance in CS 189 (machine learning). All roles require a research commitment for the entire academic year.

1. Charting the history of typography in printed books. You will develop methods to recognize page-level design features (e.g., font, kerning, image layout) in books and build models to predict the influence of specific books, designers and publishers on subsequent design.

2. Director attribution. What are the visual features of movies that define the style of directors? You will build on techniques in computer vision to generate features in movies, and train and evaluate classifiers to predict their director.

3. Setting coreference. Scenes in literary novels often take place in some physical location within the universe of the book. Which scenes occur at the same place? You will develop models to establish the "coreference" of settings (akin to pronominal coreference resolution in NLP).

4. Parsing indices in printed books. An index in a printed book acts as a map to the important people, places, and concepts in the book; the human act of creating an index is a form of organization placed on the book. You will develop methods to accurately OCR and parse the page structure of indices, align them with the text they reference within the book, and model the changing style of that organizational practice over time.

Past URAP research has appeared at EMNLP 2016 and EMNLP 2017; when applying, mention the specific research project(s) you're interested in. The strongest applications will have done some legwork on the research problem (such as a basic literature review).


All projects will involve reading research literature, creating annotated data for training and evaluation, and building models using techniques from machine learning, natural language processing, and computer vision. Participation in biweekly one-hour group meetings to discuss progress and questions is required.

Projected outcomes:

Learn about different areas of NLP, machine learning and the digital humanities; gain hands-on experience with creating a dataset (a fundamental step in data science).

Qualifications: Technical roles: strong programming skills and good performance in CS 189 (machine learning). Domain roles: major in the specific research area (e.g., English, film and media studies); upper division standing and strong interest in the application of empirical methods/data science.

Weekly Hours: 9-12 hrs

Related website: http://people.ischool.berkeley.edu/~dbamman/