Data Science and Demography: Adding photographs to big historical datasets
Dennis Feehan, Professor
Demography
Applications for Fall 2025 are closed for this project.
Big historical datasets have set off an exciting wave of discoveries throughout the social sciences. These datasets have hundreds of millions of rows with rich information about historical populations, such as the complete set of responses to the 1940 US census and the record of all Social Security enrollments. They have provided the basis for important new findings related to health and mortality, migration, fertility, assimilation, and more.
These big historical datasets have been especially useful because scholars have used insights from statistics and machine learning to develop sophisticated strategies for linking different sources of data together. For example, they have linked censuses across time, so that researchers can track the trajectory of specific individuals or families from 1930 to 1940 to 1950. By observing people over time, scholars can see how, for example, economic changes may help to explain migration patterns. Another example: scholars have linked historical census records to death certificates, which allows them to use hundreds of millions of observations to study in great detail what predicts age at death.
So far, one type of information has been missing from this rich ecosystem of historical data: pictures. We may know someone's name, age, birthplace, marital status, occupation, and date of death -- but we do not know what they looked like. Yet, there is reason to think that images -- that is, pictures of people -- could help explain what drives important outcomes, such as health and death, economic success, and marriage. Through images, we may be able to measure things like people's skin tone (which may affect their experience of racial discrimination); their facial structure (which may affect their attractiveness and thus their romantic and economic welfare); how fashionable their hair/clothes are (which may be related to their personality, social networks, and socioeconomic status); and more.
Fortunately, there is a source of historical images available: school yearbooks. The Internet Archive and other sources have digitized hundreds of these historical yearbooks; typically, each one includes every student's picture and name, along with the location of the school. In this project, we will start to develop data and methods that will allow us to add images and other information from these historical yearbooks to this rich historical data ecosystem.
This project would be perfect for students who are interested in the intersection of data science and the social sciences.
Role: Key tasks will include:
- annotating historical yearbooks
- identifying names and pictures from historical yearbooks; these can help train machine learning methods for linking them
- helping to develop typologies of different types of yearbooks
- developing measures to assess quality of matches between yearbooks and census data
- reviewing the academic literature on:
  - social stratification and health (including education, race/ethnicity)
  - facial recognition and machine learning/artificial intelligence
  - computational social science
  - personality and health
- developing pipelines to link photographs to historical census records
- discovering/engineering features from the photographs that help predict important outcomes (like mortality, socioeconomic status, etc.)
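As one illustration of what a measure of match quality could look like, the sketch below compares a set of proposed yearbook-to-census links against a small hand-checked sample and reports precision and recall. This is an assumption about one possible approach, not the project's chosen measure; the function name, the identifiers, and the example links are hypothetical.

```python
def match_quality(predicted_links, true_links):
    """Compute precision and recall of predicted links against hand-checked links.

    Each link is a (yearbook_id, census_id) pair.
    Precision: share of predicted links that are correct.
    Recall: share of true links that were found.
    """
    predicted = set(predicted_links)
    truth = set(true_links)
    correct = len(predicted & truth)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(truth) if truth else 0.0
    return {"precision": precision, "recall": recall}


# Hypothetical links, invented for illustration.
predicted = [("y1", "c1"), ("y2", "c9"), ("y3", "c3")]
hand_checked = [("y1", "c1"), ("y2", "c2"), ("y3", "c3"), ("y4", "c4")]

quality = match_quality(predicted, hand_checked)
```

Reporting both numbers matters because a pipeline can look accurate simply by proposing very few links (high precision, low recall), or look complete by proposing many wrong ones (high recall, low precision).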
Learning outcomes:
- understanding the process of developing a pipeline used in an academic research project
- reading academic literature on social stratification and health
- reading academic and applied literature on evaluating machine learning pipelines
- learning and improving programming and analysis skills related to machine learning and data analysis
Qualifications:
- Interest in social sciences and data science
- Some experience with programming and data analysis (ideally Python or R)
Hours: 6-8 hrs
Digital Humanities and Data Science, Social Sciences