Using Language Models for Text Coding Validation
Marika Landau-Wells, Professor
Political Science
Applications for Fall 2024 are closed for this project.
Social scientists often assign categorical values to text data in order to structure it (e.g., categorizing statements made by members of Congress as pro- or anti-immigration). Traditionally, this coding has been done manually by humans who read and categorize the texts of interest. This method risks both systematic error (e.g., biased coding) and unsystematic error (e.g., mistaken coding). To reduce unsystematic error, researchers often rely on multiple coders and methods of reconciliation (e.g., average score pooling). This does not address systematic error, however. Classification algorithms likewise do not address systematic error, and they have performed poorly when the sense-meaning of categories is subtle. In this project, I test a new approach to validating human coding of complex text using language models. My method leverages the independent nature of the models’ language spaces to address both types of error, while offering significant scaling advantages. The undergraduate portion of this project consists of testing the method on new datasets and developing flexible code to share with other researchers.
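As a rough illustration of the reconciliation step mentioned above, the following Python sketch pools scores from several human coders by averaging and then flags texts where the pooled score differs from a separate, independent set of codes. It is not the project's method or code: the function names, the 0/1 coding scheme, the threshold, and the data are all hypothetical, and the independent codes are simply supplied by hand rather than produced by a language model.

```python
from statistics import mean


def pool_scores(codes_by_coder):
    """Average each text's score across all coders (average score pooling)."""
    # Assumes every coder scored the same set of text ids.
    text_ids = next(iter(codes_by_coder.values())).keys()
    return {t: mean(codes[t] for codes in codes_by_coder.values()) for t in text_ids}


def flag_disagreements(pooled, independent, threshold=0.5):
    """Return ids of texts whose pooled human score differs from the
    independent score by more than `threshold` (a hypothetical cutoff)."""
    return sorted(t for t in pooled if abs(pooled[t] - independent[t]) > threshold)


# Hypothetical data: 1 = pro-immigration, 0 = anti-immigration.
human_codes = {
    "coder_a": {"s1": 1, "s2": 0, "s3": 1},
    "coder_b": {"s1": 1, "s2": 0, "s3": 0},
    "coder_c": {"s1": 1, "s2": 1, "s3": 1},
}
# Hypothetical codes from some independent source, entered by hand here.
independent_codes = {"s1": 1, "s2": 1, "s3": 1}

pooled = pool_scores(human_codes)
flagged = flag_disagreements(pooled, independent_codes)
print(pooled)
print(flagged)
```

Averaging smooths over one coder's isolated mistake (as with s3), but if most coders share the same bias the pooled score inherits it, which is why a comparison against an independent set of codes is of interest.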
Role: There are three primary tasks for undergraduates in this project, each attached to a learning outcome:
1) Locate new datasets in which to test the independent coder method. The goal is to become more familiar with text-based research in political science and the most frequently used corpora.
2) Adapt existing Python and R code to test the method. The goal is to learn the basic coding skills required to apply the method across platforms (while the model is implemented in Python, most political scientists use R).
3) Develop tools for sharing the method (optional). The goal is to develop more advanced coding skills by creating a set of scripts that is flexible enough for basic R users to use easily.
Qualifications: Some familiarity with:
1) Machine learning and/or natural language processing through coursework or prior apprenticeships
2) Python and R, including the ability to import and manipulate large datasets
Hours: to be negotiated
Related website: https://www.marikalandau-wells.com
Social Sciences