Quantitative Text Analysis and the European Union
Christopher Ansell, Professor
Political Science
Closed. This professor is continuing with Fall 2023 apprentices on this project; no new apprentices needed for Spring 2024.
Parliamentary questions are one of the key means of oversight in the European Union. This project aims to analyze how this oversight process is used at the European level through large-scale quantitative text analysis, machine learning, and statistical methods. It also aims to finalize the data structure for a large-N text dataset that will be made publicly available at the conclusion of the research.
Role: Students will assist with the development of the programs and algorithms needed to analyze the text and to optimize the data structure. This will take the form of coding, largely in Python and R, for large-scale text analysis and natural language processing. An appropriate level of technical skill is a prerequisite.
*FOR SPRING 2022 EXTENDED APPLICATIONS, WE ARE ONLY INTERESTED IN SUBPROJECTS 1 and 2*
We are seeking students for 4 separate sub-projects! Please indicate which project(s) you might be interested in when applying. The first two sub-projects would suit students interested in the early stages of data science work, while the last sub-project would especially suit students interested in continuing real-world applications of machine learning and optimization. Overlap between sub-projects, and transitioning to another sub-project when one is complete, are both possible.
1. The first phase of data wrangling is done, but the second phase remains. It involves messy real-world data, which must be parsed and reorganized to make it usable for text analysis. This will likely involve significant parsing with regular expressions and rely heavily on Python. The end product will be a dataset that we will make publicly available.
2. We are working to merge and refine several data sources, as well as clean the existing and new data sources. This will involve using a variety of techniques to account for inconsistent joining keys, and designing and structuring the data so that it can be used efficiently.
3. We are working to develop and optimize a text classification scheme to assess the presence of Euroskepticism in questions through machine learning. Two preliminary models have been devised, using BERT and Universal Sentence Encoder. We are continuing to optimize the classification scheme, using our training data, so that it can be applied to the remainder of the large dataset that is being finalized in subprojects 1 and 2. The tasks for this semester are largely about optimizing machine learning classification models, so experience with optimization, imbalanced-learning methods, and machine learning classification is highly beneficial.
4. We will be optimizing and refining existing topic modeling projects, which have been ongoing for several semesters. As the final data from subprojects 1 and 2 are merged, we will begin a new round of the topic modeling process, which combines both computer analysis and inductive interpretation. This subproject will require familiarity with R. Familiarity with the STM package would be a bonus, but it is not a requirement. Some graphing/visualization is also required, though it will likely be based on existing code (ggplot).
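For applicants wondering what "inconsistent joining keys" in sub-project 2 can look like in practice, here is a minimal Python sketch. It is an illustration only, not the project's actual code: the function names, the field names (asker, name, party), and the sample records are all hypothetical, and it assumes the keys are personal names that differ in accents, capitalization, punctuation, and word order.

```python
import re
import unicodedata


def normalize_key(name: str) -> str:
    """Reduce a name to a comparable key (hypothetical helper)."""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # Lowercase and replace every run of non-alphanumerics with one space.
    cleaned = re.sub(r"[^a-z0-9]+", " ", no_accents.lower()).strip()
    # Sort tokens so "SURNAME, Given" and "Given Surname" give the same key.
    return " ".join(sorted(cleaned.split()))


def merge_on_name(left, right, left_field, right_field):
    """Join two lists of dicts on normalized names.

    Returns (merged, unmatched): merged rows combine both records, with
    values from the left record winning on shared field names; unmatched
    holds left rows that found no partner and need manual review.
    """
    index = {normalize_key(row[right_field]): row for row in right}
    merged, unmatched = [], []
    for row in left:
        match = index.get(normalize_key(row[left_field]))
        if match is None:
            unmatched.append(row)
        else:
            merged.append({**match, **row})
    return merged, unmatched


# Hypothetical sample records, invented for illustration.
questions = [
    {"asker": "MÜLLER, Anna", "question_id": "Q1"},
    {"asker": "Unknown Person", "question_id": "Q2"},
]
members = [{"name": "Anna Müller", "party": "Example Party"}]

merged, unmatched = merge_on_name(questions, members, "asker", "name")
```

Letters that do not decompose into a base letter plus an accent are simply dropped by this sketch, and distinct people with the same normalized name would collide, so real data would need further handling. Keeping the unmatched rows separate, rather than silently discarding them, is one way to make those cases visible.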
Students will gain experience with developing the tools necessary for large-N quantitative text analysis and hands-on experience with designing a dataset of significant size for analysis, as well as practical experience with using programming languages on a real-world project. Aspects of the project deal with structural topic modeling, sentiment analysis, machine learning, natural language processing, and statistical methods.
In addition to developing technical skills using real-life data, students will discover the way that institutions and oversight work within the complex multilevel governance of the European Union. Students will receive hands-on training in data collection and in interpreting the relationships and networks between levels of government in Europe, along with their strengths and weaknesses. Their efforts will significantly inform research that will explore policymaking in contemporary Europe. Beyond substantive areas, students will learn about both the early and implementation stages of large research projects, formulating research questions, and collecting/interpreting data.
Qualifications: Please clearly indicate which project you are applying for!
Students should be comfortable in Python and/or R. Some degree of experience with or interest in text analysis or natural language processing would be beneficial, but it is not necessarily required if students are willing to learn techniques in this area. In terms of coursework, CS61A is likely required. CS61B is preferred. An upper division analysis class (CS70, DATA 100, or equivalent) would also be preferred for data analysis and inference. Machine learning experience is preferred for sub-project 3, and R is especially important for sub-project 4. Other statistical background may also be helpful, especially experience with discontinuity designs.
Hours: to be negotiated
Off-Campus Research Site: remote
Social Sciences Arts & Humanities