Heather Haveman, Professor

Closed (1) Tech Has a Gender Problem

Applications for fall 2021 are now closed for this project.

The tech sector has a gender problem: women are underrepresented in engineering and management jobs. One reason may be that tech firms’ corporate cultures are misogynistic (“bro culture”). Female tech workers have complained loudly and clearly about discrimination and harassment. They are not alone: according to a survey by the Pew Research Center, over 70% of Americans perceive that discrimination against women is a problem in the tech industry. But tech firms are not all alike. Instead, they vary greatly in both the representation of women in engineering and management positions, and in their corporate cultures. This project seeks to document variation in tech firms’ cultures and measure associations between those cultures and the representation of women. To capture tech firms’ cultures, we will use data from Glassdoor.com, a web platform that allows employees to comment on their firms.

We will have two sets of apprentices, both of which will work closely with Professor Haveman in weekly meetings, either in person or via Zoom.
1) URAP participants continuing from last year will help with analysis of data on employee reviews from Glassdoor.com, using pretrained word embeddings to assess how reviews line up on two axes of cultural difference: gender (male/female) and age (young/old). They will also help us analyze the content of #MeToo movement posts on Twitter.

2) New URAP participants will help us collect and code data on the composition of large tech firms' leadership, using public data available through the business school library. Students will work independently, then come together to cross-validate their work and reconcile differences in coding, under the supervision of Professor Heather Haveman and graduate student Jasmine Sanders.
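The word-embedding approach in (1) can be sketched in miniature. The idea is to build a cultural axis from seed-word pairs (e.g., he/she) and project other words onto it. The tiny 4-dimensional vectors and word list below are hypothetical stand-ins for real pretrained embeddings (e.g., word2vec or GloVe), used only to illustrate the mechanics:

```python
import numpy as np

# Hypothetical toy embeddings; a real analysis would load pretrained
# vectors (e.g., GloVe or word2vec) covering the review vocabulary.
emb = {
    "he":         np.array([ 1.0, 0.2, 0.1, 0.0]),
    "she":        np.array([-1.0, 0.2, 0.1, 0.0]),
    "man":        np.array([ 0.9, 0.1, 0.2, 0.1]),
    "woman":      np.array([-0.9, 0.1, 0.2, 0.1]),
    "aggressive": np.array([ 0.5, 0.3, 0.0, 0.2]),
    "supportive": np.array([-0.4, 0.3, 0.1, 0.2]),
}

def cultural_axis(pairs):
    """Average the difference vectors of seed pairs to define an axis."""
    diffs = [emb[a] - emb[b] for a, b in pairs]
    axis = np.mean(diffs, axis=0)
    return axis / np.linalg.norm(axis)

def project(word, axis):
    """Normalized projection of a word onto the axis: positive values
    lean toward the first pole (here, male), negative toward the second."""
    v = emb[word]
    return float(np.dot(v, axis) / np.linalg.norm(v))

gender_axis = cultural_axis([("he", "she"), ("man", "woman")])
print(project("aggressive", gender_axis))  # positive: leans male pole
print(project("supportive", gender_axis))  # negative: leans female pole
```

Averaging a review's word projections then gives a rough position for the whole review on the gender axis; the age (young/old) axis works the same way with different seed pairs.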

Day-to-day supervisor for this project: Jasmine Sanders, Graduate Student

Qualifications: In general, I value apprentices who pay close attention to detail, are enthusiastic, and can stick to a schedule and follow through on deliverables. Apprentices must be willing to attend carefully to the details of their coding assignments.

Weekly Hours: 6-8 hrs

Off-Campus Research Site: zoom link

Related website: http://www.heatherhaveman.net/


Closed (3) Web-Crawling for All: Toward a Universal, Accessible Framework for Collecting Digital Data

Applications for fall 2021 are now closed for this project.

When it comes to data collection, web-crawling (i.e., web-scraping, screen-scraping) is a common approach in our increasingly digital era--and a common stumbling block. With such a wide range of tools and languages available (Selenium, Requests, and HTML, to name just a few), developing and implementing a web-crawling pipeline is often a frustrating experience for researchers--especially those without a computer science background. There is a pressing need for a universal web-crawling pipeline able to scrape text and objects across website formats and research objectives. We are developing such a pipeline using versatile, scalable architecture built around Python's scrapy module, and we need your help to finalize, apply, fine-tune, and distribute what we've built. Will you join us?

Our main goal is a scalable, robust web-crawling pipeline applicable across web designs and accessible for researchers with minimal computational skills. Our method involves using scrapy spiders on our virtual machines (VMs) to recursively gather website links (to a given depth), collect and parse items (text, images, PDFs, and .docs), and save them to a robust virtual database. The spiders are coordinated using a big data architecture (with multiple containers) consisting of Redis and Flask (crawler management), Node (back-end), React (front-end), and MongoDB (database management). Downstream features on our bucket list include real-time metrics and access to scraped data, error checks and backup scrapers (including the simple wget algorithm), and toggles for capturing data over time with the Internet Archive.
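The core recursive link-gathering step can be illustrated without scrapy's machinery. The sketch below is a minimal, stdlib-only stand-in: the in-memory SITE dict and its pages are invented for illustration, and in the real pipeline scrapy handles fetching, scheduling, and item parsing:

```python
from html.parser import HTMLParser

# A hypothetical in-memory "site": page URL -> HTML body. A real crawl
# would fetch these pages over HTTP (scrapy's job in our pipeline).
SITE = {
    "/":        '<a href="/about">About</a> <a href="/staff">Staff</a>',
    "/about":   '<a href="/history">History</a>',
    "/staff":   '',
    "/history": '<a href="/archive">Archive</a>',
    "/archive": '',
}

class LinkExtractor(HTMLParser):
    """Collect href targets from anchor tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links += [v for k, v in attrs if k == "href"]

def crawl(start, max_depth):
    """Breadth-first link gathering to a given depth, with no revisits,
    mirroring the depth-limited recursion of the scrapy spiders."""
    seen, frontier = {start}, [start]
    for _ in range(max_depth):
        nxt = []
        for url in frontier:
            parser = LinkExtractor()
            parser.feed(SITE.get(url, ""))
            for link in parser.links:
                if link not in seen:
                    seen.add(link)
                    nxt.append(link)
        frontier = nxt
    return seen

print(sorted(crawl("/", 2)))  # depth 2 reaches /history but not /archive
```

In the full architecture, each page visit would also emit parsed items (text, images, PDFs) into the MongoDB store rather than just collecting URLs.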

In Spring 2021 we dockerized and parallelized the crawling architecture, created a pilot web interface, and achieved basic functionality. We have three interwoven goals for Fall 2021: to crawl all 100,000 or so U.S. school websites, to release a fully working alpha version of the crawling app, and to publish a paper on our method (probably in the Journal of Statistical Software). Crawling school websites could spawn a new era of educational research, following the lead of a seminal paper that downloaded and analyzed the websites of all charter schools (https://bit.ly/BIDS-post-2020). New directions in such work include studies of school curricula, race and class segregation, and disciplinary regimes.

Project timeline this semester:
- By end of September: New members on-boarded and educated, data architecture debugged
- By end of October: Alpha web interface and pipeline finalized, real-time access to scraped data implemented
- By end of November: Proof of concept: Complete school data crawled via pipeline
- By end of semester: Release alpha version of Crawl4All, submit paper for publication

Apprentices are expected to work steadily and collaboratively on coding tasks that develop a universal, accessible web-crawling pipeline. Onboarding will involve reading blogs and documentation the first few weeks. Team members will communicate regularly with the team, push code to git consistently (with branch-based documentation), and mark completed tasks on Trello. Their code should be functional, clean, and constantly improving--so too their teamwork and leadership.

Our team will include apprentices at UC Berkeley and Georgetown University; one group will develop the web interface and APIs, another will fine-tune the crawling and data architecture. We will coordinate tasks and code via Trello, Slack, and git. We will have short, weekly virtual stand-ups to report progress and biweekly meetings to review new features and check in on broader goals. Apprentices will regularly study and take notes on code examples and documentation (see them at https://bit.ly/scrapy-notes).

Day-to-day supervisor for this project: Jaren Haber, PhD, Post-Doc

Qualifications: We are looking for independent thinkers/tinkerers with significant Python experience to join our web-scraping team. Applicants must be comfortable with git and GitHub, Jupyter notebooks (and maybe an IDE of their choice, like PyCharm), virtual/cloud computing environments (like Docker containers or Google Cloud), and web-crawling/scraping. Other important qualities include a taste for challenge, openness to feedback, and integrity in timely completion of tasks. They must also commit to producing readable, well-documented code in a modular format. Experience with Natural Language Processing (NLP) is also a plus. Specific technical skills we seek: Python (advanced), Web-crawling (advanced), Web development (intermediate), Scrapy (beginner or better), Cloud computing (advanced), Docker (intermediate), Database management (intermediate), MongoDB (beginner or better), Natural language processing (intermediate), Shell (beginner or better). In your application, please respond to these specific questions: 1. What web-crawling projects have you been involved in? 2. What experience do you have applying big data modules, APIs, or related languages? (e.g., MongoDB, Flask, React) 3. What would you like to do with a universal crawling app? What does it make possible?

Weekly Hours: to be negotiated

Off-Campus Research Site: Zoom

Related website: https://bit.ly/BIDS-post-2020
Related website: https://bit.ly/scrapy-notes

Closed (4) Creating and Testing Algorithms (Open to Technical and Non-Technical URAPers)

Closed. This professor is continuing with Spring 2021 apprentices on this project; no new apprentices needed for Fall 2021.

This project has two interconnected parts related to creating and testing algorithms for social science research. These two parts require distinct skillsets. Part A requires advanced coding skills (Python) and is open to EECS and Data Science majors. Part B does not require coding skills and is open to students from all fields.

Part A: Natural Language Processing/Machine Learning (Technical Only)

For the first part of this project, URAPers will apply a variety of machine learning and natural language processing techniques to two different contexts: 1.) aviation accidents and 2.) the cannabis industry. For the aviation context, students will analyze how different features of NASA Accident Reports aid or hamper organizational learning among air traffic controllers. For the cannabis context, URAPers will compare the language used by legal vs. illegal cannabis dispensaries as they compete with one another.

Project teams will be assisted by a GitLab Enterprise Solutions Architect who is certified in SCRUM, Kanban, Lean, and Business Agility project management frameworks.

Applicants should have advanced Python and programming skills, with at least two years of coding experience, and should have excelled in the introductory EECS sequence (CS 61A, 61B, and 61C).

Part B: Collecting and classifying data on the cannabis industry (Open to Everyone)

For the second part of this project, URAPers will help us systematically categorize cannabis product descriptions and cannabis dispensary “About me” sections along various product and organizational dimensions. By doing so, Research Apprentices will provide the “inputs” for machine learning algorithms that URAPers in Part A will be creating. Using both team efforts, we will explore how the presence of the illegal market affects the strategies pursued by legal dispensaries. No computer science experience is necessary for this project.

Research assistants will also help us collect data on cannabis dispensary license owners so that we can survey the degree of diversity in the legal cannabis markets. News coverage and unscientific surveys suggest that the cannabis industry lacks diversity, being primarily made up of white men. We will test this popular contention empirically by conducting demographic and biographical research on the industry.

Students will be provided with a list of cannabis dispensary licensees and will look up each individual, entering their biographical and demographic information into a survey form that we provide. In so doing, we will be able to 1.) provide the first empirical analysis of diversity in the cannabis industry and 2.) see what legal, social, and economic factors predict the demographic and biographical attributes of cannabis dispensary owners.

We are seeking students who are meticulous, punctual, and excited about the project.

Qualifications: See above.

Weekly Hours: to be negotiated

Off-Campus Research Site: Zoom

Related website: www.heatherhaveman.net
Related website: http://bids.berkeley.edu/people/cyrus-dioun