The history of "data science"
Shreeharsh Kelkar, Professor
Interdisciplinary Studies Field (ISF)
Applications for Fall 2024 are closed for this project.
In October 2012, the Harvard Business Review declared “data scientist” to be the “sexiest job of the 21st century.” The declaration was part of a "Spotlight package" on the power of "big data" and its potential to change organizations and management. The articles in the issue collectively argued that with the growth of the internet, as customers interacted with businesses through software-driven web applications, "companies that make a sophisticated analysis of the huge data streams now available can unlock deep insights and value." To do this well, companies needed “data scientists,” a "hybrid of data hacker, analyst, communicator, and trusted adviser." Data scientists, the article argued, had to be good not just at writing code or doing statistics but also at "speaking the language of business and helping leaders reformulate their challenges in ways that big data can tackle."
We are now twelve years past this declaration, so it is safe to say that the position of “data scientist” has become somewhat more established and institutionalized. Our own university graduated its first "data science" majors in 2018, and data science is currently the third most popular major at Berkeley after computer science and economics.
But what makes an expert identify as a "data scientist" and not something else? After all, quantitative data analysis is something many experts have been doing for decades, well before the advent of data science. Studies that look at what data scientists do are often based on people who already identify as data scientists rather than starting from people who do data-driven analysis and then asking why some of them might want to identify as data scientists.
This project seeks to understand why some experts identify as data scientists and others don't. It starts with the hypothesis that identifying as a data scientist is a strategic decision that is shaped by organizational context. We will look into the evolution of data science in one specific domain: education. The project will show how and why certain education researchers came, or did not come, to identify as data scientists.
The project will use archival work to construct a historical narrative of "data science." In the first phase of this project, to be carried out over Fall 2024, we will look at the proceedings of three conferences that are arguably about analyzing educational data but whose beginnings stretch back to before the term "data scientist" became prominent.
For more background, I encourage students to read the series of blog posts at this link: https://computingandsociety.substack.com/p/data-science
Role:
Learning outcomes:
Over the course of this project, the student will be able to:
- Articulate the history of "data science" as a field and a term
- Develop and hone research skills that involve archival work
- Develop and hone writing skills by writing a report that clearly describes the research findings and justifies the methods used
Tasks:
Over Fall 2024 (and possibly Spring 2025 if the project goes well in Fall 2024), the student will have to carry out the following tasks:
- Systematically analyze the publications of these three conferences: the Educational Data Mining (EDM) conference, the Learning at Scale (L@S) conference, and the Learning Analytics and Knowledge (LAK) conference.
- The specific conferences may change as the theory evolves.
- Read the abstracts of the papers submitted and classify them based on what they are about.
- Create a database of these publications in an Excel file (or any other software of the student's choice)
- Create a database of the authors who submit to these conferences (also in Excel). Find out some basic information about these authors including their institutional affiliation, their background, and how they identify (data scientist, learning scientist, etc.)
- Write a short report describing what the student found
Qualifications:
Desirable but not essential:
- the student should have some background and interest in the topic and history of "data science"; this could be from having taken a CS course on machine learning, a course on the history of AI, or a course in the history of science and technology (or some other background reading).
- the student should have some experience working with basic quantitative and qualitative research methods. Some experience with Excel would be good to have. The ability to quickly scan research articles for particular terms and to summarize them would also be helpful.
- Ideally, the student should be either a junior or a senior; those majoring in ISF, Media Studies, American Studies, history, anthropology, sociology, rhetoric, political science/economy, computer science, and data science are especially encouraged to apply.
Hours: 6-8 hrs
Related website: https://shreeharshkelkar.net
Related website: https://computingandsociety.substack.com/p/data-science