Berkeley University of California

URAP

Project Descriptions
Spring 2026


RAG pipeline for rare disease cohort identification

Madhumita Sushil, Professor  
Medicine  

Applications for Spring 2026 are closed for this project.

Rare diseases affect 1 in 10 people worldwide, yet they remain poorly understood. Our team has developed a RAG pipeline that embeds and retrieves clinical notes so that researchers can search for patients diagnosed with a given disease, which supports research into these diseases. We now need to scale the pipeline up to cover all patient data at UCSF, enabling search at scale.

Role:
- Embed 175 million clinical notes and serialize the embeddings to a database, scaling up the previously developed RAG pipeline for rare disease identification.
- Improve the query efficiency of the pipeline if it proves slow on data of this magnitude.
- Create a UI to accompany the RAG pipeline.
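
To make the first two tasks concrete, here is a minimal Python sketch of batched embedding, serialization to a SQL table, and a brute-force similarity search. It is an illustration under stated assumptions, not the team's actual pipeline: `toy_embed` is a hypothetical stand-in for a real embedding model, the `note_embeddings` table and all function names are invented for this example, and SQLite stands in for whatever database the project uses.

```python
import hashlib
import sqlite3
import struct
from itertools import islice

DIM = 8  # toy dimensionality (assumption); real embedding models use far more


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: hashes tokens into a unit vector."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        vec[digest[0] % DIM] += 1.0
    norm = sum(x * x for x in vec) ** 0.5
    return [x / norm for x in vec] if norm else vec


def pack(vec):
    """Serialize a vector as float32 bytes for storage in a BLOB column."""
    return struct.pack(f"{len(vec)}f", *vec)


def unpack(blob):
    """Inverse of pack(): float32 bytes back to a list of floats."""
    return list(struct.unpack(f"{len(blob) // 4}f", blob))


def batched(iterable, size):
    """Yield lists of at most `size` items so notes are never all held in memory."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def index_notes(conn, notes, batch_size=1000):
    """Embed (note_id, text) pairs and write them to the database one batch at a time."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS note_embeddings ("
        "note_id INTEGER PRIMARY KEY, embedding BLOB NOT NULL)"
    )
    for batch in batched(notes, batch_size):
        rows = [(note_id, pack(toy_embed(text))) for note_id, text in batch]
        with conn:  # one transaction per batch
            conn.executemany(
                "INSERT OR REPLACE INTO note_embeddings VALUES (?, ?)", rows
            )


def search(conn, query, k=3):
    """Brute-force search: score every stored vector against the query by dot product."""
    q = toy_embed(query)
    scored = []
    for note_id, blob in conn.execute(
        "SELECT note_id, embedding FROM note_embeddings"
    ):
        score = sum(a * b for a, b in zip(q, unpack(blob)))
        scored.append((score, note_id))
    scored.sort(key=lambda s: (-s[0], s[1]))  # best score first, ties by note_id
    return [note_id for _, note_id in scored[:k]]
```

The full table scan in `search` is the part that would likely need attention at 175 million notes. An approximate nearest-neighbor index is a common way to avoid scoring every vector, though which approach suits this pipeline is not stated in the listing.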

Qualifications:
- Experience with LLMs and RAG pipelines.
- Experience with SQL databases.
- Keen interest in scaling LLM / RAG pipelines for efficiency on TBs of data.

Hours: to be negotiated

Off-Campus Research Site: UCSF Mission Bay. We anticipate a hybrid setup, but an in-person visit is recommended for the best experience.

 Digital Humanities and Data Science   Biological & Health Sciences


Office of Undergraduate Interdisciplinary Studies, Undergraduate Division
College of Letters & Science, University of California, Berkeley