RAG pipeline for rare disease cohort identification
Madhumita Sushil, Professor
Medicine
Applications for Spring 2026 are closed for this project.
Rare diseases collectively affect 1 in 10 people worldwide, but our understanding of these diseases remains very limited. Our team has developed a retrieval-augmented generation (RAG) pipeline that embeds clinical notes and retrieves them, making it possible to search for patients diagnosed with a given disease. This supports research that advances the understanding of rare diseases. We now need to scale up the pipeline to include all patient data at UCSF, enabling search at scale.
Role:
- Embed 175 million clinical notes and serialize to a database, to scale up the previously developed RAG pipeline for rare disease identification.
- Improve the query efficiency of the pipeline if it is slow for data of this magnitude.
- Create a UI to accompany the RAG pipeline.
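To give applicants a feel for the embed, store, and search steps listed above, here is a minimal sketch. It is not the team's pipeline: the `toy_embed` function is a hashed bag-of-words stand-in for a real embedding model, the `note_embeddings` table and its columns are hypothetical, SQLite stands in for whatever database the project uses, and the brute-force scan in `search` would need to be replaced by an approximate nearest-neighbor index at the scale of 175 million notes.

```python
import hashlib
import heapq
import math
import sqlite3
import struct

DIM = 256  # toy vector size (assumption); a real embedding model defines its own


def toy_embed(text):
    """Hashed bag-of-words vector, L2-normalized. Stand-in for a real model."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def pack(vec):
    """Serialize a vector to bytes (float32) for storage in a BLOB column."""
    return struct.pack(f"{DIM}f", *vec)


def unpack(blob):
    return struct.unpack(f"{DIM}f", blob)


def index_notes(conn, notes, batch_size=1000):
    """Embed (note_id, patient_id, text) rows and write them in batches."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS note_embeddings ("
        "note_id INTEGER PRIMARY KEY, patient_id TEXT, embedding BLOB)"
    )
    insert = "INSERT INTO note_embeddings VALUES (?, ?, ?)"
    batch = []
    for note_id, patient_id, text in notes:
        batch.append((note_id, patient_id, pack(toy_embed(text))))
        if len(batch) >= batch_size:
            conn.executemany(insert, batch)
            batch.clear()
    if batch:  # flush the final partial batch
        conn.executemany(insert, batch)
    conn.commit()


def search(conn, query, k=5):
    """Return the k best (score, note_id, patient_id) tuples by cosine similarity.

    Vectors are unit length, so the dot product equals cosine similarity.
    This scans every row, which is only workable for small tables.
    """
    q = toy_embed(query)
    rows = conn.execute(
        "SELECT note_id, patient_id, embedding FROM note_embeddings"
    )
    scored = (
        (sum(a * b for a, b in zip(q, unpack(blob))), note_id, patient_id)
        for note_id, patient_id, blob in rows
    )
    return heapq.nlargest(k, scored)
```

With synthetic notes loaded through `index_notes`, a call such as `search(conn, "fabry disease", k=5)` returns the closest notes along with their patient identifiers. Batching the writes keeps memory bounded during indexing, and the full-table scan in `search` is the part most likely to need rework if queries turn out to be slow on data of this magnitude.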
Qualifications:
- Experience with LLMs and RAG pipelines.
- Experience with SQL databases.
- Keen interest in scaling LLM / RAG pipelines for efficiency on TBs of data.
Hours: to be negotiated
Off-Campus Research Site: UCSF Mission Bay. We anticipate a hybrid setup, but an in-person visit is recommended for the best experience.
Digital Humanities and Data Science, Biological & Health Sciences