Analyzing the robustness of automated evaluation metrics for medical information extraction with generative LLMs
Atul Butte, Professor
UC San Francisco
Applications for Fall 2024 are closed for this project.
LLMs have demonstrated tremendous potential in various domains, including healthcare. However, evaluating their outputs, especially in clinical settings, presents unique challenges. This project will compare automated metrics for assessing free-text outputs generated by LLMs against assessments from expert clinical judgment. The findings will identify which automated metrics are reliable in clinical contexts, enabling scalable future evaluations. By identifying and curating effective evaluation methods, the project aims to improve the reliability of comparisons between generative LLMs in clinical settings and to provide valuable insights for both academic research and practical implementations of LLMs in healthcare.
Role: 1. Literature Review: Conduct a comprehensive review of current evaluation methods for LLMs, focusing on metrics used in clinical contexts.
2. Expert Evaluation Collection: Collaborate with clinical experts to gather evaluations of LLM-generated content, refine the dashboards used to collect these evaluations, and analyze the resulting data.
3. Implementation of Automated Metrics: Implement automated evaluation metrics, including GPTScore, BERTScore, multi-LLM evaluation, and LLM-agent-based evaluation (an illustrative sketch follows this list).
4. Correlation Analysis: Analyze the correlation between expert clinical evaluations and the existing automated metrics (see the correlation sketch after this list).
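As an illustrative starting point for item 3, the short Python sketch below computes BERTScore for an LLM-generated sentence against a reference using the bert_score package; the candidate/reference pair is a hypothetical placeholder, and the actual texts and metric settings would come from the project's clinical datasets.

from bert_score import score

# Hypothetical candidate/reference pair; real inputs would come from the project's clinical corpora
candidates = ["Patient diagnosed with stage II ER-positive breast cancer."]
references = ["The patient has a diagnosis of stage II, ER-positive breast cancer."]

# BERTScore returns per-pair precision, recall, and F1 tensors
P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"BERTScore F1: {F1.mean().item():.3f}")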
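Similarly, for item 4, a minimal sketch of the correlation analysis, assuming automated scores and expert ratings have already been collected for the same set of outputs (the numbers below are hypothetical placeholders):

from scipy.stats import spearmanr

automated_scores = [0.82, 0.74, 0.91, 0.63, 0.88]  # e.g., BERTScore F1 per LLM output (hypothetical)
expert_ratings = [4, 3, 5, 2, 4]                    # e.g., clinician ratings on a 1-5 scale (hypothetical)

# Spearman's rank correlation is robust to scale differences between metric scores and ratings
rho, p_value = spearmanr(automated_scores, expert_ratings)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")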
Qualifications: A background in Python is necessary, and knowledge of LLMs would be useful.
Day-to-day supervisor for this project: Madhumita Sushil
Hours: 9-11 hrs
Off-Campus Research Site: Remote
Digital Humanities and Data Science; Biological & Health Sciences