Machine Learning Approaches for Automated Cell Type Classification in Single-Cell Genomics Data
Peng He, Professor
UC San Francisco
Applications for Spring 2025 are closed for this project.
This research project focuses on developing and optimizing machine learning tools that automatically identify cell types and states from single-cell genomics data. Single-cell technologies have transformed our understanding of cellular diversity, but manual annotation of cell types remains a significant bottleneck in data analysis. The project addresses this bottleneck by building supervised learning models that classify cells based on their molecular profiles.
The research involves working with two distinct types of single-cell data:
Single-cell RNA sequencing (scRNA-seq), which measures gene expression levels
Single-cell ATAC sequencing (scATAC-seq), which measures chromatin accessibility
The project will use existing annotated datasets, especially the high-resolution cell atlases established in our lab, to train and optimize classification models. It will explore a range of machine learning approaches and parameter optimization strategies to achieve accurate and robust cell type prediction.
Role: The undergraduate researcher will be actively involved in the following tasks:
For scRNA-seq Analysis:
Process and prepare transcript count matrices for machine learning applications
Implement and test various classification algorithms
Conduct systematic parameter optimization experiments
Evaluate model performance using standard metrics
Document results and maintain detailed experimental records
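As a rough illustration of the scRNA-seq tasks above, the sketch below normalizes a count matrix, sweeps one classifier parameter on a validation split, and reports accuracy on held-out cells. It is a minimal example under stated assumptions, not the lab's pipeline: the data are synthetic, the k-nearest-neighbor classifier and the 1e4 scaling target are illustrative choices, and all function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for an annotated count matrix (cells x genes).
# Three hypothetical cell types, each with its own block of marker genes.
n_per_type, n_genes, n_types = 100, 50, 3
labels = np.repeat(np.arange(n_types), n_per_type)
rates = np.full((n_types, n_genes), 5.0)
for t in range(n_types):
    rates[t, t * 10:(t + 1) * 10] = 25.0  # marker genes for type t
counts = rng.poisson(rates[labels])


def normalize_log1p(counts, target_sum=1e4):
    """Scale each cell to the same total count, then apply log(1 + x)."""
    totals = counts.sum(axis=1, keepdims=True)
    return np.log1p(counts / totals * target_sum)


def knn_predict(X_train, y_train, X_query, k):
    """Predict labels by majority vote among the k nearest training cells."""
    # Squared Euclidean distance between every query cell and training cell.
    d2 = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1)[:, :k]
    return np.array([np.bincount(v).argmax() for v in y_train[nearest]])


def accuracy(y_true, y_pred):
    """Fraction of cells whose predicted label matches the annotation."""
    return float(np.mean(y_true == y_pred))


X = normalize_log1p(counts)

# Shuffle, then split into train / validation / test sets.
order = rng.permutation(labels.size)
train, val, test = order[:180], order[180:240], order[240:]

# Sweep one parameter (k) using the validation set only.
val_scores = {
    k: accuracy(labels[val], knn_predict(X[train], labels[train], X[val], k))
    for k in (1, 5, 15)
}
best_k = max(val_scores, key=val_scores.get)

# Report performance once on the held-out test set.
test_acc = accuracy(
    labels[test], knn_predict(X[train], labels[train], X[test], best_k)
)
print(f"validation scores: {val_scores}, best k: {best_k}, test accuracy: {test_acc:.2f}")
```

Keeping the test cells out of the parameter sweep is the main design point: the validation split selects k, and the test split is touched only once for the final estimate.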
For scATAC-seq Analysis:
Compare dimensionality reduction techniques for deriving model input features
Analyze both peak score matrices and genomic bin score matrices
Implement and evaluate various classification approaches
Optimize model parameters for improved accuracy
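The scATAC-seq comparison above could be prototyped along the following lines. This is only a sketch under assumptions: the binarized peak matrix is synthetic, TF-IDF weighting followed by PCA stands in for one commonly used family of reductions, the nearest-centroid classifier is a placeholder for whatever model is being evaluated, and every function name is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in for a binarized peak matrix (cells x peaks).
n_per_type, n_peaks, n_types = 80, 200, 3
labels = np.repeat(np.arange(n_types), n_per_type)
open_prob = np.full((n_types, n_peaks), 0.05)
for t in range(n_types):
    open_prob[t, t * 40:(t + 1) * 40] = 0.4  # peaks preferentially open in type t
peaks = (rng.random((labels.size, n_peaks)) < open_prob[labels]).astype(float)


def fit_idf(X_train):
    """Inverse peak frequency from training cells (+1 pseudocount avoids division by zero)."""
    return np.log1p(X_train.shape[0] / (1.0 + X_train.sum(axis=0)))


def tfidf(X, idf):
    """Per-cell term frequency multiplied by the inverse peak frequency."""
    return X / X.sum(axis=1, keepdims=True) * idf


def fit_pca(X_train, n_components):
    """Return the training mean and the top principal axes (via SVD)."""
    mean = X_train.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
    return mean, Vt[:n_components]


def project(X, mean, components):
    """Project cells onto previously fitted principal axes."""
    return (X - mean) @ components.T


def nearest_centroid_accuracy(Z_train, y_train, Z_test, y_test):
    """Assign each test cell to the closest class centroid and score accuracy."""
    classes = np.unique(y_train)
    centroids = np.stack([Z_train[y_train == c].mean(axis=0) for c in classes])
    d2 = ((Z_test[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return float(np.mean(classes[d2.argmin(axis=1)] == y_test))


order = rng.permutation(labels.size)
train, test = order[:160], order[160:]

# Two candidate feature representations, both fitted on training cells only.
idf = fit_idf(peaks[train])
features = {
    "PCA on binarized peaks": peaks,
    "PCA on TF-IDF-weighted peaks": tfidf(peaks, idf),
}

scores = {}
for name, X in features.items():
    mean, comps = fit_pca(X[train], n_components=10)
    scores[name] = nearest_centroid_accuracy(
        project(X[train], mean, comps), labels[train],
        project(X[test], mean, comps), labels[test],
    )
print(scores)
```

The same loop extends to genomic bin score matrices or other reductions by adding entries to the features dictionary, so each representation is scored with an identical train/test split.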
Learning Outcomes:
Gain practical experience in machine learning and bioinformatics
Develop proficiency in programming for biological data analysis
Learn essential concepts in single-cell genomics
Acquire skills in data visualization and scientific documentation
Understand the principles of model optimization and evaluation
Experience working with large-scale biological datasets
Qualifications:
Strong programming experience in Python or R
Basic understanding of statistics and probability
Familiarity with linear algebra concepts
Experience with data analysis and visualization
Day-to-day supervisor for this project: Konstantinos Stasinos, Post-Doc
Hours: to be negotiated
Off-Campus Research Site: Hybrid/remote working is also allowed
Related website: https://peng-he-lab.github.io/
Related website: https://profiles.ucsf.edu/peng.he