Benchmarking and Development of Computational Methods to Predict the Pathogenicity of Human Structural Variants
Steven Brenner, Professor
Plant and Microbial Biology
Applications for Fall 2025 are closed for this project.
Structural variants (SVs) encompass diverse genomic alterations spanning hundreds to millions of base pairs and can profoundly impact genome function by disrupting coding sequences, altering gene dosage, or perturbing regulatory landscapes. However, determining which SVs contribute to disease remains challenging because of their complex effects on gene regulation, their sparse representation in population databases compared with single-nucleotide variants, and the resource-intensive nature of experimental validation.
Current SV pathogenicity prediction methods fall into three categories: (1) aggregation-based methods that summarize existing pathogenicity scores for each base or amino acid across the SV region; (2) rule-based methods that implement expert-defined criteria, such as a framework previously recommended by ACMG; and (3) machine learning-based methods trained on labeled datasets to identify patterns distinguishing pathogenic from benign variants. All of these approaches have limitations, and the methods often disagree. Most tools were not specifically designed to distinguish rare pathogenic SVs from rare benign SVs, which is the most clinically relevant challenge. The methods also vary widely in genomic applicability, with some focusing on exonic regions and others attempting genome-wide prediction, which further complicates cross-method comparisons.
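To make the first category concrete, the following is a minimal sketch of an aggregation-based score, assuming per-base pathogenicity scores are already available as a position-to-score mapping. The function name `aggregate_sv_score`, the toy scores, and the choice of maximum and mean as summaries are illustrative assumptions, not the method of any particular published tool.

```python
from statistics import mean

def aggregate_sv_score(per_base_scores, start, end, how="max"):
    """Summarize per-base scores over the half-open interval [start, end).

    per_base_scores: dict mapping position -> score (hypothetical input).
    Positions without a score are ignored; returns None if none are scored.
    """
    values = [s for pos, s in per_base_scores.items() if start <= pos < end]
    if not values:
        return None
    if how == "max":
        return max(values)
    if how == "mean":
        return mean(values)
    raise ValueError(f"unknown summary: {how}")

# Toy per-base scores (made-up numbers, for illustration only).
toy_scores = {100: 0.1, 101: 0.9, 102: 0.2, 104: 0.4}
sv_max = aggregate_sv_score(toy_scores, 100, 105, how="max")
sv_mean = aggregate_sv_score(toy_scores, 100, 105, how="mean")
```

The two summaries can rank the same SV quite differently: a maximum highlights a single high-scoring base, while a mean dilutes it across a long region. This may be one reason methods of this kind disagree.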
This project will systematically evaluate the performance of current SV predictors and develop an ensemble model with a special focus on predicting the pathogenicity of rare variants.
Role: The student will benchmark previously published SV predictors and develop an ensemble model for SV pathogenicity prediction using curated variant datasets. Specific tasks will include:
1. Identify the training data for each method to be executed.
2. Curate variant datasets from authoritative sources, such as ClinVar and DECIPHER.
3. Collect, download, and start testing the usability of previously published SV predictors.
Long-term goals include:
1. Benchmark the SV predictors on curated datasets with known pathogenicity labels, using multiple metrics under diverse conditions.
2. Develop an ensemble model, based on various SV predictors, that focuses on predicting the pathogenicity of rare SVs.
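As a rough illustration of these two goals, the sketch below computes one benchmark metric (area under the ROC curve) for each predictor and for a simple ensemble. It assumes labels of 1 for pathogenic and 0 for benign, and it uses an unweighted average of rank-normalized scores as a stand-in ensemble. The predictor names, toy scores, and the rank-averaging scheme are hypothetical and are not the project's chosen design.

```python
def auroc(labels, scores):
    """Area under the ROC curve by pairwise comparison (ties count 0.5).

    labels: 1 for pathogenic, 0 for benign. Quadratic time, fine for a sketch.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one pathogenic and one benign variant")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def rank_normalize(scores):
    """Map scores to average ranks scaled to [0, 1].

    This puts predictors with different score scales on a common footing.
    """
    n = len(scores)
    if n < 2:
        return [0.5] * n
    return [
        (sum(t < s for t in scores) + 0.5 * (sum(t == s for t in scores) - 1)) / (n - 1)
        for s in scores
    ]

def ensemble_scores(predictor_scores):
    """Unweighted mean of rank-normalized scores across predictors."""
    normalized = [rank_normalize(s) for s in predictor_scores.values()]
    return [sum(col) / len(col) for col in zip(*normalized)]

# Toy labels and scores (made-up numbers, for illustration only).
labels = [1, 1, 0, 0, 1, 0]
toy_predictions = {
    "predictor_a": [0.9, 0.4, 0.3, 0.5, 0.8, 0.1],
    "predictor_b": [12.0, 9.0, 3.0, 2.0, 4.0, 8.0],
}
per_tool_auroc = {name: auroc(labels, s) for name, s in toy_predictions.items()}
ensemble_auroc = auroc(labels, ensemble_scores(toy_predictions))
```

A trained ensemble would replace the unweighted average with learned weights or a model fitted on held-out data. A fuller benchmark would also report additional metrics, since a single summary number can hide differences between variant types or genomic regions.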
Through this project, the student will gain experience in computational genomics, machine learning on biological data, and large-scale dataset handling.
Qualifications: (1) Willingness to learn and conduct research in a fast-paced environment. (2) Experience with machine learning models, next-generation sequencing (NGS) data analysis, and programming in Python.
(3) Candidates must:
• Attend a 3-hour lab meeting every week.
• Attend a research subgroup meeting every week.
• Adhere to all lab policies (including weekly notebooks to track research and semester reports).
• Register for credits, regardless of program-specific requirements.
(4) Applicants with a GPA under 3.6 will be considered only in exceptional circumstances.
Interest in committing to the fall and spring, and possibly the summer, is a plus.
Day-to-day supervisor for this project: Yaqi Su
Hours: 12 or more hours
Related website: https://genomeinterpretation.org
Related website: http://compbio.berkeley.edu/