Genomics: Insight

Machine Learning and DNA Mutations

A New Era of Clinical Genomics with AlphaMissense?

Aikam S and Jeremy H
February 7, 2024

Missense Variants: Variations in Genes Coding for Proteins

A missense variant results from a point mutation (a single-nucleotide change) that causes a different amino acid to be incorporated into the protein. There are 71 million possible missense variants in the human genome, and the average person carries more than 9,000 of them. Missense mutations, while mostly benign, can substantially alter the structure and function of the resulting protein or render it nonfunctional. Sickle cell disease and SOD1-mediated ALS, a nervous system disease (SOD1 is an antioxidant enzyme that protects the cell from the toxicity of reactive oxygen species), are common examples of disease caused by missense mutations (1). Only 0.1% of all possible missense variants have been classified as pathogenic or benign by human experts. Understanding the pathogenicity, or danger, of a specific mutation therefore has significant clinical implications and could accelerate research into more complex genetic diseases.
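
To make the definition concrete, the short Python sketch below applies a single-nucleotide change to one codon and labels the result. It uses the well-known sickle cell substitution in the beta-globin gene (GAG to GTG, glutamate to valine). The partial codon table and the function name classify_point_mutation are our own illustrative choices, not part of any genomics library.

```python
# Partial codon table: only the codons needed for this illustration.
CODON_TABLE = {
    "GAG": "Glu",
    "GAA": "Glu",
    "GTG": "Val",
    "TAG": "Stop",
}

def classify_point_mutation(codon, position, new_base):
    """Apply a single-nucleotide change to a codon and label its effect."""
    mutated = codon[:position] + new_base + codon[position + 1:]
    before, after = CODON_TABLE[codon], CODON_TABLE[mutated]
    if after == before:
        effect = "synonymous"   # same amino acid, protein unchanged
    elif after == "Stop":
        effect = "nonsense"     # premature stop codon
    else:
        effect = "missense"     # a different amino acid is incorporated
    return mutated, before, after, effect

sickle = classify_point_mutation("GAG", 1, "T")   # A -> T at the middle base
silent = classify_point_mutation("GAG", 2, "A")
truncating = classify_point_mutation("GAG", 0, "T")
print(sickle)
print(silent)
print(truncating)

# 0.1% of the 71 million possible missense variants, using integer arithmetic.
expert_classified = 71_000_000 // 1000
print(expert_classified)
```

Only the first of the three changes is a missense variant; the other two show why not every point mutation alters the amino acid sequence in the same way. The last line puts the 0.1% figure in absolute terms: roughly 71,000 variants.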

Existing Protein Structure Prediction Methods

Historically, researchers have used a variety of techniques to predict protein structure. The three main methods are thread-based modeling (TBM), homology-based modeling, and free modeling (FM). Each uses multiple sequence alignment (the comparison of multiple amino acid sequences) to make its predictions. Threading, or fold recognition, matches amino acid sequences to known protein structures to inform its prediction. Because TBM relies on the existence of known structures, it might not accurately predict how a mutation affects a protein’s function or stability. Homology-based modeling is similar to TBM, but it models from the sequence alone rather than from the known structure. Because these computations can be intensive for longer amino acid sequences, FM is used instead. Free modeling is a fragment-based approach that compares fragments 20 amino acids long to a known sequence to predict similarity, which reduces the computational load because the sequence is compared in smaller pieces. All three methods draw on the Protein Data Bank, the world’s largest repository of experimentally determined protein structures. While these methods are helpful, researchers need tools that capture not only the structure of a protein but also how that structure and its function change as a result of missense variants.
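
The fragment idea can be sketched in a few lines of Python: cut a query sequence into windows of 20 residues and score each window against the windows of a known sequence by percent identity. This is a toy illustration under our own assumptions. Real fragment-based methods rely on structural fragment libraries and energy functions, and the function names and the example sequence below are made up.

```python
FRAGMENT_LENGTH = 20  # fragment size mentioned in the text

def fragments(sequence, length=FRAGMENT_LENGTH):
    """Return every overlapping window of `length` residues."""
    return [sequence[i:i + length] for i in range(len(sequence) - length + 1)]

def identity(a, b):
    """Fraction of positions at which two equal-length fragments match."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def best_matches(query, known):
    """For each query fragment, find the most similar fragment of the known sequence."""
    known_frags = fragments(known)
    results = []
    for start, frag in enumerate(fragments(query)):
        # max() returns the first index that reaches the highest identity
        best = max(range(len(known_frags)), key=lambda j: identity(frag, known_frags[j]))
        results.append((start, best, identity(frag, known_frags[best])))
    return results

# Arbitrary toy sequence (34 residues) and a copy with one substitution at index 5.
known = "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMF"
query = known[:5] + "W" + known[6:]

matches = best_matches(query, known)
for query_start, known_start, score in matches:
    print(query_start, known_start, f"{score:.2f}")
```

Windows that cover the substituted residue score 19 out of 20, while windows that miss it score a perfect match, which shows how a single change appears only in the fragments that contain it.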

In search of a faster and less computationally intensive prediction method, Google DeepMind released AlphaFold in 2018. It was more advanced than the existing methods: from an input amino acid sequence, it constructs a multiple sequence alignment from several protein sequence databases, determines which parts of the sequence are mutation prone, and detects correlations between them, in effect combining TBM and homology modeling. It greatly improved our understanding of protein structure and led to a new database, the AlphaFold Protein Structure Database, which launched with over 360,000 predicted structures and has since grown to about 200 million entries, reducing the unknown structures from 5027 to only 29 (4). However, AlphaFold still fails to predict defects in protein folding caused by point mutations (missense variants). It is also limited in predicting novel structures, since its algorithm is based on multiple sequence alignment and requires known structures to make predictions.
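
What it means for part of a sequence to be "mutation prone" can be illustrated on a toy alignment. The sketch below computes the Shannon entropy of each column of a small, made-up multiple sequence alignment: a fully conserved column scores 0 bits, and a variable column scores higher. This is not AlphaFold's algorithm, which learns such patterns with a neural network; the function name and the four toy sequences are our own assumptions.

```python
import math
from collections import Counter

def column_entropy(msa):
    """Shannon entropy in bits for each alignment column (0 = fully conserved)."""
    entropies = []
    for column in zip(*msa):          # iterate over columns, not rows
        total = len(column)
        counts = Counter(column)
        entropy = sum((n / total) * math.log2(total / n) for n in counts.values())
        entropies.append(entropy)
    return entropies

# Hypothetical aligned sequences of equal length, one per row.
toy_msa = ["MKVL", "MKIL", "MRVL", "MKAL"]

entropies = column_entropy(toy_msa)
for position, value in enumerate(entropies):
    print(position, f"{value:.3f}")
```

In this toy alignment the first and last positions never change, while the third position takes three different residues and therefore has the highest entropy.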

AlphaMissense: Accurate mutation prediction

AlphaMissense (AM) is a machine learning tool developed to predict the pathogenicity of missense variants across the human proteome. AM was calibrated and evaluated on the ClinVar database, a publicly accessible database hosted by the National Center for Biotechnology Information (NCBI) that records human genomic variations and their associations with human health. It builds on an unsupervised protein language model (a model that learns from massive sequence data sets to predict the amino acids in a sequence) and incorporates structural context using components of AlphaFold. Unlike its predecessors, it is trained in two stages. Stage one involves multiple sequence alignment reconstruction and structure prediction, as in AlphaFold. Stage two fine-tunes the model to classify missense variants using weak labels (labels derived automatically, here from how often a variant is observed in populations, rather than assigned by hand), and it is this stage that produces the pathogenicity prediction. AM classified 89% of all possible missense variants as either likely pathogenic or likely benign (5). It predicts missense pathogenicity with high accuracy (90% precision on the ClinVar dataset) (6), and its authors have released resources for community use, including predictions for all 71 million missense variants (7). AlphaMissense thus extends the AlphaFold approach from predicting structure to predicting the effect of point mutations. It is now being used to estimate the disease risk of every possible single amino acid substitution.
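
The sketch below shows, under stated assumptions, how a pathogenicity score between 0 and 1 could be turned into a call and how precision could be measured against a labeled set such as ClinVar. The two cutoffs (0.34 and 0.564) are the ones published with the AlphaMissense predictions as we understand them and should be treated as assumptions here. The function names, scores, and labels are invented for illustration.

```python
BENIGN_MAX = 0.34       # assumed cutoff: scores below this are called likely benign
PATHOGENIC_MIN = 0.564  # assumed cutoff: scores above this are called likely pathogenic

def classify(score):
    """Map a pathogenicity score in [0, 1] to one of three classes."""
    if score < BENIGN_MAX:
        return "likely_benign"
    if score > PATHOGENIC_MIN:
        return "likely_pathogenic"
    return "ambiguous"

def precision(calls, labels):
    """Of the variants called likely pathogenic, the fraction labeled pathogenic."""
    flagged = [label for call, label in zip(calls, labels) if call == "likely_pathogenic"]
    if not flagged:
        return None
    return sum(label == "pathogenic" for label in flagged) / len(flagged)

# Made-up (score, expert label) pairs standing in for a labeled benchmark.
toy = [(0.05, "benign"), (0.20, "benign"), (0.45, "benign"),
       (0.60, "benign"), (0.70, "pathogenic"), (0.91, "pathogenic")]

scores = [score for score, _ in toy]
labels = [label for _, label in toy]
calls = [classify(score) for score in scores]

# Share of variants that receive a definite (non-ambiguous) call.
resolved = sum(call != "ambiguous" for call in calls) / len(calls)
toy_precision = precision(calls, labels)
print(calls)
print(f"resolved: {resolved:.0%}, precision: {toy_precision:.2f}")
```

Moving the cutoffs trades coverage against precision: stricter cutoffs leave more variants ambiguous but make the remaining calls more reliable.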

Extracting clinical insights through AlphaMissense

AM’s ability to distinguish between benign and pathogenic variants has shown promise in clinical contexts. A recent case report by Yimin Zhang et al. (8) describes a 70-year-old patient presenting with a chronic (more than 10 years) cough and mucus production. Despite a long medical history involving multiple infections, hospital visits, and treatments, traditional diagnostic methods failed to identify the underlying genetic contributors to her condition. Pathologists extracted DNA and performed Whole Exome Sequencing (WES), identifying 297 possible missense variants, of which 48 were predicted to be pathogenic by AlphaMissense.

Further analysis using Phenolyzer, which correlates genetic variants with specific phenotypes, pinpointed two genes, CFTR and PLG, as highly relevant to the patient’s condition. CFTR mutations are known to cause Cystic Fibrosis (CF), a condition marked by vulnerability to various lung pathogens, while PLG plays a crucial role in protecting against bacterial infections and sepsis lethality. The identification of CFTR and PLG not only provides a molecular explanation for the patient’s condition but also suggests potential therapeutic targets that could improve the patient’s prognosis.

This case shows the impact of integrating advanced digital pathology tools such as AM and Phenolyzer: they streamline the identification of relevant variants and support a more personalized approach to patient care.
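
The two-step narrowing described in this case can be sketched as a simple filter: keep the variants scored as likely pathogenic, then keep those in genes linked to the patient's phenotype. The sketch does not call AlphaMissense or Phenolyzer. The cutoff, the scores, and the placeholder genes GENE_A and GENE_B are assumptions of ours, and only the gene symbols CFTR and PLG and the counts 297 and 48 come from the case report.

```python
AM_PATHOGENIC_CUTOFF = 0.564  # assumed "likely pathogenic" cutoff, not from the case report

def prioritize(variants, phenotype_genes, cutoff=AM_PATHOGENIC_CUTOFF):
    """Keep variants scored above the cutoff, then those in phenotype-linked genes."""
    predicted_pathogenic = [v for v in variants if v["am_score"] > cutoff]
    in_relevant_genes = [v for v in predicted_pathogenic if v["gene"] in phenotype_genes]
    return predicted_pathogenic, in_relevant_genes

# Hypothetical records: every score and the placeholder genes are invented.
variants = [
    {"gene": "CFTR", "am_score": 0.93},
    {"gene": "PLG", "am_score": 0.71},
    {"gene": "GENE_A", "am_score": 0.88},
    {"gene": "GENE_B", "am_score": 0.12},
    {"gene": "CFTR", "am_score": 0.08},
]
phenotype_genes = {"CFTR", "PLG"}  # stand-in for the output of a phenotype-ranking tool

predicted_pathogenic, in_relevant_genes = prioritize(variants, phenotype_genes)
relevant_gene_names = [v["gene"] for v in in_relevant_genes]
print(len(predicted_pathogenic), relevant_gene_names)

# Share of the patient's missense variants flagged in the case report: 48 of 297.
flagged_share = f"{48 / 297:.1%}"
print(flagged_share)
```

In the reported case, the first filter alone reduced the list to about 16% of the original variants, which is what makes the second, phenotype-based step manageable.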


The advent and continuous development of tools like AM represent a significant leap forward in genetic research and personalized medicine. ML tools like these enhance our understanding of complex genetic diseases and shift us away from a one-size-fits-all approach to treatment. Despite these achievements, it is important to note that while the precision of AM is impressive, no method is infallible, and the ethical dimensions of using predictive ML tools in clinical practice must be carefully navigated. Bias introduced by the lack of human sample diversity in the training data, questions of data governance and ownership, and privacy concerns all highlight the need for robust ethical frameworks and guidelines to ensure that these technologies benefit society responsibly. We chose this topic because we are interested in the intersection of computer science and clinical diagnosis, and we believe the ethical use of AM will be an essential part of that process in the future.


  2. Boillée, S., Velde, C., Cleveland, D. (2006). ALS: A Disease of Motor Neurons and Their Nonneuronal Neighbors.
  4. Bertoline, L., Lima, A., Kreiger, J., Teixeira, S. (2023). Before and after AlphaFold2: An overview of protein structure prediction.
  5. Yi, M., et al. (2024). AlphaMissense, a groundbreaking advancement in artificial intelligence for predicting the effects of missense variants.
  6. Jun Cheng et al. (2023). Accurate proteome-wide missense variant effect prediction with AlphaMissense.
  8. Tianyuang Wang (2024). Predicting valuable missense variants with AlphaMissense in a multiple pulmonary infection patient.

About the Authors

Aikam S and Jeremy H

Aikam Singh and Jeremy Hsieh are high school juniors at Polytechnic School. They are passionate about computer science and are interested in the applications of machine learning in disease prediction and how artificial intelligence tools can improve the accuracy of clinical diagnosis.