Computational inference of evolutionary history gives clues to genetic variants | 2021-11-04
Using an unsupervised machine learning approach to examine genetic variation across protein-coding sequences from 140,000 species, researchers at Harvard Medical School and the University of Oxford have developed a new classification system for genetic variants that performs on par with wet lab approaches.
One of the eye-catching aspects of the study is the ability to make pathogenicity predictions from evolutionary data, opening up the possibility that “in the future we will go to a clinician and these scores will be used to establish a diagnosis … informed by a sequence obtained from an obscure organism at the bottom of the ocean,” Pascal Notin told BioWorld Science.
Notin, Mafalda Dias and Jonathan Frazer are postdoctoral researchers and co-authors of the article describing their method, which they named the evolutionary model of variant effect (EVE). The article appeared in the October 27, 2021 issue of Nature.
Genome sequencing is now cheap enough that, in principle, it can become part of routine medical care.
But if obtaining sequences has become trivial, their interpretation remains anything but.
Even in the best-studied disease genes, such as BRCA, there are variants of unknown significance (VUSs). And in genes that are less studied, or less strongly linked to disease risk, than BRCA (that is, most of them), these VUSs are more the rule than the exception.
Individuals differ from each other, on average, in about 0.1% of their genome, which corresponds to about 3 million base pairs. But “only 2% of the variants have a clinical annotation to date,” Frazer told BioWorld Science.
Even before the advent of massive computing power, sequence alignment was used to better understand the functional importance of single amino acids.
When DNA sequences are highly conserved across species, this usually indicates that they are functionally important. If an amino acid remains unchanged during evolution, “changing that amino acid is very likely to be damaging,” Frazer said. “We have known this for a long time.”
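The conservation idea can be made concrete with a small sketch. The Python below measures how much each position varies across a toy alignment using Shannon entropy; the sequences and function names are hypothetical, and this is the classic per-position measure rather than anything specific to EVE.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of the amino acids seen at one aligned position."""
    counts = Counter(column)
    total = len(column)
    # p * log2(1/p) summed over observed residues; 0.0 when only one residue occurs
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def conservation_profile(alignment):
    """Entropy per aligned position; 0.0 means the position never varies."""
    return [column_entropy(column) for column in zip(*alignment)]


# Hypothetical toy alignment: the same protein fragment from four species
toy_alignment = ["MKVLA", "MKILA", "MRVLA", "MKVLG"]
profile = conservation_profile(toy_alignment)

for position, entropy in enumerate(profile):
    label = "conserved" if entropy == 0.0 else "variable"
    print(f"position {position}: entropy {entropy:.3f} bits ({label})")
```

Positions with zero entropy never change across the toy species, which is the signal that a substitution there is more likely to be damaging.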
More recently, advances in computer science have enabled scientists to examine variations on a much larger scale.
Supervised vs unsupervised
The most advanced current models, however, use a supervised learning approach. That is, scientists train an algorithm to recognize disease-causing variants using known variants.
In their work, Frazer, Dias, Notin and their colleagues used an unsupervised approach. In the first step of a two-step procedure, they trained their model on approximately 250 million protein-coding sequences from 140,000 species to learn how tolerant of change a given position is.
In the second step, the model assessed the probability that a given mutation is pathogenic, giving a score between 0 and 1 for each mutation.
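The two-step logic can be illustrated with a deliberately simple stand-in. The article does not describe EVE's internals, so everything below is an assumption made for illustration: step one estimates per-position amino acid frequencies from a toy alignment without using any disease labels, and step two squashes the log-ratio of wild-type to mutant frequency into a score between 0 and 1 with a logistic function. The function names, pseudocount and logistic parameters are all hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def fit_site_frequencies(alignment, pseudocount=1.0):
    """Step 1 (toy): per-position amino acid frequencies, learned without labels."""
    model = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        model.append(
            {aa: (counts.get(aa, 0) + pseudocount) / total for aa in AMINO_ACIDS}
        )
    return model


def variant_score(model, position, wild_type, mutant, midpoint=0.0, steepness=1.0):
    """Step 2 (toy): map the wild-type/mutant log-ratio to a score in (0, 1)."""
    log_ratio = math.log(model[position][wild_type] / model[position][mutant])
    return 1.0 / (1.0 + math.exp(-steepness * (log_ratio - midpoint)))


# Hypothetical toy alignment: the same protein fragment from four species
toy_alignment = ["MKVLA", "MKILA", "MRVLA", "MKVLG"]
model = fit_site_frequencies(toy_alignment)

conserved_site = variant_score(model, 0, "M", "L")  # position never varies
variable_site = variant_score(model, 1, "K", "R")   # R is seen in one species
print(f"M1L: {conserved_site:.3f}  K2R: {variable_site:.3f}")
```

In this toy version a substitution at the invariant position receives a higher score than one that swaps in a residue already seen in another species. A real model would also need to capture dependencies between positions, which this sketch ignores.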
The team used the approach to predict whether mutations in about 3,200 disease-associated human proteins were benign or pathogenic, and compared the model’s output with ClinVar, which describes itself as “a public archive of reports on relationships between variations and human phenotypes, with supporting evidence.”
EVE came to the same conclusions as ClinVar about the pathogenicity of variants in these genes, including a set of genes that ClinVar has labeled as “clinically actionable”.
The team then compared the predictions of EVE with experimental measurements of 40,000 variants in 10 proteins and found that EVE performed on par with wet lab experiments in predicting pathogenicity.
EVE cannot determine why a given mutation is pathogenic. “What emerges from the model is a probabilistic statement, we are not making a causal statement,” Notin said.
For now, the model is looking at single genes. “The combination between the variants”, both within single genes and between different genes, “is something we would like to explore,” Dias told BioWorld Science.
Another avenue for further work is to adapt the model to make predictions from many human sequences, rather than from sequences of many species.
“What we’re modeling is the distribution of the evolving sequences,” Dias said. However, some proteins are specific to humans and some variants, such as some splicing variants, have adverse effects in humans but not in other organisms.
For the vast majority of proteins, however, an evolutionary lens can provide valuable information about their variants. In their article, the team noted that their study is “a small but unusually straightforward demonstration of how the diversity of life on Earth benefits human health.” More than 10% of the organisms the team used in their comparisons are on the International Union for Conservation of Nature’s Red List of Threatened Species, of which 21 are outright extinct and another 10 are extinct in the wild.
“The gradual disappearance of species,” they write, “is a threat to the diversity on which this work is based.”