PopEVE and the Tree of Life: Evolutionary AI for Rare Disease Diagnosis

Authors: IJHSB editorial board

Publish Date: 27.11.2025

Abstract

Missense variants create a major bottleneck in clinical genome interpretation, because most individuals carry many protein-altering substitutions while only a minority substantially affect health. Evolutionary deep learning approaches address this challenge by learning sequence constraints from large protein datasets and using them to score the compatibility of amino acid changes with protein function. PopEVE builds on this principle by combining evolutionary models with human population variation to generate a proteome-wide, calibrated scale of missense deleteriousness. This short communication outlines the conceptual foundations of popEVE, its integration of evolutionary and population data, and early applications to rare disease diagnosis, with particular emphasis on developmental disorders and the methodological and ethical implications of deploying such models in clinical genomics.

Introduction

Genome and exome sequencing are now widely used to investigate rare, often severe, genetic disorders. In many cohorts, however, a substantial fraction of sequenced individuals does not receive a definitive molecular diagnosis, and a large proportion of protein-altering variants are classified as variants of uncertain significance. The difficulty is particularly acute for missense variants, whose effects are subtle, context-dependent and distributed along a spectrum rather than partitioning naturally into “pathogenic” and “benign” classes [1–3].

A diverse ecosystem of variant effect predictors has emerged over the past decade, using conservation metrics, supervised learning on clinical labels, high-dimensional features derived from protein structure, and, more recently, deep neural networks. Benchmarking studies and broad reviews emphasise that although many of these tools perform well on curated datasets, performance can degrade when moving to realistic diagnostic settings that involve ranking many rare variants across the genome, or when analysing individuals from ancestries under-represented in training data [4–9].

One recurrent limitation is the lack of proteome-wide calibration. Many models are trained and evaluated gene by gene; their scores cannot readily be compared across different proteins. Yet clinical interpretation often requires precisely this cross-gene comparison: when a patient harbours multiple rare missense variants, clinicians must decide which variants and which genes are most likely to explain the phenotype [1,4–9].

PopEVE was proposed as a proteome-wide deep generative model that addresses this gap by integrating information from the evolutionary “tree of life” with large-scale human population data to place missense variants on a unified scale of predicted deleteriousness [1].

Evolutionary models of variant effect

The basic premise of evolutionary variant effect prediction is that evolution itself constitutes a vast mutagenesis experiment. Over billions of years, proteins have explored many amino acid substitutions; deleterious changes tend to be purged, while tolerated changes persist. By learning the statistical regularities of protein families from multiple sequence alignments or large sequence databases, deep generative models can approximate the constraints that maintain protein structure and function [2,3,7–9].

EVE (Evolutionary model of Variant Effects) is a representative alignment-based model. It employs a Bayesian variational autoencoder trained on multiple sequence alignments for individual protein families. The model learns a latent distribution over functional sequences and assigns each variant a score reflecting how compatible it is with the learned evolutionary patterns. Across diverse proteins, EVE scores correlate strongly with deep mutational scanning measurements and with clinical labels for known disease genes, even though the model is not trained on any explicit pathogenicity labels [2].

Protein language models such as ESM-1v extend this strategy to unaligned sequence corpora. These models are trained with masked-language objectives on hundreds of millions of protein sequences and estimate the log-likelihood of each amino acid given its sequence context. When evaluated in a zero-shot manner, changes in log-likelihood induced by single amino acid substitutions track experimental fitness measurements and clinical pathogenicity across a wide range of proteins [3,7–9].
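The zero-shot scoring idea can be illustrated with a minimal sketch. It assumes a hypothetical table, `position_probs`, of per-position amino acid probabilities standing in for the output of a masked protein language model. The numbers are invented for illustration and are not ESM-1v outputs; only the log-likelihood ratio between mutant and wild-type residues reflects the approach described above.

```python
import math

# Hypothetical model output: for each (0-based) position of a toy protein,
# the probability assigned to each amino acid when that position is masked.
# All values are invented for illustration only.
position_probs = {
    0: {"M": 0.90, "L": 0.05, "V": 0.05},
    1: {"K": 0.60, "R": 0.35, "D": 0.05},
    2: {"G": 0.98, "A": 0.01, "P": 0.01},
}

wild_type = "MKG"


def log_likelihood_ratio(position: int, mutant: str) -> float:
    """Return log p(mutant) - log p(wild type) at one masked position.

    Negative values mean the model finds the substitution less plausible
    than the wild-type residue in this sequence context.
    """
    probs = position_probs[position]
    wt_residue = wild_type[position]
    return math.log(probs[mutant]) - math.log(probs[wt_residue])


conservative = log_likelihood_ratio(1, "R")  # Lys -> Arg at a tolerant site
disruptive = log_likelihood_ratio(2, "A")    # Gly -> Ala at a constrained site
print(f"K2R: {conservative:.2f}, G3A: {disruptive:.2f}")
```

In this toy example the substitution at the strongly constrained glycine receives a much more negative score than the conservative lysine-to-arginine change, which is the qualitative behaviour the zero-shot evaluations rely on.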

Unsupervised evolutionary models therefore provide informative, label-free scores for missense variants. However, their outputs are typically gene-specific and are not calibrated to reflect the relative severity of variants across different proteins or across the spectrum of human disease [1,2,4–9].

The popEVE framework

PopEVE addresses the calibration problem by combining deep evolutionary scores with human population variation in a single generative framework. Conceptually, it has two layers. First, it aggregates outputs from an alignment-based model (EVE) and an alignment-free protein language model (ESM-1v), which jointly capture “deep” evolutionary constraints across the tree of life. Second, it uses a latent Gaussian process to relate these evolutionary scores to observed missense variation in large human cohorts such as UK Biobank and gnomAD, producing a calibrated score that reflects human-specific constraint [1–5].

Population resources quantifying the mutational constraint spectrum in tens to hundreds of thousands of individuals reveal which genes and positions are highly intolerant of variation [9,10]. PopEVE learns, for each protein, how the distribution of evolutionary fitness scores corresponds to patterns of observed presence or absence of missense variants in these cohorts. It thereby rescales and ensembles the underlying evolutionary models into a continuous “popEVE score” that has a consistent quantitative meaning across proteins [1].
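To make the link between evolutionary scores and population data concrete, the sketch below fits, for a single gene, the probability that a variant is observed in a reference cohort as a function of its evolutionary score. It deliberately swaps popEVE's latent Gaussian process for a plain logistic regression, which is far simpler and is not the published method. The data, the score orientation (higher taken to mean less compatible with evolution) and all function names are assumptions made for illustration.

```python
import math


def fit_logistic(scores, seen, steps=5000, learning_rate=0.1):
    """Fit P(seen | score) = sigmoid(a + b * score) by gradient ascent."""
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, seen):
            p = 1.0 / (1.0 + math.exp(-(a + b * s)))
            grad_a += y - p
            grad_b += (y - p) * s
        # Average the log-likelihood gradient over variants before stepping.
        a += learning_rate * grad_a / n
        b += learning_rate * grad_b / n
    return a, b


# Invented toy data for one gene: evolutionary score and whether the variant
# was observed at least once in a hypothetical reference cohort.
evo_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
seen_in_cohort = [1, 1, 1, 0, 1, 0, 0, 1, 0]

intercept, slope = fit_logistic(evo_scores, seen_in_cohort)


def prob_seen(score: float) -> float:
    """Fitted probability that a variant with this score appears in the cohort."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * score)))


print(f"slope = {slope:.2f}")
print(f"P(seen | 0.1) = {prob_seen(0.1):.2f}, P(seen | 0.9) = {prob_seen(0.9):.2f}")
```

A negative fitted slope indicates that variants the evolutionary model considers damaging are less likely to be seen in the cohort. Comparing how steep this relationship is from gene to gene conveys the intuition behind placing gene-specific scores on a shared, human-calibrated scale.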

A key design principle is that population data are used to recalibrate scores across genes while largely preserving the ranking of variants within each gene. This allows popEVE to interface coherently with existing annotation pipelines, where allele frequency is often treated as an independent line of evidence. On standard benchmarks – including ClinVar-based binary classification and correlations with deep mutational scanning data – popEVE performs comparably to or better than leading variant effect predictors, while offering the additional benefit of proteome-wide comparability [1,6–8].
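The design principle of recalibrating across genes without reordering variants within a gene can be shown with a small sketch. It assumes, purely for illustration, that the recalibration is an increasing affine map with hypothetical per-gene `shift` and `scale` parameters. popEVE's learned transformation is not stated to take this form, and it is described only as largely preserving within-gene ranking.

```python
def rescale(scores: dict, shift: float, scale: float) -> dict:
    """Apply a per-gene increasing affine map; scale must be positive."""
    if scale <= 0:
        raise ValueError("scale must be positive to preserve ranking")
    return {variant: shift + scale * s for variant, s in scores.items()}


def ranking(scores: dict) -> list:
    """Variants ordered from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


# Invented gene-specific scores (assumed: higher = more deleterious).
gene_a = {"A:p.R10W": 0.9, "A:p.K20R": 0.2, "A:p.G30D": 0.6}
gene_b = {"B:p.L5P": 0.8, "B:p.A7V": 0.3}

# Hypothetical calibration parameters: gene A is treated as highly
# constrained in humans, gene B as tolerant.
calibrated_a = rescale(gene_a, shift=1.0, scale=2.0)
calibrated_b = rescale(gene_b, shift=-0.5, scale=0.5)

print(ranking(gene_a) == ranking(calibrated_a))  # within-gene order kept
print(ranking({**calibrated_a, **calibrated_b}))  # cross-gene order changes
```

Within each gene the order of variants is unchanged, but on the shared scale the lowest-scoring variant in the constrained gene now outranks the highest-scoring variant in the tolerant gene. This is the kind of cross-gene comparison that uncalibrated scores cannot support.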

In practical use, rare missense variants from an exome or genome are extracted, popEVE scores are computed for each variant, and variants are ranked by predicted deleteriousness. The same framework can be applied at the gene level by aggregating high-scoring variants, thereby highlighting genes with an excess of predicted damaging variation in specific cohorts [1].
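The workflow in this paragraph can be sketched as a filter, a sort and a per-gene count. The `Variant` record, the allele frequency cut-off of 1e-4, the score threshold and the convention that higher scores are more deleterious are all assumptions chosen for illustration; the orientation and scale of released popEVE scores should be checked before any of this is adapted.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    gene: str
    protein_change: str
    consequence: str
    allele_frequency: float
    score: float  # assumed orientation: higher = more deleterious


def prioritise(variants, max_af=1e-4):
    """Keep rare missense variants and rank them by descending score."""
    rare_missense = [
        v for v in variants
        if v.consequence == "missense" and v.allele_frequency <= max_af
    ]
    return sorted(rare_missense, key=lambda v: v.score, reverse=True)


def gene_burden(ranked, threshold):
    """Count variants per gene whose score meets the threshold."""
    counts = defaultdict(int)
    for v in ranked:
        if v.score >= threshold:
            counts[v.gene] += 1
    return dict(counts)


# Invented example calls from one exome.
variants = [
    Variant("GENE_A", "p.Arg10Trp", "missense", 0.0, 0.95),
    Variant("GENE_B", "p.Lys20Arg", "missense", 0.02, 0.90),   # too common
    Variant("GENE_C", "p.Gln30Ter", "nonsense", 0.0, 0.0),     # not missense; score is a placeholder
    Variant("GENE_A", "p.Gly40Asp", "missense", 0.00001, 0.70),
    Variant("GENE_D", "p.Ala50Val", "missense", 0.00005, 0.10),
]

ranked = prioritise(variants)
burden = gene_burden(ranked, threshold=0.5)
for v in ranked:
    print(v.gene, v.protein_change, v.score)
print(burden)
```

Keeping the frequency filter separate from the score mirrors the point made earlier that allele frequency is usually treated as an independent line of evidence in annotation pipelines.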

Figure 1. Schematic overview of the popEVE workflow for rare disease diagnosis. Evolutionary models (EVE and the protein language model ESM-1v) learn sequence constraints from multiple sequence alignments and protein databases to assign gene-specific fitness scores to missense variants. Human population datasets from large reference cohorts provide allele frequencies and patterns of constraint. A Gaussian process calibration layer combines the evolutionary scores with population information to produce a unified popEVE deleteriousness score across the proteome. In a clinical setting, a patient’s exome or genome is scored with popEVE and the highest-scoring rare missense variants are prioritised for follow-up through segregation analysis, functional studies and expert curation.

Applications in rare disease diagnosis

The initial evaluation of popEVE focused on cohorts with severe developmental disorders, where causal variants are expected to have large effects and where diagnostic yield from sequencing remains incomplete. Using a metacohort of families with such disorders, popEVE identified hundreds of genes showing an excess of highly deleterious de novo missense variants, including a substantial set of genes not previously linked to developmental disorders. Many of these candidate genes display high expression in the developing brain and participate in networks with known neurodevelopmental disease genes, supporting their biological plausibility [1].
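One common way to formalise an "excess" of de novo variants in a gene is to compare the observed count with the number expected under a mutation-rate model, using a Poisson upper-tail probability. The text does not specify which test was used in the popEVE analysis, so the sketch below is a generic illustration of that idea, with invented gene names and counts.

```python
import math


def poisson_upper_tail(observed: int, expected: float) -> float:
    """Return P(X >= observed) for X ~ Poisson(expected)."""
    if observed <= 0:
        return 1.0
    cumulative = sum(
        math.exp(-expected) * expected**k / math.factorial(k)
        for k in range(observed)
    )
    return max(0.0, 1.0 - cumulative)


# Invented per-gene data: (observed high-scoring de novo missense variants,
# expected count under a hypothetical mutation-rate model for this cohort).
cohort_counts = {
    "GENE_A": (6, 0.5),
    "GENE_B": (1, 0.8),
}

p_values = {
    gene: poisson_upper_tail(observed, expected)
    for gene, (observed, expected) in cohort_counts.items()
}
for gene, p in p_values.items():
    print(f"{gene}: p = {p:.2e}")
```

In a real analysis the resulting p-values would need correction for the number of genes tested, and the expected counts would depend on the cohort size and the score threshold used to define "highly deleterious".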

For diagnosis at the individual level, popEVE has been tested in scenarios that mirror clinical workflows, including cases in which a child presents with a severe developmental phenotype and carries a previously unseen protein-altering variant. Reports from early evaluations indicate that in a large proportion of such families, popEVE correctly ranked the known causal missense variant as the most damaging variant in the child’s genome, substantially narrowing the search space for clinicians [1,4,7,11,12].
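The evaluation described here amounts to asking, for each family, where the known causal variant falls when all of the child's scored variants are sorted. A minimal sketch of that top-rank metric follows; the family data and variant names are invented, and higher scores are assumed to mean more deleterious.

```python
def causal_rank(scores: dict, causal: str) -> int:
    """1-based rank of the causal variant when sorting by descending score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return ordered.index(causal) + 1


# Invented examples: (scores for the child's rare missense variants,
# identifier of the known causal variant).
families = [
    ({"v1": 0.97, "v2": 0.40, "v3": 0.10}, "v1"),
    ({"v4": 0.55, "v5": 0.80, "v6": 0.20}, "v5"),
    ({"v7": 0.60, "v8": 0.90}, "v7"),
]

ranks = [causal_rank(scores, causal) for scores, causal in families]
top1_fraction = sum(rank == 1 for rank in ranks) / len(families)
print(ranks, f"top-1 fraction = {top1_fraction:.2f}")
```

Reporting the full distribution of ranks, rather than only the top-1 fraction, shows how much manual review remains in the families where the causal variant is not ranked first.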

Because popEVE calibrates scores across genes and is trained to capture a spectrum of severity, it can distinguish, to some extent, between variants associated with early-onset severe phenotypes and those more compatible with survival into adulthood. This gradation is clinically relevant when multiple plausible candidate variants are present, particularly in genes that have both mild and severe alleles [1,4–8].

An additional strength is performance in “singleton” cases where parental DNA is not available, which is common in many health systems and in low-resource settings. PopEVE does not rely on trio information; instead, it uses proteome-wide calibrated scores to prioritise variants based solely on the proband sequence. Early implementations highlight successful use in clinical contexts outside well-resourced centres, in part because popEVE’s inference stage is computationally modest once the underlying models have been trained [1,11,12].

Beyond diagnosis, popEVE’s gene-level outputs can inform discovery of disease genes and potential therapeutic targets. Genes that harbour an unexpected burden of high-scoring missense variants in affected individuals may represent novel disease loci or pathways involved in particular developmental processes. Such candidates can then be prioritised for functional follow-up in cellular or animal models [1,6–8].

Limitations and perspectives

Despite its promising performance, popEVE has clear limitations. It currently focuses on missense variants in protein-coding regions and does not directly evaluate nonsense, frameshift, splice-disrupting or noncoding variants, although its calibration layer could in principle be extended to incorporate additional variant classes [1–3,7,9]. The development of unified models that jointly compare the severity of missense and truncating variants remains an open research problem.

PopEVE also inherits biases and uncertainties from its input datasets. Evolutionary models are limited by the completeness and diversity of sequence databases, while human population components depend on cohorts that still under-represent many ancestries [1,4,5,7–9]. Benchmarking work on variant effect predictors underscores that models can behave differently across genes, variant types and ancestry groups, and that performance often drops when moving from curated test sets to prospective clinical scenarios [4–8].

Interpretability and trust are additional concerns. PopEVE’s Gaussian process calibration layer improves calibration but does not make the model transparent in a mechanistic sense. Discussions on the clinical use of machine-learning-based variant effect predictors emphasise the need for clear statements of appropriate use, explicit description of training data, rigorous external validation and continuous post-deployment monitoring [4–8]. Variant effect predictors should therefore be treated as one component of a multi-source evidence framework, alongside segregation, phenotype–gene matching, functional assays and expert review, rather than as stand-alone decision tools.

From a methodological perspective, popEVE exemplifies a modular strategy in which improved components – for example, new protein language models, refined evolutionary datasets or more representative population cohorts – can be incorporated without redesigning the entire system. This modularity is aligned with calls for computational tools that can be updated as new data accumulate while maintaining continuity in clinical pipelines [1,4–8].

Future developments are likely to involve integrating structural context, protein–protein interaction networks and gene-regulatory information, extending coverage to noncoding variants and developing more explicit models of penetrance and expressivity. At the same time, issues of equitable access, computational cost and appropriate governance will need to be addressed through technical design and policy frameworks [1,4–8,11,12].

Conclusion

PopEVE demonstrates how deep generative models trained on evolutionary sequence data can be combined with large-scale human population variation to yield a proteome-wide, calibrated measure of missense variant deleteriousness. By integrating information from EVE, protein language models such as ESM-1v and large human reference cohorts such as UK Biobank and gnomAD, popEVE provides scores that are comparable across genes and that perform strongly on clinically oriented benchmarks and cohort analyses [1–5,7–9].

In developmental disorder cohorts and early clinical pilots, popEVE has improved prioritisation of candidate variants, highlighted new disease gene candidates and shown potential utility in settings where trio sequencing or extensive computational infrastructure is not available [1,11,12]. Nevertheless, its outputs must be embedded within broader interpretive frameworks that account for limitations in data, model bias and the complex relationship between protein perturbation and disease phenotypes.

As sequencing becomes more widespread globally, tools like popEVE represent an important step towards scalable, evolution-informed interpretation of rare variation. They also illustrate the broader promise and challenges of using evolutionary AI as a bridge between the history of life and individual patients’ genomes.

References

1. Barcelona Institute of Science and Technology. AI learns from the tree of life to support rare disease diagnosis. Press release. 2025.
2. Orenbuch R, Shearer CA, Kollasch AW, Spinner HD, Hopf TA, van Niekerk L, Franceschi D, Dias M, Frazer J, Marks DS. Proteome-wide model for human disease genetics. Nature Genetics. 2025.
3. Frazer J, Notin P, Dias M, Gomez A, Min JK, Brock K, Gal Y, Marks DS. Disease variant prediction with deep generative models of evolutionary data. Nature. 2021;599:91–95.
4. Meier J, Rao R, Verkuil R, Liu J, Sercu T, Rives A. Language models enable zero-shot prediction of the effects of mutations on protein function. Advances in Neural Information Processing Systems. 2021;34.
5. Dias M, Orenbuch R, Marks DS, Frazer J. Toward trustable use of machine learning models of variant effects in the clinic. American Journal of Human Genetics. 2024.
6. Bromberg Y, Prabakaran R, Kabir A, Shehu A. Variant effect prediction in the age of machine learning. Cold Spring Harbor Perspectives in Biology. 2024.
7. Radjasandirane R, Diharce J, Gelly J-C, de Brevern AG. Insights for variant clinical interpretation based on a benchmark of 65 variant effect predictors. Genomics. 2025;117(3):111036.
8. Livesey BJ, Marsh JA. Updated benchmarking of variant effect predictors using deep mutational scanning. Molecular Systems Biology. 2023;19:e11474.
9. Livesey BJ, Marsh JA. Interpreting protein variant effects with computational predictors and deep mutational scanning. Disease Models and Mechanisms. 2022;15:dmm049510.
10. Karczewski KJ, Francioli LC, Tiao G, Cummings BB, Alföldi J, Wang Q, Collins RL, Laricchia KM, Birnbaum DP, Ganna A, et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature. 2020;581:434–443.
11. Lek M, Karczewski KJ, Minikel EV, Samocha KE, Banks E, Fennell T, O’Donnell-Luria AH, Ware JS, Hill AJ, Cummings BB, et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature. 2016;536:285–291.
12. Broad Institute of MIT and Harvard. New AI model could speed rare disease diagnosis. News release. 2025.