How AlphaGenome Tackles Variant Effect Prediction



The human genome has 3 billion base pairs. Less than 2% of them code for proteins 1. The remaining 98% contains regulatory instructions that control when and where genes turn on 2. Most disease-associated variants identified by genome-wide association studies sit in these non-coding regions, affecting gene expression rather than protein sequence 3. The problem: given a single-nucleotide change somewhere in those 3 billion letters, predict what breaks.

This is variant effect prediction (VEP). Each person carries roughly 4 to 5 million single nucleotide polymorphisms 4. Scientists have catalogued over 600 million globally 5. Most have no effect on health 6, but the ones that matter can alter regulatory logic in ways no short-context model can detect.

AlphaGenome, published in Nature in January 2026 by Google DeepMind, changes the calculus. It processes up to 1 million DNA base pairs and outputs predictions about thousands of molecular properties related to gene regulation 7. It scored state-of-the-art on 22 of 24 DNA sequence prediction tasks and matched or exceeded top-performing models on 25 of 26 variant-effect evaluations 8. Variant scoring takes roughly one second 9.

AlphaGenome processes 1M base pairs of DNA sequence to predict thousands of molecular properties. Source: Google DeepMind, "AlphaGenome: AI for better understanding the genome".

Why Context Length Matters

Gene regulation operates over enormous distances. Enhancers can activate gene expression from thousands or even millions of base pairs away 10. If your model only sees a small window, it will miss these interactions entirely.

SpliceAI, one of the strongest specialized tools, evaluates 10,000 nucleotides of flanking sequence context 11 and achieves 95% accuracy predicting splice junctions 12. Excellent for local splicing, but blind to long-range regulation. Enformer pushed context to 200,000 base pairs, five times Basenji2's 40,000 base-pair capability 13 14. It was significantly more accurate at predicting variant effects on gene expression 15, but 200kb still covers only a fraction of known regulatory distances.
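
To make those numbers concrete, here is a toy calculation (not taken from any of these models' code) that checks whether a hypothetical enhancer-promoter pair even fits inside a window centered on the gene:

def pair_visible(enhancer_pos: int, tss_pos: int, window_bp: int) -> bool:
    # True if the enhancer falls inside a window of window_bp centered on the TSS.
    return abs(enhancer_pos - tss_pos) <= window_bp // 2

# Hypothetical enhancer 150 kb upstream of its target gene's transcription start site.
tss, enhancer = 1_000_000, 850_000
for name, window in [("SpliceAI", 10_000), ("Enformer", 200_000), ("AlphaGenome", 1_000_000)]:
    status = "sees the pair" if pair_visible(enhancer, tss, window) else "misses it"
    print(f"{name:>12} ({window:>9,} bp window): {status}")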

SpliceAI processes 10,000 nucleotides of flanking context per position. SpliceAI's 10k context window captures local splice signals but misses long-range regulatory interactions. Source: Illumina / Jaganathan et al., Cell 2019.

Foundation models pushed context windows further. HyenaDNA reached 1 million tokens at single nucleotide resolution 16, training 160x faster than Transformer-based approaches 17. DNABERT-2 achieved comparable performance with 21x fewer parameters and approximately 92x less GPU time 18. But a benchmarking study found that current genomic language models do not offer substantial advantages over conventional machine learning approaches that use one-hot encoded sequences 19. Task-specific supervised training may still be necessary for high-performance regulatory genomics prediction 20.

AlphaGenome takes the supervised route. Rather than pre-train and hope the model captures regulatory grammar, it trains directly on experimental data from ENCODE 21 and predicts across all major regulatory modalities in a single pass.

Architecture: Three Components at Scale

AlphaGenome's design follows from two requirements: "long sequence context is important for covering regions regulating genes from far away and base-resolution is important for capturing fine-grained biological details" 22. The architecture combines:

  • Convolutional layers for detecting short genomic patterns like transcription factor binding motifs and splice signals 23
  • Transformer layers for communicating information across sequence positions, capturing long-range enhancer-promoter interactions 24
  • Distributed computation across multiple TPUs to make the 1M base-pair input tractable 25

Training required four hours and consumed half the computational resources used for the earlier Enformer model 26. More context, more output modalities, less compute.
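
The division of labor is easy to picture with a toy, untrained numpy sketch: a short filter bank scans for local motifs, then a self-attention step lets every position read from every other position. This illustrates the convolution-plus-attention pattern, not AlphaGenome's actual layers.

import numpy as np

def conv1d_motif_scan(seq_onehot, filters):
    # seq_onehot: (L, 4) one-hot DNA; filters: (num_filters, width, 4) motif detectors.
    # Returns (L - width + 1, num_filters) local motif activation scores.
    num_filters, width, _ = filters.shape
    L = seq_onehot.shape[0]
    out = np.zeros((L - width + 1, num_filters))
    for i in range(L - width + 1):
        window = seq_onehot[i:i + width]
        out[i] = np.tensordot(filters, window, axes=([1, 2], [0, 1]))
    return out

def self_attention(x):
    # Single-head self-attention with identity projections: every position attends
    # to every other, the mechanism used to capture long-range interactions.
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)                      # (L, L) pairwise affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over positions
    return weights @ x

rng = np.random.default_rng(0)
seq = np.eye(4)[rng.integers(0, 4, size=200)]          # 200 bp of random one-hot DNA
filters = rng.normal(size=(8, 12, 4))                  # eight random 12-bp "motifs"
local_features = conv1d_motif_scan(seq, filters)       # convolution: local patterns
context_features = self_attention(local_features)      # attention: long-range mixing
print(local_features.shape, context_features.shape)    # (189, 8) (189, 8)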

AlphaGenome architecture: DNA input flows through convolutional layers, transformer layers, and task-specific prediction heads. AlphaGenome processes 1M base pairs through convolutional layers (local pattern detection), transformer layers (long-range context via multi-head self-attention across TPUs), and six task-specific prediction heads to produce single-nucleotide resolution output across 7,000+ genomic tracks. Source: Google DeepMind, AlphaGenome blog.

For the first time, the model explicitly predicts RNA splice-junction locations and expression levels directly from DNA sequence 27. This matters because cryptic splice variants comprise approximately 9-11% of pathogenic mutations in neurodevelopmental disorders 28, and previous tools either ignored splicing or handled it with a separate model.

Variant Scoring: Feed In, Change, Subtract

The scoring workflow is simple in concept. Feed the reference sequence through the model. Feed the same sequence with one nucleotide changed. Subtract the output vectors. The difference across all 7,000+ tracks is the variant effect profile. The system evaluates a variant's impact across all predicted properties within approximately one second 9.

Delta-score variant effect scoring shown for a pathogenic MYBPC3 variant. Delta-score calculation illustrated with SpliceAI's assessment of a pathogenic MYBPC3 variant. AlphaGenome extends this principle across 7,000+ prediction tracks simultaneously. Source: Illumina / Jaganathan et al., Cell 2019.
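
Below is a minimal sketch of that workflow using the alphagenome Python client. The variant coordinates are illustrative, only RNA-seq tracks are requested for brevity (other output types can be added to the same call), and the field names reflect the published client as of this writing.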

from alphagenome.data import genome
from alphagenome.models import dna_client
import numpy as np

model = dna_client.create("YOUR_API_KEY")

variant = genome.Variant(
    chromosome="chr1", position=47_683_800,
    reference_bases="T", alternate_bases="C",  # TAL1 regulatory SNP (illustrative coordinates)
)
# Center a supported ~1 Mb window (2**20 bp) on the variant.
interval = variant.reference_interval.resize(dna_client.SEQUENCE_LENGTH_1MB)

# One call returns predictions for both the reference and alternate alleles.
outputs = model.predict_variant(
    interval=interval,
    variant=variant,
    requested_outputs=[dna_client.OutputType.RNA_SEQ],  # RNA-seq tracks only, for brevity
)

ref_tracks = outputs.reference.rna_seq.values  # shape: (seq_len, num_tracks)
alt_tracks = outputs.alternate.rna_seq.values
delta = np.mean(np.abs(ref_tracks - alt_tracks), axis=0)  # per-track effect size

# Rank tracks by how strongly the variant perturbs them.
track_names = outputs.reference.rna_seq.metadata["name"].tolist()
top5_idx = np.argsort(delta)[::-1][:5]
for rank, idx in enumerate(top5_idx, 1):
    print(f"  {rank}. {track_names[idx]:.<50s} delta={delta[idx]:.4f}")

The model demonstrated clinical relevance by predicting that certain cancer-associated mutations would activate the TAL1 gene by introducing a MYB DNA binding motif, successfully replicating the known disease mechanism 29. Not just a pathogenicity score, but an explanation of how the variant acts.

How AlphaGenome Compares

The comparison set includes specialized models that each excel at one modality. Here is how the field stacks up:

| Model | Context | Key Capability | Trade-off |
| --- | --- | --- | --- |
| SpliceAI | 10k nt | 95% splice junction accuracy 12 | Splicing only, no long-range |
| Enformer | 200k bp | Expression + chromatin VEP 15 | No splicing, limited context |
| HyenaDNA | 1M tokens | Sub-quadratic, 30M params 30 | Foundation model, not VEP-specific |
| DNABERT-2 | Variable | 21x fewer params, 92x less GPU 18 | General tasks, short context |
| AlphaMissense | Protein-level | 89% of 71M missense variants classified 31 | Coding regions only (2% of genome) |
| AlphaGenome | 1M bp | 7,000+ tracks, all modalities, ~1s/variant 7 9 | TPU-dependent, API-only access |

AlphaGenome's key differentiator: it is the only model that jointly predicts all assessed regulatory modalities. AlphaMissense covers the 2% of the genome encoding proteins. AlphaGenome covers the other 98%, the non-coding regulatory regions containing numerous disease-linked variants 32.

AlphaGenome benchmark results across DNA sequence and variant effect prediction tasks. AlphaGenome's relative improvements on DNA sequence and variant effect prediction tasks compared to existing state-of-the-art methods. Source: Google DeepMind.

Limitations

AlphaGenome is honest about its boundaries. Capturing very distant regulatory elements, beyond 100,000 DNA letters away, remains challenging 33. The 1M base-pair context window is large but not infinite. Some enhancers operate across megabase distances.

The API throughput caps matter too. Query rates fluctuate with demand, so the service suits medium-scale analyses of up to thousands of predictions rather than million-scale studies 34. Scoring all 4-5 million SNPs in one genome is not yet practical through the API.
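
For work in that medium-scale range, one plausible pattern is to score variants sequentially, pausing between calls and backing off on errors. This is a sketch under the same client-interface assumptions as the example above, not an official rate-limiting recipe:

import time
import numpy as np
from alphagenome.models import dna_client

def score_variants(model, variants, pause_s=1.0, max_retries=3):
    # Sequentially score a modest batch of genome.Variant objects, returning one
    # scalar per variant: the mean absolute RNA-seq delta across all tracks.
    scores = []
    for variant in variants:
        interval = variant.reference_interval.resize(dna_client.SEQUENCE_LENGTH_1MB)
        for attempt in range(max_retries):
            try:
                out = model.predict_variant(
                    interval=interval,
                    variant=variant,
                    requested_outputs=[dna_client.OutputType.RNA_SEQ],
                )
                diff = out.reference.rna_seq.values - out.alternate.rna_seq.values
                scores.append(float(np.abs(diff).mean()))
                break
            except Exception:                       # quota hit or transient error
                time.sleep(pause_s * 2 ** attempt)  # simple exponential backoff
        time.sleep(pause_s)                         # pause between variants
    return scores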

The access model also limits flexibility. The released code is Apache 2.0 licensed and the documentation is CC-BY 4.0 35, but the model itself runs on DeepMind infrastructure. Researchers who need to run ablations, fine-tune for new cell types, or process millions of variants face constraints that open-weight models like HyenaDNA or DNABERT-2 do not impose.

Comparison of MaxEntScan vs. SpliceAI predictions for the CFTR gene. Traditional vs. deep learning splice prediction for the CFTR gene. SpliceAI (bottom) provides far more accurate exon boundary prediction than MaxEntScan (top), illustrating the trajectory AlphaGenome extends further. Source: Illumina / Jaganathan et al., Cell 2019.

What This Means

For the first time, a single model covers long-range context (1M bp), multi-modal output (7,000+ tracks), and fast inference (~1 second per variant). A clinical geneticist can take a non-coding variant of uncertain significance, score it against splicing, expression, chromatin, and transcription factor binding predictions in one query, and get a mechanistic hypothesis for how the variant might cause disease.
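
In code, that single query is one predict_variant call with several output types requested together. The snippet below reuses model, interval, and variant from the scoring example above; the OutputType names follow the published client and should be read as assumptions rather than a verified list:

# One request, several regulatory modalities (enum names assumed from the client).
outputs = model.predict_variant(
    interval=interval,
    variant=variant,
    requested_outputs=[
        dna_client.OutputType.RNA_SEQ,       # expression
        dna_client.OutputType.SPLICE_SITES,  # splicing
        dna_client.OutputType.DNASE,         # chromatin accessibility
        dna_client.OutputType.CHIP_TF,       # transcription factor binding
    ],
)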

The tool helps scientists understand how single-letter mutations and distant DNA regions influence gene activity, shaping health and disease risk 36. The limitations are real: the >100 kb long-range blind spot, API throughput caps, and the lack of local deployment all constrain adoption. But the 98% of the genome that does not encode proteins is no longer invisible to deep learning.

AlphaGenome is one model, not the last model.


References


  1. Avsec et al., Google DeepMind Enformer blog.

  2. Avsec et al., Google DeepMind Enformer blog.

  3. NHGRI, Gene Expression and Regulation.

  4. NHGRI, Single Nucleotide Polymorphisms.

  5. NHGRI, Single Nucleotide Polymorphisms.

  6. NHGRI, Single Nucleotide Polymorphisms.

  7. Google DeepMind, AlphaGenome blog post, January 2026. https://deepmind.google/blog/alphagenome-ai-for-better-understanding-the-genome/

  8. Google DeepMind / Avsec et al., Nature (2026). DOI: 10.1038/s41586-025-10014-0

  9. Google DeepMind, AlphaGenome blog.

  10. NHGRI, Gene Expression and Regulation.

  11. Jaganathan et al., Cell (2019).

  12. Jaganathan et al., Cell (2019).

  13. Avsec et al., Nature Methods (2021).

  14. Avsec et al., Nature Methods (2021).

  15. Avsec et al., Nature Methods (2021).

  16. Nguyen et al., "HyenaDNA," NeurIPS (2023).

  17. Nguyen et al., "HyenaDNA," NeurIPS (2023).

  18. Ji et al., "DNABERT-2," ICLR (2024).

  19. Tang, Somia, Yu, and Koo, bioRxiv (2024). DOI: 10.1101/2024.02.29.582810

  20. Tang, Somia, Yu, and Koo, bioRxiv (2024).

  21. ENCODE Project.

  22. Google DeepMind, AlphaGenome blog.

  23. Google DeepMind, AlphaGenome blog.

  24. Google DeepMind, AlphaGenome blog.

  25. Google DeepMind, AlphaGenome blog.

  26. Google DeepMind, AlphaGenome blog.

  27. Google DeepMind, AlphaGenome blog.

  28. Jaganathan et al., Cell (2019).

  29. Google DeepMind, AlphaGenome blog.

  30. Nguyen et al., "HyenaDNA," NeurIPS (2023).

  31. Cheng et al., "AlphaMissense," Science (2023).

  32. Google DeepMind, AlphaGenome blog.

  33. Google DeepMind, AlphaGenome blog.

  34. Google DeepMind, AlphaGenome GitHub. https://github.com/google-deepmind/alphagenome

  35. Google DeepMind, AlphaGenome GitHub.

  36. Saey, T.H., Science News, January 28, 2026.
