What is MIMIC and how does it differ from models like AlphaFold or ESM-2?

MIMIC is a roughly 1-billion-parameter generative multimodal foundation model for biomolecules. Unlike single-modality models, it conditions on arbitrary subsets of observed modalities (sequence, structure, evolution, regulation, and experimental context) and reconstructs the missing ones across genome, transcriptome, and proteome scales.

Does multimodal conditioning actually improve predictions?

Yes, but selectively. MIMIC's ablations show that conditioning on additional modalities beats sequence-only inputs, with the largest gains on targets that are hard to predict from sequence alone and little benefit where the sequence already determines the answer. The honest read is that multimodality helps most exactly where you would expect it to.

How does MIMIC perform on splicing prediction?

MIMIC reaches state-of-the-art splicing prediction with isoform-aware inference, improving area under the precision-recall curve over AlphaGenome, SpliceAI, and NTv3 at both gene and transcript level. Coding-region gains are modest at a few percent, while the large non-coding multiples are gains over a weaker baseline on a rare-event metric.

What is MIMIC's split-track encoder-decoder architecture?

MIMIC pairs a 20-layer encoder of about 566 million parameters with a 12-layer decoder, using a hidden width of 1536, 24 attention heads, register tokens, and rotary position embeddings. The encoder reads whatever modalities are observed and the decoder reconstructs the masked or missing targets, trained across roughly 25 heuristic input and target pathways.

Can MIMIC design corrective RNA edits or protein binders?

MIMIC proposes candidate corrective RNA edits for clinical mutations and designs protein binders, reaching high-confidence iPTM for 35 of 37 PD-L1 designs. The results are in-silico and uneven, with only 2 of 20 hACE2 binders passing, so they are starting points for experimental validation rather than validated therapeutics.

MIMIC: One 1B Model for the Central Dogma

What is the LORE dataset that MIMIC is trained on?

LORE is the aligned multimodal dataset behind MIMIC. It links nucleic-acid, protein, evolutionary, structural, regulatory, and semantic modalities within partially observed biomolecular states, spanning roughly 15.5 million proteins, 13 million RNA transcripts, and more than 6,000 organisms.

A 1-billion-parameter model conditioned on RNA chemical-probing reactivity folds a held-out transcript to an F1 of 0.987 against an experimentally guided reference. The same transcript folded from sequence alone scores 0.404.¹ Across 1,000 held-out RNAs the improvement is robust, with a one-sided paired Wilcoxon signed-rank test at P = 6.25e-47.² That is the cleanest single demonstration of what MIMIC is built to do: treat sequence, structure, regulation, evolutionary constraint, and experimental context as different observations of one underlying molecular state, then fill in whichever observation you withhold.

MIMIC, from Golkar et al. at Polymathic AI, is a generative multimodal foundation model posted to arXiv on 27 April 2026.³⁴ It trains on a new aligned dataset called LORE, which links nucleic acid, protein, evolutionary, structural, regulatory, and semantic modalities within partially observed biomolecular states.³ A split-track encoder-decoder conditions on arbitrary subsets of observed modalities and reconstructs or generates the missing components across genome, transcriptome, and proteome.⁵ The headline is not the parameter count. The authors state the thesis plainly: at a comparable size to single-modality foundation models, around 1 billion parameters, MIMIC demonstrates the value of diverse and heterogeneous data as an alternative to pure scaling.⁶ If that holds, it is a different growth axis than the sequence-token-count race that produced Evo 2 and the ESM line.

Figure 1: The MIMIC framework. (A) heterogeneous molecular data spanning genomic, transcriptomic, and proteomic modalities; (B) the LORE dataset aligning these into partially observed examples; (C) the split-track encoder-decoder; (D) representation, generative, and design applications. Source: Golkar et al., MIMIC (arXiv:2604.24506, 2026).

The problem: forward predictors model the wrong distribution

You already know the zoo. ESM and ESM-2 for protein sequence, RNA-FM for non-coding RNA, SpliceAI for splice sites, AlphaGenome for regulatory genomics, AlphaFold for structure. Each is excellent inside its lane. The field has accumulated, in the authors' words, single-modality models that master one molecular modality but lack the capacity to operate across the broader central dogma.⁷

MIMIC's critique is sharper than "models are siloed." It is that forward predictors model the wrong object. By mapping a sequence to a specific phenotype, these models fail to capture the full joint distribution of biological states; to understand a system, the argument goes, you model the full joint probability distribution of its molecular states, not just the forward flow of information.⁸⁹ That reframing is what makes inverse problems tractable. Given a desired protein structure, mRNA stability, and splicing pattern, what upstream nucleotide sequence most likely produced them?¹⁰ A forward predictor cannot answer that; a generative model over the joint distribution can, by sampling the conditional.

This is a different bet than AlphaFold3, which is also a unified framework but a forward, structure-focused one: it predicts the joint structure of complexes including proteins, nucleic acids, small molecules, ions, and modified residues.¹¹ AlphaFold3 unifies outputs, the structures of complexes. MIMIC unifies both inputs and outputs across the central dogma, and it is generative over sequence as well as structure. It also departs from the sequence-only RNA and protein language models it competes with. RNA-FM, for instance, is a 12-layer transformer trained on 23 million non-coding RNA sequences through self-supervised learning.¹² MIMIC's question is what happens when you stop treating non-sequence signals as downstream tasks and start treating them as conditioning inputs.

LORE: the dataset is the contribution

The data is the part that took the most work. LORE holds roughly 15.5 million proteins and 13 million RNA transcripts from over 6,000 organisms spanning all domains of life, plus over 4 billion tokens of biomedical and contextual text.¹³ Those are post-clustering numbers. Before deduplication, the nucleic-acid side spans over 25 million transcripts and 123 billion nucleotides, from full human and mouse genomes with GENCODE annotations plus NCBI RefSeq for the rest.¹⁴¹⁵

The central design principle is alignment, not volume, which distinguishes LORE from sequence-only corpora such as OpenGenome2 or BFD that do not provide aligned or multimodal samples.¹⁶ Each row is an observation of a transcript, a protein, or a pairing of the two, containing whatever subset of modalities is available; mapping UniProt IDs to RefSeq/Ensembl IDs yields a core subset of roughly 2 million aligned transcript/protein pairs.¹⁷¹⁸ Because real data is incomplete, LORE is intentionally partially observed: no example is required to contain all modalities, which across the dataset produces about 150 distinct modality presence signatures.¹⁹²⁰ That partial observation is not a flaw worked around later; it is the training signal, and it dictates the architecture below.
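To make "modality presence signature" concrete, here is a minimal sketch that counts signatures over a handful of rows. The row layout and the modality names (`nt_seq`, `phylop`, `aa_seq`, `backbone`) are illustrative assumptions, not LORE's actual schema.

```python
from collections import Counter

# Hypothetical rows: each maps a modality name to its value and simply
# omits modalities that were never observed for that molecule.
rows = [
    {"nt_seq": "AUGGCU", "phylop": [0.1, 2.3, 1.1, 0.0, 0.4, 1.9]},
    {"nt_seq": "AUGCCA"},
    {"aa_seq": "MA", "backbone": [17, 902]},
    {"nt_seq": "AUGGCA", "aa_seq": "MA"},
    {"nt_seq": "GGGAAA"},
]

def presence_signature(row):
    """Return the sorted tuple of modalities observed in one row."""
    return tuple(sorted(k for k, v in row.items() if v is not None))

# Count how many rows share each signature.
signature_counts = Counter(presence_signature(r) for r in rows)
for signature, count in signature_counts.most_common():
    print(count, signature)
```

Five toy rows already yield four distinct signatures, which gives a feel for how about 150 can arise across a dataset with more than a dozen optional tracks.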

The per-modality counts make the raggedness concrete.

Track	Modality	Samples in LORE
Nucleic acid	Nucleic acid sequence	12,967,153²¹
Nucleic acid	Promoter usage (CAGE)	15,073,298²²
Nucleic acid	Chromatin accessibility (ATAC-seq)	14,021,301²²
Nucleic acid	RNA chemical probing (RASP2)	1,644,404²³
Nucleic acid	Evolutionary conservation (phyloP)	518,480²⁴
Protein	AA sequence / DSSP / backbone / SASA	15,607,838 each²⁵
Protein	Chemical surface (MaSIF)	1,671,206²⁶
Protein	Protein abundance	1,803,028²⁶
Protein	Functional captions	176,626²⁶
Semantic	Biomedical corpus	3,797,568 (4B tokens)²⁷

One to two orders of magnitude separate the common modalities (sequence, amino acids, epigenetic peaks) from the scarce ones (phyloP, MaSIF, functional captions). A model requiring all modalities present would train on the intersection, which is nearly empty. Two provenance details matter downstream. phyloP conservation comes from 100-way alignments for human and 30-way for mouse, capturing lineage-specific constraint.²⁸ MaSIF surface fingerprints were computed on only about 10% of structures because of compute cost, roughly 50,000 CPU hours for 1.6M structures. That is why the surface modality is roughly one order of magnitude smaller than the protein sequence count.²⁹

Leakage control is handled with MMseqs2 clustering, run with -c 0.8 and --min-seq-id 0.3 or 0.7.³⁰ Protein clustering at 30% identity collapses 299,566,796 members into 42,943,944 clusters; RNA collapses 25,001,445 into 10,861,904.³¹ Splits are made at the cluster level so no validation sequence shares significant similarity with training, with 30% clusters defining train/validation/test splits and 70% clusters ensuring diversity.³² For computational biologists, that is the line that determines whether the benchmark numbers below are believable, and it is the right line to draw. Tertiary structure is tokenized with the ESM3 structure tokenizer, a VQ-VAE with 4,096 classes based on local backbone geometry, applied to roughly 200M AlphaFoldDB v4 structures filtered to pLDDT greater than 70.³³³⁴
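The cluster-level splitting rule is easy to get wrong in practice, so a minimal sketch may help. It assumes you already have (sequence ID, cluster ID) pairs from a clustering run; the function name, the split fractions, and the toy IDs are hypothetical and are not taken from the paper.

```python
import hashlib

def split_for_cluster(cluster_id, val_frac=0.05, test_frac=0.05):
    """Deterministically map a cluster ID to train/val/test.

    Hashing the cluster ID (not the sequence ID) keeps every member of a
    cluster in the same split, which is the point of cluster-level splitting.
    The fractions are illustrative assumptions.
    """
    digest = hashlib.sha256(cluster_id.encode("utf-8")).hexdigest()
    u = int(digest[:8], 16) / 0x100000000  # pseudo-uniform value in [0, 1)
    if u < test_frac:
        return "test"
    if u < test_frac + val_frac:
        return "val"
    return "train"

# Hypothetical (sequence_id, cluster_id) pairs.
membership = [("seqA", "c1"), ("seqB", "c1"), ("seqC", "c2"), ("seqD", "c3")]
splits = {seq: split_for_cluster(cluster) for seq, cluster in membership}
```

Splitting on sequence IDs instead would scatter near-duplicates across train and test, which is exactly the leakage the 30% identity clusters are meant to prevent.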

Architecture: split-track summation and an asymmetric window

The core engineering problem is that biological modalities have wildly different lengths and alignment structures: a single-value abundance measurement, a 10,000-nucleotide transcript, and a backbone track for a 400-residue protein all coexist in one model. MIMIC's answer is a split-track architecture, built on the 4M/AION encoder-decoder backbone, that sums aligned modalities while concatenating distinct molecular entities.³⁵³⁶

Figure 2: MIMIC's split-track architecture. Left: token inputs grouped into a nucleic-acid track (sequence + structural data + conservation, element-wise summed), a protein track (MaSIF surface + amino acids + backbone, element-wise summed), plus register, taxonomy, and context tokens. Right: the masked encoder-decoder, in which the encoder processes the unified token sequence plus register tokens, random tokens are dropped, and a cross-attention decoder reconstructs target modalities. Source: Golkar et al., MIMIC (arXiv:2604.24506, 2026), Supplementary Figure S1.

The nucleic-acid track element-wise sums sequence embeddings, structural data, and conservation scores into one representation per position; the protein track does the same for surface features, amino-acid residues, and backbone structure.³⁷ Because aligned signals are summed rather than concatenated, total length stays tractable no matter how many modalities are present. A per-position phyloP score and the nucleotide at that position occupy the same token slot rather than doubling the sequence.³⁸
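A toy sketch of the length argument: summing aligned per-position embeddings keeps the token count fixed, while concatenating them multiplies it. The tiny width of 2 and the embedding values are made up for illustration; this uses plain Python lists rather than whatever tensor library the authors use.

```python
def sum_aligned(*tracks):
    """Element-wise sum of per-position embeddings that share one shape (L x d)."""
    length = len(tracks[0])
    if any(len(t) != length for t in tracks):
        raise ValueError("aligned modalities must share the same length")
    return [
        [sum(values) for values in zip(*position)]  # sum each embedding dimension
        for position in zip(*tracks)                # walk positions in lockstep
    ]

# Toy embeddings: 3 positions, width 2 (the article gives a real width of 1,536).
seq_emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
phylop_emb = [[0.5, 0.5], [0.1, 0.2], [0.0, 0.3]]
probe_emb = [[0.0, 0.1], [0.2, 0.0], [0.4, 0.4]]

summed = sum_aligned(seq_emb, phylop_emb, probe_emb)  # still 3 tokens
concatenated = seq_emb + phylop_emb + probe_emb       # 9 tokens
```

With attention cost growing quadratically in token count, a threefold longer sequence costs roughly nine times more per layer.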

The encoder processes these track groups alongside semantic and contextual token groups and a set of learnable register tokens, which act as information sinks pooling global state into a compact vector summary.³⁹⁴⁰ All text-based modalities (context, functional captions, taxonomy) share one semantic embedding space with the biomedical corpus.⁴¹ Positional encoding uses RoPE with a local group-reset strategy: the position index restarts at zero for each track group, so attention reflects within-track distances rather than absolute offsets. That is the detail that makes summing two differently-scaled tracks coherent.⁴²
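The group-reset idea can be sketched in a few lines: position indices restart at zero for every track group instead of running across the whole input. Treating the register tokens as their own group, and the group sizes used here, are assumptions for illustration.

```python
def group_reset_positions(group_lengths):
    """Position index per token, restarting at zero for each track group."""
    positions = []
    for length in group_lengths:
        positions.extend(range(length))
    return positions

# Hypothetical layout: 5 register tokens, a 6-token nucleic-acid track,
# and a 4-token protein track.
positions = group_reset_positions([5, 6, 4])
print(positions)  # [0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 5, 0, 1, 2, 3]
```

These indices would then feed the rotary embedding, so the first nucleotide and the first residue both sit at position zero regardless of where their tracks fall in the concatenated input.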

The decoupling of observed context from predicted output is the design's center of gravity. MIMIC uses an asymmetric context window, 10,000 tokens for the encoder and 1,000 for the decoder.⁴³ The wide encoder receptive field ingests arbitrary observed context; the focused 1,000-token cross-attention decoder queries the encoder's latent to reconstruct or generate any target subset, with no architectural modification or task-specific output heads.⁴⁴ That makes "condition on any subset, generate any subset" a single forward pass rather than a family of fine-tuned models. The cross-attention bottleneck descends from Perceiver IO, which replaces the usual modality-specific encoder plus fusion plus readout recipe with a flexible querying mechanism.⁴⁵

The full configuration: 20 encoder layers, 12 decoder layers, hidden width 1,536, 24 attention heads per stack, 5 register tokens, RoPE fraction 0.75.⁴⁶ Self-attention uses a 50% bidirectional, 25% causal, 25% anti-causal mask, following the finding that causal attention can outperform bidirectional on certain tasks.⁴⁷ Training ran by default across 92 NVIDIA H200 GPUs over NCCL, using AdamW (betas 0.9/0.95, weight decay 0.05), cosine decay, and bfloat16.⁴⁸⁴⁹ Only the 566M-parameter encoder generates embeddings for downstream tasks, the relevant size when comparing representation quality against other models.⁵⁰
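As a sanity check on those numbers, a back-of-envelope parameter count is sketched below. It assumes standard transformer blocks with a 4x MLP expansion and counts only the large weight matrices, ignoring biases, norms, embeddings, and output heads; none of those details are confirmed by the source, so a close match is suggestive rather than conclusive.

```python
def block_params(d_model, cross_attention=False, mlp_ratio=4):
    """Weight-matrix parameters in one transformer block (biases and norms ignored)."""
    attention = 4 * d_model * d_model        # Q, K, V, and output projections
    mlp = 2 * mlp_ratio * d_model * d_model  # up- and down-projection (assumed 4x)
    cross = 4 * d_model * d_model if cross_attention else 0
    return attention + mlp + cross

D_MODEL = 1536
encoder = 20 * block_params(D_MODEL)
decoder = 12 * block_params(D_MODEL, cross_attention=True)

print(f"encoder ~ {encoder / 1e6:.0f}M")                 # encoder ~ 566M
print(f"decoder ~ {decoder / 1e6:.0f}M")                 # decoder ~ 453M
print(f"total   ~ {(encoder + decoder) / 1e9:.2f}B")     # total   ~ 1.02B
```

Under these assumptions the estimate lands on the quoted 566M encoder and on roughly 1B overall.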

Training: 25 pathways, a curriculum, and crop-versus-drop

There is no monolithic loss. Training is a stochastic sampling process over roughly 25 heuristic pathways, each specifying a required and optional set of input and target modalities.⁵¹ This is how MIMIC handles about 150 presence signatures without enumerating them: on each step a pathway is sampled, inputs are chosen from available modalities, and the model reconstructs the targets. The same machinery yields "predict masked sequence from sequence plus conservation" and "generate amino-acid sequence from backbone plus surface" without separate heads.
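The sampling loop can be sketched as follows. The two pathway definitions, the modality names, and the 50% inclusion probability for optional inputs are hypothetical stand-ins; the paper's roughly 25 pathways are not reproduced here.

```python
import random

# Hypothetical pathways, each with required inputs, optional inputs, and targets.
PATHWAYS = [
    {"name": "nt_inpaint", "required": {"nt_seq"},
     "optional": {"phylop", "atac", "cage"}, "targets": {"nt_seq"}},
    {"name": "inverse_fold", "required": {"backbone"},
     "optional": {"masif"}, "targets": {"aa_seq"}},
]

def sample_task(available, rng):
    """Pick a pathway the example can support, then choose inputs and targets."""
    eligible = [
        p for p in PATHWAYS
        if p["required"] <= available and p["targets"] <= available
    ]
    if not eligible:
        return None
    pathway = rng.choice(eligible)
    optional = sorted(pathway["optional"] & available)
    extras = {m for m in optional if rng.random() < 0.5}  # assumed inclusion rate
    return {
        "pathway": pathway["name"],
        "inputs": pathway["required"] | extras,
        "targets": pathway["targets"],
    }

rng = random.Random(0)
available = {"nt_seq", "phylop", "cage"}
task = sample_task(available, rng)
```

An example with a rare signature simply matches fewer pathways, which is how partial observation is handled without enumerating every combination.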

Three engineering choices encode real biology.

Context curriculum. A staged schedule scales the encoder window from 1k to 2k to 4k to 8k to 10k tokens, so the model learns local feature associations before resolving long-range regulatory and structural dependencies.⁵²
Crop versus drop. RNA and DNA sequences are randomly cropped to fit the current window, which stays valid because RNA folding is largely driven by local base pairing. Proteins are dropped, not cropped: a truncated protein cannot reliably predict structure because long-range contacts go missing, so any protein whose structure track exceeds the budget is discarded entirely.⁵³⁵⁴
Target packing. Many biological targets are sparse or short (splice-junction tracks, single-value abundance), so the decoder packs multiple into its 1,000-token budget.⁵⁵ This acts as a regularizer, forcing the model to learn joint dependencies between, say, splicing and conservation even when the explicit task was only splicing.⁵⁶
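The first two choices above reduce to a small rule, sketched here under stated assumptions: the stage sizes are taken as round thousands (the exact token counts per stage are not given), and the `kind` labels and function name are hypothetical.

```python
import random

CURRICULUM = [1_000, 2_000, 4_000, 8_000, 10_000]  # assumed encoder window per stage

def fit_to_window(kind, length, window, rng):
    """Return a (start, end) span to keep, or None if the example is discarded."""
    if length <= window:
        return (0, length)
    if kind == "protein":
        return None  # drop: truncation would remove long-range contacts
    start = rng.randrange(length - window + 1)
    return (start, start + window)  # crop: nucleic acids tolerate a random window

rng = random.Random(0)
for window in CURRICULUM:
    rna_span = fit_to_window("rna", 6_500, window, rng)
    protein_span = fit_to_window("protein", 6_500, window, rng)
    print(window, rna_span, protein_span)
```

A 6,500-token RNA is usable at every stage, cropped early and whole later, while a protein track of the same length only enters training once the window reaches 8k.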

Register tokens are trained via reconstruction with 0 to 10% random token dropout across all input tracks, which prevents them from degenerating into trivial copy mechanisms.⁵⁷ Mixing short peptides with long RNAs in one batch is also a real systems problem: naive padding wastes compute on heavy-tailed length distributions and naive packing truncates, so MIMIC uses dynamic workload balancing that routes short sequences into large batches and long sequences into small ones across GPUs.⁵⁸
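One generic way to get "short sequences in large batches, long sequences in small ones" is token-budget batching, sketched below. This is a standard length-bucketing technique offered for intuition; it is not claimed to be the authors' scheduler, and the budget and lengths are made up.

```python
def batch_by_token_budget(lengths, max_tokens):
    """Group sequence indices so padded size (count x longest) fits the budget."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches, current = [], []
    for i in order:
        # Lengths are ascending, so the candidate is the longest in its batch.
        if current and (len(current) + 1) * lengths[i] > max_tokens:
            batches.append(current)
            current = []
        current.append(i)
    if current:
        batches.append(current)
    return batches

lengths = [30, 45, 50, 60, 4_000, 9_500]  # hypothetical peptide and RNA token counts
batches = batch_by_token_budget(lengths, max_tokens=10_000)
print(batches)  # [[0, 1, 2, 3], [4], [5]]
```

The four short peptides share one batch with almost no padding, while each long RNA gets a batch of its own.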

Reconstruction: a 1B model holding its own against larger ones

The headline reconstruction result uses a sequence-completion (inpainting) task. A 100-token window in the center of a sequence is masked and reconstructed in a single pass by sampling from a near-zero-temperature softmax over the model's logits.⁵⁹ Random per-token accuracy is about 0.25 for nucleotides and about 0.04 for amino acids, which sets the floor.⁶⁰
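Both pieces of that setup are simple to sketch: locating a centered 100-token window, and seeing why a near-zero temperature makes sampling effectively deterministic. The logits and the temperature value of 0.01 are illustrative assumptions; the paper's exact temperature is not stated here.

```python
import math

def center_mask_span(length, window=100):
    """Half-open (start, end) span of a window centered in the sequence."""
    if length < window:
        raise ValueError("sequence shorter than the masked window")
    start = (length - window) // 2
    return (start, start + window)

def softmax(logits, temperature):
    """Temperature-scaled softmax, stabilized by subtracting the max logit."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

span = center_mask_span(1_000)            # (450, 550)
logits = [2.0, 1.5, 0.3, -1.0]            # hypothetical logits over A, C, G, U
warm = softmax(logits, temperature=1.0)   # probability spread over several tokens
cold = softmax(logits, temperature=0.01)  # nearly all mass on the top logit
```

At near-zero temperature the sample is in effect the argmax, so the reported accuracy measures the model's single best guess per position.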

For protein sequence completion, MIMIC conditioned with structural descriptors reaches the highest reconstruction accuracy against ProtBERT, ESM-2, ESM-C, and ESM3-open.⁶¹ On unspliced transcripts, with masking stratified into exonic and intronic regions, MIMIC with the full set of aligned genomic modalities beats two long-context genomic models trained on OpenGenome2 in both regions: NTv3 (masked reconstruction) and Evo 2 (autoregressive).⁶² Even sequence-only it is competitive with the top sequence-only models, and conditioning on more modalities yields consistent gains.⁶³ Beating a much larger autoregressive genomic model with a 1B model is the cleanest support for the diversity-over-scale thesis.

Figure 3: State-of-the-art sequence property prediction. (A) protein amino-acid inpainting: MIMIC (1B) vs ESM3-open, ESM-C, ESM-2 (650M), ProtBERT. (B-C) RNA intron/exon inpainting vs Evo 2 (7B) and NTv3 (650M). (D-E) PFMBench and mRNABench. (F) complex and Mendelian variant scores. Baseline names follow the paper's figure; the parenthetical parameter counts are each model's own published size. Source: Golkar et al., MIMIC (arXiv:2604.24506, 2026).

Representation quality holds up on standard probes. On PFMBench (protein function, structure, interaction, and developability across 11 tasks), MIMIC matches or exceeds ESM3, ESM-C, ProTrek, and SaProt, winning at least 7 of 11 tasks against each competitor.⁶⁴ It is strongest on protein-ligand binding (BindingDB, PDBbind) and best on every developability task, which the authors attribute to jointly pretraining on surface chemistry.⁶⁵

On mRNABench, MIMIC beats Evo 2 and Orthrus on 4 of 7 tasks and a Dilated ResNet on 6 of 7. Its largest gains are on aligned-modality tasks: GO function prediction (+2% vs Evo 2, +9% vs Orthrus) and protein localization (+2% vs Evo 2, +5% vs Orthrus).⁶⁶⁶⁷ The honest negatives sit in the same table: MIMIC is 8% ahead of Orthrus on the Sugimoto half-life set but 8% behind on the LBKWK half-life set, a benchmark from synthetic mRNA constructs where stability effects likely dominate.⁶⁸⁶⁹ All of this comes from the 566M encoder, one of the smaller networks in the comparison.⁵⁰

Ablations: does multimodality actually help, and when?

The thesis is that diverse data beats scaling, so the ablations are the load-bearing evidence. A skeptic's first question is whether the extra modalities do real work or just add parameters. They do real work, but unevenly.

Figure 4: Ablation on conditioning inputs, sorted by inpainting accuracy, with sequence-only at the bottom and all modalities at the top. (A) amino acids: sequence-only rising to all structural/chemical modalities. (B) intronic and (C) exonic RNA: effect of adding ATAC, phyloP, splice, CAGE, RASP and combinations. Source: Golkar et al., MIMIC (arXiv:2604.24506, 2026), Supplementary Figure S3.

For amino acids, the most effective single modalities are MaSIF hydrophobicity and tokenized backbone structure; for intronic sequence, evolutionary information (phyloP), splicing, and epigenetics (ATAC-seq or CAGE) drive the gains. In both cases combining all of them beats any single addition.⁷⁰⁷¹ The cross-cutting rule, and the one that protects the work from hype: multimodal conditioning is most useful where the auxiliary modality is ambiguous from sequence alone, rather than easily predictable from it.⁷²

That point is sharpened in a separate analysis that bins the uplift by how predictable each modality is from sequence.

Figure 5: Performance gains depend on modality predictability. (A) amino-acid inpainting uplift from hydrophobicity, binned by how well hydrophobicity is predicted from sequence. (B) RNA inpainting uplift from RASP2. Uplift peaks when a modality is informative but not trivially predictable from sequence. Source: Golkar et al., MIMIC (arXiv:2604.24506, 2026), Supplementary Figure S4.

This is the kind of result that earns credibility rather than spends it. If sequence already determines the modality, adding it is redundant; if the modality is noise, it adds nothing. The value lives in between, the expected signature of genuine information transfer. The scarce modalities in LORE, phyloP and MaSIF, are also the ones least predictable from sequence, which is exactly where the data table and the ablation meet: the rarest tracks carry the most non-redundant signal.

Splicing: matching the dedicated predictors

Splice-site detection is the cleanest head-to-head against specialized tools. Because splice sites are rare, with about 99.85% of nucleotide positions not splice sites, the metric is AUPR.⁷³ To compare fairly, each model gets the context its design needs: SpliceAI receives 5kb flanks each side, AlphaGenome a 16,384 bp interval, MIMIC the full sequence up to 10kb.⁷⁴ MIMIC's splice head outputs a 5-class distribution over non-splice, donor, acceptor, TSS, and TES at each position.⁷⁵ Because many genes exceed the context length, evaluation is restricted to genes under 25kb, with genes over 10kb scored on a common 10kb region carrying the most junctions.⁷⁶
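To see why AUPR is the right metric for rare events, the sketch below computes average precision, a common estimator of AUPR, in plain Python. The paper's exact estimator is not specified here, ties between scores are ignored, and the scores and labels are toy values.

```python
def average_precision(scores, labels):
    """Average precision: mean of the precision at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, precisions = 0, []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

scores = [0.9, 0.8, 0.7, 0.6]
labels = [1, 0, 1, 0]
ap = average_precision(scores, labels)  # (1/1 + 2/3) / 2

# With 99.85% negatives, an uninformative ranker scores near the positive rate.
chance_level = 1 - 0.9985
```

An uninformative model sits near 0.0015 on this metric, whereas the same model would score about 0.5 on ROC AUC, which is why small absolute AUPR differences can still be meaningful on splice sites.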

Level	Region	vs AlphaGenome	vs SpliceAI	vs NTv3
Gene	coding	+1%	+3%	+11%
Gene	non-coding	+12%	+14%	+311%
Transcript	coding	+6%	+5%	+11%
Transcript	non-coding	+26%	+16%	+146%

Splice-site AUPR deltas, MIMIC versus baselines. Source: Golkar et al., MIMIC (arXiv:2604.24506, 2026), Figure 3a-b.⁷⁷⁷⁸

At gene level MIMIC edges ahead of the dedicated forward predictors; at transcript level, where the joint generative formulation enables isoform-aware inference, it is clearly the strongest model.⁷⁷⁷⁸ One caveat on reading the table: the large non-coding multiples versus NTv3 (+311% and +146%) are gains over a weaker baseline on a rare-event metric, so they look more dramatic than the coding-region deltas against tuned splice predictors. Feeding the transcript start and end sites (TSS/TES) as explicit input improves recovery of internal splice structure by +3% coding and +5% non-coding, consistent with evidence that start-site usage shapes isoform selection.⁷⁹

Figure 6: Splice-site prediction and RNA design. (A) gene-level AUPR (coding/non-coding) vs AlphaGenome, SpliceAI, NTv3. (B) transcript-conditioned AUPR. (C) SPRY1 prediction, unconditioned vs conditioned. (D-E) design pipeline to revert the HBB IVS-II-654 C>T cryptic exon. Source: Golkar et al., MIMIC (arXiv:2604.24506, 2026).

The SPRY1 example shows the mechanism. Before conditioning, MIMIC predicts an extra exon not in the ground truth plus several spurious donor sites; after conditioning on transcript boundaries, all false-positive junctions drop to low probability and only the true exons survive.⁸⁰ The constraint is doing real work, not just rescoring.

Corrective RNA edits without reverting the mutation

The clinical demonstration targets the intronic beta-globin variant HBB c.316-197C>T (IVS-II-654 C>T), a canonical splice-altering mutation that promotes pseudoexon inclusion through aberrant splice-site gain.⁸¹ The constraint that makes it hard is fixing the splicing defect without reverting the causal mutation. That is the regime for RNA-level therapeutics, where A-to-I editing by ADAR enzymes can alter splice sites without touching the genome.⁸²

MIMIC's phyloP variant-effect score distinguishes the pathogenic C>T from a matched non-splice-altering C>A control, localizing signal to the cryptic acceptor and donor.⁸³ The design task prompts MIMIC with both the wild-type splice pattern and the pathogenic sequence with the C>T held fixed, generating edits in 30- or 50-nucleotide windows at varying distances. SpliceAI serves as an independent oracle, since scoring MIMIC's designs with MIMIC would be circular.⁸⁴⁸⁵ MIMIC produces designs that substantially reduce pathogenic splicing while keeping the causal mutation intact.⁸⁶ The result that matters most: effective corrections appear even when the editable window does not overlap the cryptic splice sites, so the model is doing indirect regulatory rewiring through exonic splicing regulatory elements rather than just patching the obvious site.⁸⁷

Protein binder design

For binder design the targets are PD-L1 (PDB 4ZQK chain A) and the larger hACE2 (6VW1 chain A), the latter against the SARS-CoV-2 spike RBD.⁸⁸ The team generated sequence ensembles under five conditioning strategies, 20 per strategy: MaSIF surface features alone (40% or 100% of residues near the binding site), backbone geometry alone, or backbone plus partial or full surface.⁸⁹ Generation is iterative, up to 10 cycles, with MIMIC used for both generation and verification; the top candidates are then folded with AlphaFold2, keeping designs at average pLDDT above 85 and scoring fidelity by TM-score against native.⁹⁰

Figure 7: Target-binder design. (A) the target complex (SARS-CoV-2-RBD and hACE2, PDB 6VW1). (C) pLDDT across conditioning strategies. (D-E) TM-score and MaSIF similarity to wild type, and AlphaFold3 cofolding iPTM. (F-G) sequence-identity diversity heatmaps (PD-L1 n=37, hACE2 n=20). Source: Golkar et al., MIMIC (arXiv:2604.24506, 2026).

Design confidence peaked when both backbone and surface chemistry were provided, and the benefit scaled with target complexity: surface chemistry alone reached pLDDT >= 85 for the smaller PD-L1, while the larger hACE2 leaned much harder on added backbone conditioning.⁹¹ Designed sequences recovered the wild-type fold (median TM-score > 0.85) and surface chemistry (median MaSIF similarity > 0.90).⁹² Binding evidence comes from a separate step, in silico cofolding with AlphaFold3: for PD-L1, 35 of 37 high-confidence designs hit iPTM > 0.75 (median 0.81); for the harder hACE2, only 2 of 20 high-fold designs were predicted to bind the spike RBD.⁹³⁹⁴ Two tools do two jobs here: AlphaFold2 for fold confidence and TM-score, AlphaFold3 for cofolding iPTM. Diversity matrices show designs at only about 50% sequence identity to wild type, so this is not memorization, and the all-in-silico evidence is exactly why the independent oracle and cofolding checks matter.⁹⁵⁹⁶

RNA chemical probing as semantic conditioning

The most conceptually interesting result treats experimental context as a modality. Rather than fixed per-assay output heads, MIMIC processes natural-language descriptions of experimental conditions alongside molecular data, using context as semantic conditioning for assay-dependent chemical probing.⁹⁷

Figure 8: RNA chemical-probing reactivity and structure. (A) Pearson r for predicted vs measured RASP2, sequence-only vs sequence+context. (B) condition-specific reactivity profiles, differentiating probes (icSHAPE vs DMS). (C) MIMIC-guided 2D structure: sequence alone (F1=0.404) vs MIMIC reactivity (F1=0.987) vs reference. (D) distribution of F1 improvement. Source: Golkar et al., MIMIC (arXiv:2604.24506, 2026).

Conditioning on the correct experimental context significantly improves transcriptome-wide RASP2 reactivity prediction over sequence-only models.⁹⁸ MIMIC distinguishes assay chemistry, separating the broad sensitivity of in vivo icSHAPE (NAI-N3 probe) from the A/C-restricted sensitivity of DMS-seq.⁹⁹ Feeding predicted reactivity into ViennaRNA takes the illustrated example from F1 = 0.404 (sequence alone) to F1 = 0.987 (MIMIC reactivity), a structure nearly identical to the experimentally guided reference.¹⁰⁰ Across 1,000 held-out transcripts, with three RNAfold (v2.7.2) predictions each, the improvement is robust: median delta F1 = +0.061, mean +0.095 +/- 0.194 s.d., with a one-sided paired Wilcoxon signed-rank test at P = 6.25e-47.¹⁰¹¹⁰² That Wilcoxon result is a statistically significant difference favoring MIMIC, not statistical equivalence; "nearly identical" describes the single illustrated transcript, not the whole distribution. Conditioning on phyloP conservation also lifts RASP2 accuracy, a structure-conservation synergy the authors liken to MSA coevolution in protein folding.¹⁰³
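For readers who want to reproduce this kind of comparison on their own structures, here is a minimal sketch of a base-pair F1 between two dot-bracket strings. It assumes the F1 is computed over base pairs and that structures are pseudoknot-free; the paper's exact scoring convention is not confirmed, and the example structures are made up.

```python
def base_pairs(dot_bracket):
    """Set of (i, j) index pairs encoded by a dot-bracket string (no pseudoknots)."""
    stack, pairs = [], set()
    for i, symbol in enumerate(dot_bracket):
        if symbol == "(":
            stack.append(i)
        elif symbol == ")":
            pairs.add((stack.pop(), i))
    return pairs

def pair_f1(predicted, reference):
    """F1 over base pairs between two dot-bracket structures."""
    pred, ref = base_pairs(predicted), base_pairs(reference)
    if not pred and not ref:
        return 1.0
    true_pos = len(pred & ref)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(ref)
    return 2 * precision * recall / (precision + recall)

reference = "((((...))))"
one_pair_short = "(((.....)))"  # misses the innermost pair of the reference
f1 = pair_f1(one_pair_short, reference)  # precision 1.0, recall 0.75
```

Missing a single pair out of four already pulls F1 down to about 0.86, which puts the illustrated jump from 0.404 to 0.987 in perspective.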

Variant effect: generation beats embeddings

The variant-effect work shows a related subtlety: how you read MIMIC matters as much as what it knows. MIMIC can expose task-relevant signal directly through generation rather than forcing every task through a generic embedding.¹⁰⁴ Its phyloP-based VEP scores a variant by the mean absolute perturbation in a 30-nucleotide window around the position, a low-dimensional regressor that is more interpretable and performs substantially better than embedding-based alternatives.¹⁰⁵
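A windowed perturbation score of this kind can be sketched as below. It assumes the score compares the model's predicted phyloP track for the reference and variant sequences and that the 30-nucleotide window is centered on the variant; both are readings of the description rather than confirmed details, and the tracks are synthetic.

```python
def window_perturbation_score(ref_track, alt_track, position, window=30):
    """Mean absolute difference between two predicted tracks around a variant."""
    half = window // 2
    start = max(0, position - half)
    end = min(len(ref_track), position + half)
    diffs = [abs(a - r) for r, a in zip(ref_track[start:end], alt_track[start:end])]
    return sum(diffs) / len(diffs)

# Hypothetical predicted phyloP tracks for the reference and variant sequences.
ref_track = [1.0] * 100
alt_track = list(ref_track)
for i in range(45, 55):
    alt_track[i] = 0.0  # the variant erases predicted constraint near position 50

score = window_perturbation_score(ref_track, alt_track, position=50)  # 10 / 30
```

The output is a single number per variant, which is what makes it a low-dimensional and inspectable alternative to probing a high-dimensional embedding.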

Figure 9: phyloP variant-effect prediction. (A) wild-type allele recovery from ClinVar pathogenic/benign SNVs, sequence-only vs sequence+phyloP. (B) normalized AUPRC split by sign of phyloP, raw phyloP vs MIMIC phylo-vep. Source: Golkar et al., MIMIC (arXiv:2604.24506, 2026), Supplementary Figure S5.

On complex variants the embedding-based VEP already leads (+33% vs SpliceBERT, the next-best), but the phyloP-based VEP vastly outperforms all embedding approaches (+82% over embedding-based MIMIC).¹⁰⁶ On Mendelian variants, switching from embedding-based to phyloP-based scoring improves MIMIC by 90%.¹⁰⁷ Underneath, MIMIC preferentially regenerates the wild-type allele when a pathogenic position is masked, and phyloP conditioning improves that recovery most for pathogenic variants, where evolutionary constraint is an especially strong signal.¹⁰⁸

Limitations

LORE captures only a subset of the measurements that shape molecular phenotype; many important facets remain absent or weakly represented.¹⁰⁹ The context scale is small relative to the biological processes one would want to capture, and the 1,000-token decoder window limits direct generation over long outputs.¹¹⁰ Splicing evaluation was restricted to genes under 25kb because many exceed the context length, so the strong AUPR numbers do not yet speak to long genes.⁷⁶ The MaSIF surface modality covers only about 10% of structures, so surface-conditioned protein results lean on a thin slice of the data. Tellingly, two of the scarce tracks, phyloP (518,480 samples) and MaSIF (1,671,206), are among those the ablations found most useful, while functional captions are scarcer still at 176,626.²⁹²⁴²⁶

The clinical and design results are entirely in silico: the HBB corrections validated against SpliceAI as an oracle, the binders against AlphaFold3 cofolding, not wet-lab assays. The honest negatives, the LBKWK half-life result where MIMIC trails by 8% and the hACE2 binder yield of 2 of 20, sit alongside the wins rather than being smoothed over.⁶⁹⁹⁴

There is also a reproducibility caveat. As of the preprint, MIMIC code and weights, plus LORE with its clustering, splits, and precomputed tokenization, were being prepared for release on the Polymathic AI GitHub under the MIT license, not yet fully downloadable.¹¹¹¹¹² The benchmark numbers above are the authors' own; the comparisons are set up carefully, with cluster-level splits and independent oracles, but independent replication will be the real test.

What it means

MIMIC's contribution is less any single benchmark and more a usable abstraction: treat sequence, structure, regulation, evolutionary constraint, and experimental context as different observations of one molecular state, then learn the joint distribution over it.¹¹³ The +1% to +3% gene-level gains over AlphaGenome and SpliceAI on coding-region splicing are modest (+5% to +6% at transcript level); the value is that the same weights also predict splice sites, redesign an HBB transcript, fold an RNA, score a ClinVar variant, and produce a PD-L1 binder, collapsing five systems into one.⁷⁷⁸⁷⁹³

The broader claim is the one to watch. If heterogeneous aligned data is genuinely an alternative growth axis to raw scale, the next biological foundation models may compete on how many modalities they align rather than how many tokens they ingest. A 1B model beating a larger one on reconstruction is a single data point, not a law. For a practitioner with the artifacts in hand, the cheapest useful experiment is to feed MIMIC the extra aligned modalities you already have. Then check whether conditioning helps where your inputs are ambiguous from sequence alone. And when LORE ships, the most reusable artifact may not be the model at all. It may be the aligned dataset that made it possible.¹¹²
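That experiment reduces to a paired comparison: score each example once from sequence alone and once with the extra modalities, then see where the difference concentrates. The sketch below assumes you already have both scores per example, whatever metric you use; the 0.5 ambiguity threshold, the function names, and the toy numbers are all illustrative assumptions. The paper uses a one-sided paired Wilcoxon signed-rank test for its RNA structure comparison; a simpler exact sign test stands in here so the example needs only the standard library.

```python
from math import comb
from statistics import median


def sign_test_p(deltas):
    """One-sided exact sign test: chance of seeing at least this many
    positive deltas if conditioning made no difference."""
    nonzero = [d for d in deltas if d != 0]
    n = len(nonzero)
    if n == 0:
        return 1.0
    k = sum(d > 0 for d in nonzero)
    return sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n


def uplift_by_stratum(records, threshold=0.5):
    """Group paired scores by how well sequence alone already does.

    records: iterable of (sequence_only_score, conditioned_score) pairs.
    Examples whose sequence-only score is below `threshold` count as
    "ambiguous"; the rest count as "clear".
    """
    strata = {"ambiguous": [], "clear": []}
    for baseline, conditioned in records:
        key = "ambiguous" if baseline < threshold else "clear"
        strata[key].append(conditioned - baseline)
    return {
        name: {"n": len(d), "median_delta": median(d), "p": sign_test_p(d)}
        for name, d in strata.items()
        if d
    }


# Toy scores for illustration only; these are not results from the paper.
toy = [(0.40, 0.90), (0.45, 0.80), (0.30, 0.70), (0.42, 0.60),
       (0.95, 0.95), (0.97, 0.96), (0.92, 0.93)]
for name, stats in uplift_by_stratum(toy).items():
    print(name, stats)
```

If the ablation pattern holds on your data, the ambiguous stratum should show the larger median gain and the clear stratum should sit near zero.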

References

Golkar, S., Kovalic, J., Espejo Morales, I., et al. (2026). MIMIC: A Generative Multimodal Foundation Model for Biomolecules. arXiv:2604.24506. RNA 2D structure: sequence-alone F1 = 0.404 vs MIMIC-guided F1 = 0.987, nearly identical to the experimentally guided reference. https://arxiv.org/abs/2604.24506 ↩
Golkar et al. (2026), MIMIC. n = 1,000 RNA sequences; one-sided paired Wilcoxon signed-rank test, P = 6.25e-47. arXiv:2604.24506. ↩
Golkar et al. (2026), MIMIC, abstract: LORE links nucleic acid, protein, evolutionary, structural, regulatory, and semantic/contextual modalities within partially observed biomolecular states. Submitted 27 April 2026, arXiv:2604.24506 [cs.AI]. https://arxiv.org/abs/2604.24506 ↩ ↩²
Polymathic AI (2026). MIMIC project entry and code/data release plan. https://github.com/PolymathicAI ; HuggingFace: https://huggingface.co/polymathicai/mimic ↩
Golkar et al. (2026), MIMIC, abstract: split-track encoder-decoder conditions on arbitrary subsets of observed modalities to reconstruct or generate missing components across genome, transcriptome, and proteome. arXiv:2604.24506. ↩
Golkar et al. (2026), MIMIC: at a comparable size to single-modality foundation models (1 billion parameters), MIMIC demonstrates the value of diverse and heterogeneous data as an alternative to pure scaling. arXiv:2604.24506. ↩
Golkar et al. (2026), MIMIC: the field has seen a proliferation of single-modality models that master one molecular modality but lack the capacity to operate across the broader central dogma. arXiv:2604.24506. ↩
Golkar et al. (2026), MIMIC: by focusing almost exclusively on forward prediction, these models fail to capture the full joint distribution of biological states. arXiv:2604.24506. ↩
Golkar et al. (2026), MIMIC: to understand a biological system, one must model the full joint probability distribution of its molecular states. arXiv:2604.24506. ↩
Golkar et al. (2026), MIMIC: given a desired protein structure, mRNA stability, and splicing pattern, what upstream nucleotide sequence is most likely to have produced them? arXiv:2604.24506. ↩
Abramson, J., Adler, J., Dunger, J., et al. (2024). Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature 630(8016):493-500. AlphaFold3 predicts the joint structure of complexes including proteins, nucleic acids, small molecules, ions, and modified residues within a single unified framework. https://doi.org/10.1038/s41586-024-07487-w ↩
Chen, J., Hu, Z., Sun, S., et al. (2022). Interpretable RNA Foundation Model from Unannotated Data for Highly Accurate RNA Structure and Function Predictions. arXiv:2204.00300. RNA-FM is a 12-layer transformer trained on 23 million ncRNA sequences via self-supervised learning. https://arxiv.org/abs/2204.00300 ↩
Golkar et al. (2026), MIMIC: approximately 15.5 million proteins and 13 million RNA transcripts from over 6,000 organisms, plus over 4 billion tokens of biomedical, functional, and contextual text. arXiv:2604.24506. ↩
Golkar et al. (2026), MIMIC: collected organisms span all domains of life and contain over 25 million transcripts and 123 billion nucleotides. arXiv:2604.24506. ↩
Golkar et al. (2026), MIMIC: full human and mouse genomes with GENCODE annotations, plus over 6,000 additional organisms from NCBI RefSeq. arXiv:2604.24506. ↩
Golkar et al. (2026), MIMIC: the central design principle of LORE is alignment, distinct from sequence-only corpora such as OpenGenome2 or BFD. arXiv:2604.24506. ↩
Golkar et al. (2026), MIMIC: each LORE row is an observation of a transcript, a protein, or a pairing of the two, containing the available subset of modalities. arXiv:2604.24506. ↩
Golkar et al. (2026), MIMIC: UniProt-to-RefSeq/Ensembl mapping yields a core subset of approximately 2 million aligned transcript/protein pairs. arXiv:2604.24506. ↩
Golkar et al. (2026), MIMIC: LORE is intentionally constructed as a partially observed multimodal dataset; no example is required to contain all modalities. arXiv:2604.24506. ↩
Golkar et al. (2026), MIMIC: approximately 150 distinct modality presence signatures across the dataset. arXiv:2604.24506. ↩
Golkar et al. (2026), MIMIC, Table S1: nucleic acid sequence 12,967,153 samples. arXiv:2604.24506. ↩
Golkar et al. (2026), MIMIC, Table S1: CAGE promoter usage 15,073,298 samples; ATAC-seq chromatin accessibility 14,021,301 samples. arXiv:2604.24506. ↩ ↩²
Golkar et al. (2026), MIMIC, Table S1: RNA chemical probing (RASP2) 1,644,404 samples. arXiv:2604.24506. ↩
Golkar et al. (2026), MIMIC, Table S1: evolutionary conservation (phyloP) 518,480 samples. arXiv:2604.24506. ↩ ↩²
Golkar et al. (2026), MIMIC, Table S1: amino acid sequence, DSSP secondary structure, backbone tertiary structure, and SASA each 15,607,838 samples. arXiv:2604.24506. ↩
Golkar et al. (2026), MIMIC, Table S1: MaSIF chemical surface 1,671,206 samples; protein abundance 1,803,028; functional captions 176,626. arXiv:2604.24506. ↩ ↩² ↩³ ↩⁴
Golkar et al. (2026), MIMIC, Table S1: biomedical corpus 3,797,568 samples; 4 billion tokens after tokenization. arXiv:2604.24506. ↩
Golkar et al. (2026), MIMIC: phyloP conservation from 100-way alignments for human and 30-way for mouse. arXiv:2604.24506. ↩
Golkar et al. (2026), MIMIC: MaSIF surface features computed on about 10% of structures, roughly 50,000 CPU hours for 1.6M datasets. arXiv:2604.24506. ↩ ↩²
Golkar et al. (2026), MIMIC: MMseqs2 clustering with -c 0.8 and --min-seq-id 0.3 or 0.7, one representative per cluster. arXiv:2604.24506. ↩
Golkar et al. (2026), MIMIC, Table S6: protein 30% identity 42,943,944 clusters from 299,566,796 members; RNA 30% identity 10,861,904 clusters from 25,001,445 members. arXiv:2604.24506. ↩
Golkar et al. (2026), MIMIC: cluster-level train/validation/test splits ensure no validation sequence shares significant similarity with training; 30% clusters define splits, 70% clusters ensure diversity. arXiv:2604.24506. ↩
Golkar et al. (2026), MIMIC: approximately 200M AlphaFoldDB v4 structures retained at pLDDT > 70. arXiv:2604.24506. ↩
Golkar et al. (2026), MIMIC: ESM3 structure tokenizer (VQ-VAE) with 4,096 classes based on local backbone geometry. arXiv:2604.24506. ↩
Golkar et al. (2026), MIMIC: built upon the 4M/AION encoder-decoder backbone. arXiv:2604.24506. ↩
Golkar et al. (2026), MIMIC: the split-track architecture sums aligned modalities while concatenating distinct molecular entities. arXiv:2604.24506. ↩
Golkar et al. (2026), MIMIC: the nucleic acid track element-wise sums sequence, structural data, and conservation scores into a single representation. arXiv:2604.24506. ↩
Golkar et al. (2026), MIMIC: split-track summation keeps total sequence length tractable regardless of modality count while preserving positional alignment between co-localized signals. arXiv:2604.24506. ↩
Golkar et al. (2026), MIMIC: encoder processes a unified token sequence of a nucleic acid track, an amino acid track, semantic/contextual token groups, and learnable register tokens. arXiv:2604.24506. ↩
Golkar et al. (2026), MIMIC: register tokens serve as information sinks pooling features across DNA, protein, and context tracks into a compact vector summary. arXiv:2604.24506. ↩
Golkar et al. (2026), MIMIC: all text-based modalities (context, functional captions, taxonomy) share a single embedding space with the biomedical corpus. arXiv:2604.24506. ↩
Golkar et al. (2026), MIMIC: RoPE with a local group-reset strategy; the position index restarts at zero for each track group. arXiv:2604.24506. ↩
Golkar et al. (2026), MIMIC: asymmetric context window, 10,000 tokens for the encoder and 1,000 for the decoder. arXiv:2604.24506. ↩
Golkar et al. (2026), MIMIC: decoupling observed context from predicted output allows conditioning on any observed subset and generating any target subset without architectural modification or task-specific output heads. arXiv:2604.24506. ↩
Jaegle, A., Borgeaud, S., Alayrac, J.-B., et al. (2022). Perceiver IO: A General Architecture for Structured Inputs & Outputs. ICLR 2022, arXiv:2107.14795. Flexible querying mechanism produces outputs of various sizes while scaling linearly with input and output size. https://arxiv.org/abs/2107.14795 ↩
Golkar et al. (2026), MIMIC, Table S7: encoder depth 20, decoder depth 12, hidden width 1,536, 24 attention heads per stack, 5 register tokens, RoPE fraction 0.75. arXiv:2604.24506. ↩
Golkar et al. (2026), MIMIC, Table S7: self-attention masks are 50% bidirectional, 25% causal, 25% anti-causal, following the finding that causal attention can outperform bidirectional on certain tasks. arXiv:2604.24506. ↩
Golkar et al. (2026), MIMIC: training distributed by default across 92 NVIDIA H200 GPUs using NCCL. arXiv:2604.24506. ↩
Golkar et al. (2026), MIMIC, Table S8: AdamW, betas (0.9, 0.95), weight decay 0.05, cosine learning-rate decay, bfloat16 precision. arXiv:2604.24506. ↩
Golkar et al. (2026), MIMIC: only the 566M-parameter encoder is used to generate embeddings; one of the smaller networks evaluated. arXiv:2604.24506. ↩ ↩²
Golkar et al. (2026), MIMIC: training defined as a stochastic sampling process over approximately 25 heuristic pathways, each specifying required and optional input and target modalities. arXiv:2604.24506. ↩
Golkar et al. (2026), MIMIC: staged curriculum scaling context from 1k to 2k to 4k to 8k to 10k tokens, learning local features before long-range dependencies. arXiv:2604.24506. ↩
Golkar et al. (2026), MIMIC: RNA/DNA are randomly cropped to fit the window because RNA folding is largely driven by local base-pairing. arXiv:2604.24506. ↩
Golkar et al. (2026), MIMIC: a protein whose structure track exceeds the budget is dropped rather than cropped, since a truncated protein cannot reliably predict structure. arXiv:2604.24506. ↩
Golkar et al. (2026), MIMIC: target-packing strategy fills the 1,000-token decoder budget for short or sparse targets. arXiv:2604.24506. ↩
Golkar et al. (2026), MIMIC: target packing acts as a regularizer, forcing the model to learn joint dependencies between splicing and conservation. arXiv:2604.24506. ↩
Golkar et al. (2026), MIMIC: register tokens trained via reconstruction with 0 to 10% random token dropout across all input tracks, preventing them from degenerating into copy mechanisms. arXiv:2604.24506. ↩
Golkar et al. (2026), MIMIC: dynamic workload balancing across GPUs to handle heavy-tailed sequence-length distributions, routing short sequences into large batches and long sequences into small ones (training systems, Section C.4). arXiv:2604.24506. ↩
Golkar et al. (2026), MIMIC: a 100-token center window is masked and reconstructed in a single pass by sampling from a softmax at temperature 1e-8. arXiv:2604.24506. ↩
Golkar et al. (2026), MIMIC: random per-token reconstruction is about 0.25 for nucleotides and about 0.04 for amino acids. arXiv:2604.24506. ↩
Golkar et al. (2026), MIMIC: against ProtBERT, ESM-2, ESM-C, and ESM3-open, MIMIC conditioned with structural descriptors achieves the highest reconstruction accuracy. arXiv:2604.24506. ↩
Golkar et al. (2026), MIMIC: against NTv3 (masked) and Evo 2 (autoregressive) trained on OpenGenome2, MIMIC achieves the highest reconstruction accuracy in both intronic and exonic regions. arXiv:2604.24506. ↩
Golkar et al. (2026), MIMIC: even sequence-only, MIMIC is competitive with top sequence-only models, and conditioning on more modalities yields consistent improvements. arXiv:2604.24506. ↩
Golkar et al. (2026), MIMIC: on PFMBench (protein function, structure, interaction, developability; 11 tasks), MIMIC matches or exceeds ESM3, ESM-C, ProTrek, and SaProt, winning at least 7 of 11 tasks against each. arXiv:2604.24506. ↩
Golkar et al. (2026), MIMIC: strongest on protein-ligand binding (BindingDB, PDBbind) and best on all developability tasks, attributed to jointly pretraining on surface chemistry. arXiv:2604.24506. ↩
Golkar et al. (2026), MIMIC: on mRNABench, MIMIC outperforms Evo 2 and Orthrus on 4 of 7 tasks, a Dilated ResNet on 6 of 7, and all other methods on the remaining tasks. arXiv:2604.24506. ↩
Golkar et al. (2026), MIMIC: largest mRNABench gains on GO function (+2% vs Evo 2, +9% vs Orthrus) and protein localization (+2% vs Evo 2, +5% vs Orthrus). arXiv:2604.24506. ↩
Golkar et al. (2026), MIMIC: Sugimoto half-life set, +8% relative improvement vs Orthrus. arXiv:2604.24506. ↩
Golkar et al. (2026), MIMIC: LBKWK half-life set, MIMIC trails by 8% vs Orthrus; benchmark from synthetic mRNA constructs where stability effects likely dominate. arXiv:2604.24506. ↩ ↩²
Golkar et al. (2026), MIMIC, Figure S3a: most effective single AA modalities are MaSIF hydrophobicity and tokenized backbone structure; combining all structural and chemical modalities beats any single one. arXiv:2604.24506. ↩
Golkar et al. (2026), MIMIC, Figure S3b: largest intronic gains come from phyloP, splicing, and epigenetics (ATAC-seq or CAGE peaks). arXiv:2604.24506. ↩
Golkar et al. (2026), MIMIC, Figure S4: multimodal conditioning is most useful where the auxiliary modality is ambiguous from sequence alone rather than easily predictable; uplift peaks for informative-but-not-trivial modalities. arXiv:2604.24506. ↩
Golkar et al. (2026), MIMIC: about 99.85% of nucleotide positions are not splice sites, motivating AUPR. arXiv:2604.24506. ↩
Golkar et al. (2026), MIMIC: SpliceAI receives 5kb flanking on each side, AlphaGenome a 16,384-bp interval, MIMIC the full sequence up to 10kb. arXiv:2604.24506. ↩
Golkar et al. (2026), MIMIC: splice junction head outputs a 5-class distribution over non-splice, donor, acceptor, TSS, TES at each position. arXiv:2604.24506. ↩
Golkar et al. (2026), MIMIC: evaluation restricted to genes shorter than 25kb; genes over 10kb scored on a common 10kb region with the most splice junctions, because many genes exceed MIMIC's context length. arXiv:2604.24506. ↩ ↩²
Golkar et al. (2026), MIMIC, Figure 3a: gene-level splice AUPR +1% vs AlphaGenome, +3% vs SpliceAI, +11% vs NTv3 (coding); +12% vs AlphaGenome, +14% vs SpliceAI, +311% vs NTv3 (non-coding). arXiv:2604.24506. ↩ ↩² ↩³
Golkar et al. (2026), MIMIC, Figure 3b: transcript-level splice AUPR +6% vs AlphaGenome, +5% vs SpliceAI, +11% vs NTv3 (coding); +26% vs AlphaGenome, +16% vs SpliceAI, +146% vs NTv3 (non-coding). arXiv:2604.24506. ↩ ↩²
Golkar et al. (2026), MIMIC: providing TSS and TES as input improves internal splice-structure recovery (+3% coding, +5% non-coding). arXiv:2604.24506. ↩
Golkar et al. (2026), MIMIC: SPRY1, before conditioning MIMIC predicts an extra exon and alternative donor sites; after conditioning, false-positive junctions drop and only true exons remain. arXiv:2604.24506. ↩
Golkar et al. (2026), MIMIC: HBB c.316-197C>T (IVS-II-654 C>T), a canonical splice-altering mutation promoting pseudoexon inclusion through aberrant splice-site gain. arXiv:2604.24506. ↩
Nishikura, K. (2016). A-to-I editing of coding and non-coding RNAs by ADARs. Nature Reviews Molecular Cell Biology 17(2):83-96. A-to-I editing by ADAR enzymes can alter splice sites and recode coding sequence. https://doi.org/10.1038/nrm.2015.4 ↩
Golkar et al. (2026), MIMIC: phyloP VEP score distinguishes the pathogenic C>T from a matched non-splice-altering C>A control, localizing signal to the cryptic acceptor and donor. arXiv:2604.24506. ↩
Golkar et al. (2026), MIMIC: prompted with both the wild-type splice pattern and the pathogenic sequence with C>T fixed, generating designs for 30- or 50-nucleotide editable windows at varying distances. arXiv:2604.24506. ↩
Golkar et al. (2026), MIMIC: designs evaluated with SpliceAI as an independent oracle to avoid circularity with MIMIC-predicted splicing. arXiv:2604.24506. ↩
Golkar et al. (2026), MIMIC: produces designs that substantially reduce the likelihood of pathogenic splicing while maintaining the causal mutation. arXiv:2604.24506. ↩
Golkar et al. (2026), MIMIC: corrections found even when the editable window does not overlap the cryptic splice sites, demonstrating indirect regulatory rewiring through exonic splicing regulatory elements. arXiv:2604.24506. ↩ ↩²
Golkar et al. (2026), MIMIC: design targets PD-L1 (PDB 4ZQK chain A) and hACE2 (PDB 6VW1 chain A), the latter against the SARS-CoV-2 spike RBD. arXiv:2604.24506. ↩
Golkar et al. (2026), MIMIC: five conditioning strategies, 20 designs per strategy (backbone only; MaSIF surface for 40% or 100% of binding-site residues; backbone plus partial or full surface). arXiv:2604.24506. ↩
Golkar et al. (2026), MIMIC: iterative design (up to 10 cycles, MIMIC for generation and verification); 200 hyperparameter sets per case filtered by ESMFold to top 20, then AlphaFold2 predicts folds, filtering to average pLDDT > 85 and scoring TM-score against native. arXiv:2604.24506. ↩
Golkar et al. (2026), MIMIC: design confidence peaks when both backbone structure and surface chemistry are provided; surface alone reaches pLDDT >= 85 for PD-L1, while the larger hACE2 relies more heavily on backbone conditioning. arXiv:2604.24506. ↩
Golkar et al. (2026), MIMIC: designed sequences recover the wild-type fold (median TM-score > 0.85) and surface chemistry (median MaSIF similarity > 0.90). arXiv:2604.24506. ↩
Golkar et al. (2026), MIMIC: in silico co-folding with AlphaFold3, PD-L1, 35 of 37 high-confidence designs reach iPTM > 0.75 (median iPTM = 0.81). arXiv:2604.24506. ↩ ↩²
Golkar et al. (2026), MIMIC: AlphaFold3 cofolding, hACE2, only 2 of 20 high-confidence-fold designs predicted to bind the SARS-CoV-2 spike RBD at iPTM > 0.75. arXiv:2604.24506. ↩ ↩²
Golkar et al. (2026), MIMIC: generated designs maintain about 50% sequence identity to wild type. arXiv:2604.24506. ↩
Abramson et al. (2024), AlphaFold3: generative models are prone to hallucination, inventing plausible-looking structure even in unstructured regions. Nature 630. ↩
Golkar et al. (2026), MIMIC: instead of fixed output heads, MIMIC processes natural language descriptions of experimental conditions directly alongside molecular data, using context as semantic conditioning for assay-dependent chemical probing. arXiv:2604.24506. ↩
Golkar et al. (2026), MIMIC: correct experimental context significantly improves transcriptome-wide RASP2 predictions over sequence alone. arXiv:2604.24506. ↩
Golkar et al. (2026), MIMIC: MIMIC differentiates the broad sensitivity of in vivo icSHAPE (NAI-N3) from the A/C-restricted sensitivity of DMS-seq, generating condition-specific reactivity tracks. arXiv:2604.24506. ↩
Golkar et al. (2026), MIMIC: feeding MIMIC-predicted reactivity into ViennaRNA yields F1 = 0.987 vs F1 = 0.404 sequence-alone on the illustrated case, nearly identical to the experimentally guided reference. arXiv:2604.24506. ↩
Golkar et al. (2026), MIMIC: 1,000 held-out transcripts, three RNAfold (v2.7.2) predictions per transcript; median delta F1 = +0.061, mean +0.095 +/- 0.194 s.d. arXiv:2604.24506. ↩
Golkar et al. (2026), MIMIC: structural improvement significant at P = 6.25e-47 by a one-sided paired Wilcoxon signed-rank test. arXiv:2604.24506. ↩
Golkar et al. (2026), MIMIC: conditioning on phyloP improves RASP2 prediction, a structure-conservation synergy analogous to MSA coevolution in protein structure prediction. arXiv:2604.24506. ↩
Golkar et al. (2026), MIMIC: MIMIC exposes task-relevant signal directly through generation rather than forcing every task through a generic embedding. arXiv:2604.24506. ↩
Golkar et al. (2026), MIMIC: the phyloP-based VEP scores a variant by mean absolute perturbation in a 30-nucleotide window; the low-dimensional regressor is more interpretable and performs substantially better in practice. arXiv:2604.24506. ↩
Golkar et al. (2026), MIMIC: on complex variants the embedding-based VEP leads (+33% vs SpliceBERT), but the phyloP-based VEP outperforms it (+82% over embedding-based MIMIC). arXiv:2604.24506. ↩
Golkar et al. (2026), MIMIC: on Mendelian variants, switching from embedding- to phyloP-based VEP improves performance by 90%. arXiv:2604.24506. ↩
Golkar et al. (2026), MIMIC: phyloP conditioning significantly increases correct wild-type allele recovery from ClinVar SNVs, more pronounced for pathogenic variants. arXiv:2604.24506. ↩
Golkar et al. (2026), MIMIC: LORE captures only a subset of measurements shaping molecular phenotype; many facets remain absent or weakly represented. arXiv:2604.24506. ↩
Golkar et al. (2026), MIMIC: current context scale remains limited, and the restricted 1,000-token decoder window limits direct generation over long outputs. arXiv:2604.24506. ↩
Golkar et al. (2026), MIMIC: code and weights, and LORE with clustering, splits, and precomputed tokenization, are being prepared for public release under the MIT license. arXiv:2604.24506. ↩
Polymathic AI (2026). MIMIC and LORE release on the Polymathic AI GitHub (MIT license). https://github.com/PolymathicAI/MIMIC ; HuggingFace: https://huggingface.co/polymathicai/mimic ↩ ↩²
Golkar et al. (2026), MIMIC: jointly modeling the interconnected layers of molecular biology through flexible multimodal representations of aligned molecular state. arXiv:2604.24506. ↩
