Benchmarks
An independent, held-out evaluation of foundation models for sequence biology — built to measure what survives a domain shift, not what tops a clean leaderboard.
Public leaderboards reward models that fit the test. This benchmark is being built to do the opposite: measure where models break when they meet messy, shifting biological data.
It is in preparation. If you would like to be told when the first results are published — or have a model or task you would like covered — get in touch.
hello@rewire.it
Held-out test sets
Evaluation on data the models could not have memorised — low-homology splits, time-based holdouts, and genuinely external references.
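As an illustration only, a time-based holdout can be sketched as below. The benchmark's actual splitting procedure has not been published, so the `Record` type, its `deposited` field, and the cutoff date are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Record:
    """Hypothetical sequence record; field names are assumptions."""
    seq_id: str
    deposited: date  # date the sequence first became public


def time_split(records, cutoff):
    """Split records so that everything deposited on or after `cutoff`
    is held out. A model trained only on data from before the cutoff
    cannot have seen the held-out records."""
    train = [r for r in records if r.deposited < cutoff]
    held_out = [r for r in records if r.deposited >= cutoff]
    return train, held_out


records = [
    Record("seq_a", date(2021, 5, 1)),
    Record("seq_b", date(2022, 11, 30)),
    Record("seq_c", date(2023, 1, 1)),
    Record("seq_d", date(2024, 3, 15)),
]
train, held_out = time_split(records, cutoff=date(2023, 1, 1))
```

A date cutoff alone does not guarantee low homology: a held-out sequence may still be a close relative of a training sequence, which is why the two kinds of split are listed separately above.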
Domain-shift robustness
How accuracy moves when a model is taken off its training distribution: different assays, organisms, and clinical sites.
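One simple way to express "how accuracy moves" is the gap between accuracy on the training domain and accuracy on each shifted domain. The sketch below assumes predictions have already been scored as correct or incorrect and tagged with a domain label; the function names and domain labels are hypothetical, not part of the benchmark.

```python
from collections import defaultdict


def accuracy_by_domain(examples):
    """examples: iterable of (domain, is_correct) pairs.
    Returns a dict mapping each domain to its accuracy."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for domain, is_correct in examples:
        totals[domain] += 1
        hits[domain] += int(is_correct)
    return {d: hits[d] / totals[d] for d in totals}


def shift_gap(acc, reference_domain):
    """Accuracy lost on each domain relative to the reference
    (in-distribution) domain. Positive values mean the model does worse."""
    base = acc[reference_domain]
    return {d: base - a for d, a in acc.items() if d != reference_domain}


scored = [
    ("train_assay", True), ("train_assay", True),
    ("new_assay", True), ("new_assay", False),
]
acc = accuracy_by_domain(scored)
gaps = shift_gap(acc, reference_domain="train_assay")
```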
Hardware & inference cost
Parameters, memory, and throughput — the operational reality of running a model, not just its headline accuracy.
Uncertainty & graceful degradation
Whether a model knows when it is wrong, and how its confidence behaves on the hardest, most novel inputs.
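A common way to quantify whether confidence tracks correctness is expected calibration error (ECE). The sketch below is a minimal, equal-width-bin version, offered as an example of the kind of measurement involved; whether the benchmark will use ECE or another metric is not stated here.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE.

    confidences: predicted probabilities in [0, 1] for the chosen label.
    correct: booleans, True where the prediction was right.
    Returns the bin-size-weighted mean of |mean confidence - accuracy|.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have equal length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")

    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(mean_conf - accuracy)
    return ece


# Four predictions at 0.9 confidence, three of them right:
# the model is overconfident by 0.15.
overconfident = expected_calibration_error(
    [0.9, 0.9, 0.9, 0.9], [True, True, True, False]
)
```

Computing the same quantity separately on in-distribution and novel inputs shows whether calibration holds up or collapses under shift.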