Reading Between the Genes: When AI Cracks the 98%

The shift in four numbers

1,000,000

DNA base pairs read in a single pass — at single-letter resolution, with no resolution-range trade-off

11

Biological modalities predicted simultaneously, replacing a fragmented ecosystem of dozens of specialist tools

5,930

Human genome tracks predicted across diverse cell types — a comprehensive functional map of genomic activity

100+

Countries in which thousands of scientists have used the model since its research preview release

The control code, finally legible

When the human genome was first sequenced at the turn of the millennium, scientists expected that reading the complete sequence of DNA would rapidly unlock the genetic basis of disease. That expectation proved dramatically optimistic. Sequencing the genome turned out to be considerably easier than understanding it — because the vast majority of DNA, the 98% that does not directly encode proteins, remained functionally opaque. Google DeepMind's AlphaGenome, published in Nature in January 2026, changes that fundamental equation.

AlphaGenome processes up to one million DNA letters simultaneously, predicting thousands of molecular properties at single base-pair resolution across eleven distinct biological modalities — something previously requiring a fragmented collection of two dozen or more specialist tools. It matched or outperformed state-of-the-art specialist models across 25 of 26 benchmark evaluations while delivering all outputs from a single, unified framework.

// The thesis in one paragraph

More than 98% of human DNA does not code for proteins. This regulatory genome controls when, where, and how much of each gene gets expressed — and most disease-linked variants identified in large genetic studies sit precisely in these non-coding regions. The field has known these regions matter for decades. What it lacked was the ability to reliably predict what any given variant in them actually does. That gate has now opened. If the protein structure model required roughly five years to translate from research breakthrough to widespread pharmaceutical deployment, the regulatory genome model may be on a similar trajectory — with implications for drug discovery timelines, diagnostic precision, and the economics of genomic medicine that are only beginning to be mapped.

Sequencing the genome gave science the letters of the instruction manual. The regulatory genome is the grammar, the punctuation, and the syntax, and for most of the past two decades we could read none of them with any reliability.

// Section 01 of 06

The code that controls the code

The sequencing of the human genome, completed in the early 2000s, was one of the great scientific achievements of the modern era. It was also the beginning of a humbling realisation: having the complete sequence of three billion DNA letters told researchers considerably less than they had hoped about what those letters actually do.

The reason lies in what the genome actually contains. Only a small fraction — approximately 2% — consists of genes in the conventional sense: sequences that are directly transcribed and translated into the proteins that perform most of biology's physical work. The remaining 98% was long dismissed as evolutionary detritus, labelled "junk DNA" and largely set aside as researchers focused on the protein-coding minority.

// The genomic reality — two domains, one ignored

The "junk" turns out to be the control system for everything else

Domain	Share of genome	What it does	Historical readability
Protein-coding DNA	~2%	Builds the proteins, the biological machines. Encodes the targets of most approved drugs.	Well-studied for decades
Regulatory (non-coding) DNA	~98%	The control system — enhancers, silencers, insulators, splice signals, chromatin organisers. Determines which genes are active, in which cell type, at which time, in what quantity. Most disease-genetics-study variants sit here.	Largely unreadable at scale until now

The regulatory genome is not junk. It is the most complex operating system in nature — a layer of control logic that determines whether a stem cell becomes a neuron or a liver cell, whether a tumour suppressor gene fires when it should, whether a developmental programme completes correctly. The consequences of regulatory malfunction are as severe as any protein-coding mutation, and in many cases more subtle and therefore harder to detect.

The practical consequence for medicine has been significant. Large-scale genetic studies — scanning hundreds of thousands of individuals to find variants statistically associated with disease — have identified thousands of genetic risk factors for conditions ranging from cancer to heart disease to autoimmune disorders. The overwhelming majority of these variants sit in the regulatory 98%. Knowing a variant is associated with risk is not the same as knowing what it does. Without the ability to predict the functional consequence of a regulatory variant, identifying it provides only limited clinical or scientific utility.

// Section 02 of 06

The fragmentation era — why dozens of specialist tools could not solve the problem

The research community's response to the regulatory genome's complexity was to divide it into smaller problems, each tractable in isolation. The result was an ecosystem of increasingly powerful specialist tools — each solving one part of the problem with great precision, none capable of addressing the whole.

Splicing prediction tools identified where RNA is cut and rejoined. Chromatin accessibility models mapped which regions of DNA are physically available for regulatory proteins to bind. Three-dimensional genome architecture tools predicted how distant genomic regions physically contact one another. These contacts allow an enhancer located hundreds of thousands of base pairs away from a gene to control that gene's activity. Each tool was excellent at its specific task. But biology does not compartmentalise in the way these models did.

The resolution-range trade-off

Earlier models were forced to choose between reading short DNA sequences at high resolution — capturing fine-grained base-pair-level detail — or reading longer sequences at lower resolution that could capture regulatory elements far from the gene they control. A variant's effect might depend on regulatory elements 100,000 base pairs or more from the gene it influences. Previous models simply could not see both the variant and its distant target simultaneously with sufficient precision to make meaningful predictions.
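The trade-off can be made concrete with simple arithmetic. The sketch below is illustrative only: the window and bin sizes are round, approximate figures, and the function names and the 150,000 bp variant-to-element distance are hypothetical choices made for the example.

```python
def output_positions(window_bp: int, bin_bp: int) -> int:
    """Number of prediction positions a model emits for one input window."""
    return window_bp // bin_bp


def fits_in_window(window_bp: int, distance_bp: int) -> bool:
    """True if two sites this far apart can sit inside a single window."""
    return distance_bp < window_bp


# Illustrative settings: (window length in bp, output bin size in bp).
settings = {
    "short window, fine resolution": (10_000, 1),
    "long window, coarse bins": (200_000, 128),
    "long window, fine resolution": (1_000_000, 1),
}

distance_bp = 150_000  # hypothetical distance between a variant and an enhancer

summary = {}
for name, (window_bp, bin_bp) in settings.items():
    summary[name] = {
        "positions": output_positions(window_bp, bin_bp),
        "single_base": bin_bp == 1,
        "covers_distance": fits_in_window(window_bp, distance_bp),
    }
    print(name, summary[name])
```

Only the last setting is both single-base and wide enough to hold the variant and the distant element together, which is the combination earlier models could not offer.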

The breadth-depth trade-off

Specialist tools that achieved high accuracy on one biological modality — splice site prediction, for example — provided no information about the other regulatory mechanisms a mutation might simultaneously disrupt. A variant that shifts chromatin accessibility, alters transcription factor binding, changes histone modification patterns, and modifies RNA splicing all at once was interpretable only by running multiple separate models and attempting to integrate their outputs manually — a process that was both slow and subject to inconsistencies between models trained on different datasets.

The fragmentation of the genomics modelling toolkit reflected a genuine computational constraint. Processing one million DNA base pairs at single-nucleotide resolution, simultaneously predicting multiple biological properties, across multiple cell types and tissues, requires enormous compute — memory and processing demands that scale with sequence length in ways that made the unified approach impractical until recent advances in AI architecture and hardware made it tractable.

// Section 03 of 06

The architecture of the breakthrough — both trade-offs solved at once

AlphaGenome resolves both the resolution-range and breadth-depth trade-offs through a technical architecture that combines convolutional neural networks — which excel at detecting local sequence patterns with base-pair precision — with transformer layers that model long-range dependencies between distant genomic regions. The DeepMind team trained the model on massive publicly funded genomics datasets — including ENCODE, GTEx, and FANTOM5 — covering gene regulation across hundreds of cell types in human and mouse.

The combination is not conceptually novel: hybrid CNN-transformer architectures have been explored in genomics for several years. What is new is the scale at which it operates and the efficiency with which the compute challenge has been addressed. Distributed training across specialised AI accelerators enables the model to process sequences five times longer than its predecessor while running on approximately half the compute budget — a dramatic improvement in the effective efficiency of the approach.
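One way to see why pairing convolutions with transformer layers helps is to count what the attention layers have to handle. The sketch below is a back-of-envelope calculation, not a description of AlphaGenome's actual layer configuration: the pooling factor of 128 and the helper names are assumptions for illustration. It relies only on the fact that standard self-attention compares every token with every other token, so the number of pairwise scores grows with the square of the token count.

```python
def attention_pairs(num_tokens: int) -> int:
    """Pairwise scores computed by one standard self-attention head."""
    return num_tokens * num_tokens


def tokens_after_pooling(sequence_bp: int, pool_factor: int) -> int:
    """Tokens left once convolution and pooling summarise pool_factor bases each."""
    return sequence_bp // pool_factor


SEQUENCE_BP = 1_000_000
POOL_FACTOR = 128  # assumed for illustration only

raw_pairs = attention_pairs(SEQUENCE_BP)
pooled_tokens = tokens_after_pooling(SEQUENCE_BP, POOL_FACTOR)
pooled_pairs = attention_pairs(pooled_tokens)

print(f"attention over raw bases: {raw_pairs:.2e} pairs")
print(f"attention over {pooled_tokens} pooled tokens: {pooled_pairs:.2e} pairs")
print(f"reduction factor: {raw_pairs / pooled_pairs:,.0f}x")
```

In a design of this kind, the convolutional layers handle base-level patterns, and attention only has to relate the much shorter pooled sequence across the full window.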

// Resolution vs. range — the trade-off resolved

Sequence length and resolution across generations of genomic AI models

Model class	Sequence length	Resolution	Modalities predicted
Prior unified (e.g. Enformer)	~200,000 bp	128 bp bins	Limited multi-modal
Specialist tools (e.g. SpliceAI)	<10,000 bp	1 bp (single)	One task at a time
AlphaGenome	1,000,000 bp	1 bp (single)	11 simultaneously

The eleven biological modalities predicted simultaneously represent a near-comprehensive view of gene regulation — the equivalent of replacing two dozen specialist instrument players with a single orchestra capable of performing all their parts together. The model predicts: gene expression levels across cell types; transcription initiation sites; where and how RNA is spliced; chromatin accessibility (whether DNA is physically available for regulatory binding); histone modification patterns; transcription factor binding sites; and three-dimensional chromatin contact maps that reveal which distant regulatory elements are physically interacting with which genes.

// Gene expression & transcription initiation

Predicts how much RNA a gene produces (the most direct measure of whether and how actively it is switched on across different cell types) and identifies exactly where in the DNA sequence transcription begins — critical for understanding which regulatory variants affect gene activation from the start.

// RNA splicing patterns

Predicts where pre-messenger RNA is cut and joined to create mature RNA, including both junction-site identification and splicing usage levels. Splicing errors cause many rare genetic diseases, and predicting them was previously among the hardest mutation-effect tasks in genomics.

// Chromatin accessibility & histone marks

DNA wrapped tightly around protein spools is inaccessible to regulatory molecules. The model predicts which regions are physically open, and which carry the chemical tags on histones that determine whether the surrounding genes are active or silent — a critical layer of epigenetic regulation beyond DNA sequence itself.

// TF binding & 3D chromatin contacts

Regulatory proteins that bind DNA and control gene expression, plus the three-dimensional folding pattern that determines which enhancers can physically interact with which genes. Contacts across hundreds of thousands of base pairs — exactly the long-range interactions previous models could not capture at single-letter resolution.

// Section 04 of 06

The leukaemia demonstration — what it can do that nothing before could

The power of a unified model that simultaneously predicts multiple regulatory mechanisms becomes most visible in concrete examples. The most striking demonstration published by the Google DeepMind team alongside the AlphaGenome paper involves a specific type of blood cancer — and what it reveals about how the model works in practice is more informative than any benchmark statistic.

In T-cell acute lymphoblastic leukaemia, a specific genetic mutation activates a cancer-driving gene called TAL1. This activation was known — years of painstaking laboratory work had established the connection. What AlphaGenome demonstrated was that it could identify this mechanism from nothing but the raw DNA sequence: predicting not only that the mutation was functionally significant, but specifically how it acts — through a particular transcription factor binding site, disrupting the normal regulatory logic that keeps TAL1 activity controlled.

AlphaGenome showed the same mutation simultaneously affecting chromatin accessibility, altering transcription factor binding patterns, and changing histone modification profiles: the multi-layered regulatory context that had required years of separate experiments to establish. Arriving at this interpretation computationally, from sequence alone, in the time it takes to submit an API request, represents a qualitative shift in what is possible in genomic research.

The model correctly identified not just that a mutation was damaging, but precisely why — through which molecular mechanisms, via which regulatory intermediaries, with what downstream consequences for gene activity. That level of mechanistic resolution from sequence alone is genuinely without precedent.

This example illustrates the practical value in clinical and research contexts. In cancer, where tumours may carry hundreds or thousands of mutations, knowing which ones are functionally significant — and why — is the difference between having a target and not. In rare disease, where a patient may have a variant in a regulatory region that no previous study has documented, a model that can predict its functional consequence provides diagnostic insight that the existing database-lookup approach cannot. In drug discovery, identifying the precise molecular mechanism by which a regulatory variant drives disease pathology is a prerequisite for designing therapeutic interventions that address cause rather than symptom.
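The logic underneath sequence-based variant interpretation is a comparison: predict regulatory activity for the reference sequence, predict it again with the variant in place, and examine the difference for each modality. The sketch below shows that comparison with a deliberately trivial stand-in model. Everything in it is hypothetical: `toy_predict`, the motifs, and the sequences are invented for illustration and have no connection to the AlphaGenome API or to the real TAL1 locus.

```python
# Toy stand-in for a sequence-to-function model: each "modality" score is just
# a count of a made-up motif. A real model would return learned predictions.
TOY_MOTIFS = {
    "tf_binding": "AACGG",
    "accessibility": "GGCC",
}


def toy_predict(sequence: str) -> dict[str, int]:
    """Return a hypothetical score per modality for a DNA sequence."""
    return {name: sequence.count(motif) for name, motif in TOY_MOTIFS.items()}


def apply_insertion(sequence: str, position: int, inserted: str) -> str:
    """Insert bases immediately before the given 0-based position."""
    return sequence[:position] + inserted + sequence[position:]


def variant_effect(reference: str, alternate: str) -> dict[str, int]:
    """Per-modality difference: alternate prediction minus reference prediction."""
    ref_scores = toy_predict(reference)
    alt_scores = toy_predict(alternate)
    return {name: alt_scores[name] - ref_scores[name] for name in ref_scores}


reference = "TTGACCATAACTTGGCCATT"  # invented sequence
alternate = apply_insertion(reference, 10, "CGG")  # invented insertion

effect = variant_effect(reference, alternate)
print(effect)
```

In this toy case the insertion creates one new copy of the made-up binding motif and leaves the other score unchanged, so the per-modality difference points to a mechanism rather than just flagging the variant.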

// Model performance — against specialist benchmarks

Comparative accuracy across the major genomic prediction tasks where specialist models previously led

Task	Previous best (specialist)	AlphaGenome (unified)	Outcome
Overall benchmark ranking	Multiple specialist leaders	Matched or exceeded on 25 of 26 evaluations	Near-universal best-in-class
Gene expression QTL prediction	State-of-the-art specialist	+25.5% accuracy gain	Substantial improvement
Chromatin accessibility QTL	ChromBPNet and specialists	+8% accuracy gain	Meaningful improvement
Transcription factor binding sites	Dedicated TFBS models	Outperformed across panel	More accurate than specialist
RNA splicing prediction	SpliceAI	Matched or exceeded; new dual junction+usage prediction	New capability layer
Multi-modal simultaneous prediction	Not available — required separate tools	11 modalities from one input, one model	Novel capability class
Long-range regulatory influence (≤100K bp)	Limited by short sequence windows	Captures distant elements at single-bp resolution	New reach; >100K bp limitation remains

// Section 05 of 06

The application landscape — where regulatory genomics creates value

The ability to reliably predict the functional consequences of regulatory DNA variants opens a set of application domains that were previously either entirely inaccessible or accessible only at prohibitive cost and timescale. The breadth of these applications reflects the centrality of gene regulation to virtually all of biology and medicine.

// Cancer diagnostics & research

Tumours carry many mutations, but most diagnostic frameworks classify them primarily by the protein-coding mutations they contain. A significant fraction of cancer-driving mutations act through regulatory mechanisms — activating oncogenes, silencing tumour suppressors, or disrupting the transcriptional programmes that control cell growth. The ability to predict regulatory variant effects enables tumour stratification by regulatory disruption profile, potentially revealing therapeutic vulnerabilities invisible to protein-coding analysis alone. The leukaemia demonstration is the direct proof of concept.

// Rare disease diagnosis

Many patients with rare diseases that have clear genetic underpinnings remain undiagnosed after comprehensive protein-coding gene sequencing — because the causative variant sits in the regulatory genome where current diagnostic tools cannot reliably interpret it. A model that predicts regulatory variant function enables clinical teams to investigate non-coding variants in patients with unresolved diagnoses, potentially identifying the molecular cause of conditions that have remained clinically mysterious for years.

// Drug discovery & target validation

Genome-wide association studies have identified thousands of genetic variants statistically linked to disease risk. The pharmaceutical industry has spent billions attempting to convert these statistical associations into actionable drug targets — a process severely hampered by the inability to predict what the associated variants actually do. A model that can score the regulatory effects of thousands of GWAS variants simultaneously, identifying the precise molecular mechanisms through which they confer risk, transforms the efficiency of this target identification process.
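Prioritisation at this scale reduces to scoring every candidate variant and ranking by predicted effect size. The sketch below assumes per-modality effect scores are already available from some predictive model. The variant identifiers and numbers are invented placeholders, and ranking by the largest absolute effect across modalities is just one simple rule among many possible ones.

```python
# Hypothetical per-variant, per-modality effect scores (alternate minus reference).
# Identifiers and values are placeholders, not real data.
predicted_effects = {
    "variant_A": {"expression": 0.02, "splicing": -0.01, "accessibility": 0.03},
    "variant_B": {"expression": -0.61, "splicing": 0.05, "accessibility": -0.40},
    "variant_C": {"expression": 0.10, "splicing": 0.72, "accessibility": 0.00},
}


def strongest_effect(effects: dict[str, float]) -> tuple[str, float]:
    """Return the modality with the largest absolute effect, and that effect."""
    modality = max(effects, key=lambda name: abs(effects[name]))
    return modality, effects[modality]


def prioritise(all_effects: dict[str, dict[str, float]]) -> list[tuple[str, str, float]]:
    """Rank variants by their largest absolute predicted effect, strongest first."""
    rows = [
        (variant, *strongest_effect(effects))
        for variant, effects in all_effects.items()
    ]
    return sorted(rows, key=lambda row: abs(row[2]), reverse=True)


ranking = prioritise(predicted_effects)
for variant, modality, effect in ranking:
    print(f"{variant}: {modality} {effect:+.2f}")
```

A ranking like this only orders hypotheses for experimental follow-up; it does not establish which variants are causal.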

// Gene therapy & synthetic biology

The ability to predict the regulatory consequences of any DNA sequence is, in principle, also the ability to design regulatory sequences with specific desired properties. For gene therapy, this means designing promoters and enhancers that activate therapeutic genes only in the cell types and tissues where they are needed — precisely the tissue-specific control that current gene therapy approaches struggle to achieve. For synthetic biology, it enables the rational design of genetic circuits with defined expression profiles.
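Design inverts the prediction problem: instead of scoring a given sequence, search for a sequence that scores well. The sketch below uses greedy single-base hill climbing against a toy objective. The scoring rule (`toy_activity`, which rewards similarity to an arbitrarily chosen motif) is purely hypothetical and stands in for a learned model; a real design workflow would also need additional constraints and experimental testing.

```python
BASES = "ACGT"
MOTIF = "TATAAA"  # arbitrary target motif for the toy objective


def toy_activity(sequence: str) -> int:
    """Toy objective: most motif positions matched at any offset in the sequence."""
    best = 0
    for start in range(len(sequence) - len(MOTIF) + 1):
        window = sequence[start:start + len(MOTIF)]
        best = max(best, sum(a == b for a, b in zip(window, MOTIF)))
    return best


def greedy_design(sequence: str, max_rounds: int = 20) -> str:
    """Repeatedly apply the single-base change that most improves the score."""
    current = sequence
    for _ in range(max_rounds):
        best_seq, best_score = current, toy_activity(current)
        for i in range(len(current)):
            for base in BASES:
                candidate = current[:i] + base + current[i + 1:]
                score = toy_activity(candidate)
                if score > best_score:
                    best_seq, best_score = candidate, score
        if best_seq == current:  # no improving mutation left
            break
        current = best_seq
    return current


start = "GGCGCGCCGG"
designed = greedy_design(start)
print(start, toy_activity(start), "->", designed, toy_activity(designed))
```

Greedy search is used here only because it is easy to follow; with a realistic predictive model the objective is far less smooth, and other search strategies may be more appropriate.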

An important caveat applies across all of these applications: the model is a research tool, not a clinical instrument. Its predictions are probabilistic rather than deterministic, and its outputs represent hypotheses to be tested experimentally rather than conclusions to be acted upon directly. The researchers who developed it have explicitly noted that its results cannot be straightforwardly applied to individual clinical decisions in its current form. The value lies in dramatically accelerating the earlier stages of the research and discovery pipeline: identifying which of the thousands of regulatory variants in any given dataset are most likely to be functionally significant, and in what direction.

// Section 06 of 06

The broader arc — biological AI's systematic conquest of molecular biology

The regulatory genome model does not arrive in isolation. It is the latest chapter in a systematic programme of applying large-scale AI to the foundational problems of molecular biology — a programme whose earlier chapters have already demonstrated that the translation from research breakthrough to broad scientific and commercial impact can occur within a five-to-ten-year horizon.

// The Alpha series — biological AI by problem

Sequence of breakthroughs from Google DeepMind, with the trajectory each implies

Model	Year	Problem solved	Impact trajectory
AlphaFold	2020–2021	3D protein folding from amino acid sequence — a 50-year structural biology problem	Already embedded in drug discovery pipelines globally; 2M+ researchers using it
AlphaProteo	2024	De novo protein design: generating novel proteins that bind chosen target molecules	Enabling next-generation protein therapeutics and synthetic biology tools
AlphaGenome	Published in Nature, Jan 2026	What non-coding DNA does — predicting regulatory function across 11 modalities at single-bp resolution	Translation to clinical and commercial use: 3–7 years estimated

The trajectory from AlphaFold to widespread pharmaceutical integration took approximately four to six years — from the initial research publication through the gradual integration into drug discovery workflows that is now essentially universal in the industry. AlphaGenome is roughly at the same stage as AlphaFold was in 2021: a demonstrated research capability of extraordinary power, available to the scientific community via API access, with commercial applications not yet fully defined but clearly substantial.

What the AlphaFold timeline suggests

If AlphaGenome follows a comparable adoption trajectory to AlphaFold, commercial pharmaceutical and diagnostic applications should begin to appear in meaningful form between 2028 and 2031. Google DeepMind has noted it is still determining the commercial availability framework — suggesting an active internal evaluation of how to structure access for clinical and industrial applications beyond the current research API.

Where the analogy has limits

The regulatory genome problem is in some respects more complex than protein folding. AlphaGenome's outputs are probabilistic predictions of biological activity rather than deterministic physical structures, and their translation into drug targets requires additional experimental validation steps that protein structures do not. The commercial trajectory may therefore be longer and more mediated by clinical validation requirements — but the eventual impact on genomic medicine, rare disease diagnosis, and cancer stratification could be equally transformative.

// Investment & strategic implications — regulatory genomics AI

Where value accrues across the ecosystem as regulatory genome interpretation capability develops

Domain	Near-term (2026–2028)	Medium-term (2028–2032)	Key consideration
Pharmaceutical drug discovery	GWAS-to-target conversion accelerated; regulatory variant prioritisation integrated into early discovery workflows	Mechanism-guided drug design; multi-indication expansion from single target discovery	Requires experimental validation; regulatory submissions will need evidence standards
Genomic diagnostics companies	Non-coding variant interpretation added to existing sequencing workflows; diagnostic yield improvements for unresolved rare disease	Standard incorporation into clinical genomics reporting; reimbursement pathway development	Clinical validity standards must be established; regulatory approval required
Gene therapy developers	Improved promoter/enhancer design for tissue specificity; reduced off-target expression risk	Regulatory sequence design as standard part of gene therapy development pipeline	IND-enabling studies still required; safety demonstration unchanged
AI compute infrastructure	Training and inference demand from research users; TPU/GPU demand from genomics labs	Commercial deployment of regulatory genome APIs at scale; inference infrastructure grows with clinical adoption	Compute concentrated at training; inference relatively light per query
Genomics data providers	Value of large, well-annotated genomic datasets increases as model training requires high-quality regulatory data	Biobank data partnerships; longitudinal regulatory genomics datasets become commercially valuable	Data consent and governance frameworks remain a constraint
Synthetic biology platforms	Improved regulatory sequence design reduces iteration cycles in cell-line and organism engineering	Rational design of full gene regulatory circuits; acceleration across agri, industrial, pharma biotech	Biosafety regulation and public acceptance remain constraints

Life had an operating system all along. Now we can read it.

The sequencing of the human genome was completed in 2003 after more than a decade of international effort and three billion dollars of investment. Scientists at the time expected that having the complete DNA sequence would rapidly transform medicine, enabling a genetic understanding of disease that would accelerate drug discovery and personalise treatment. That promise proved slow to materialise — not because the genome was unimportant, but because the sequence alone was insufficient. The regulatory logic that governs how the genome is read, in which cell types and at what times, remained opaque.

The new generation of AI models capable of interpreting regulatory DNA at scale and single-base-pair resolution represents the fulfilment, at least in research form, of the original promise of genomics. The ability to predict what a regulatory variant does — not just that it sits in a disease-associated region, but how it alters chromatin structure, transcription factor binding, RNA splicing, and gene expression, simultaneously and at single-letter precision across a million base pairs of DNA — is a capability that changes what questions can be asked, and how quickly.

The limitations are real and should not be minimised. Predictions require experimental validation. The model currently struggles with regulatory elements more than 100,000 base pairs from their target gene. Clinical translation will require regulatory frameworks that do not yet exist. And the transition from research tool to clinical and commercial application will take years of careful validation work. These are the expected constraints of an early-stage scientific tool at the beginning of its deployment cycle — not fundamental barriers to its eventual impact.

The protein structure prediction model was a research curiosity when first published. Within five years it had become an industry-standard component of the pharmaceutical drug discovery pipeline, used by researchers in virtually every major biotech and pharmaceutical company in the world. Regulatory genome interpretation is at the equivalent stage in its development arc. The questions for investors, pharmaceutical strategists, and clinical innovators are not whether the capability will translate to commercial and clinical utility — but how fast, through which applications, and with what regulatory and validation requirements along the way.

// The closing thought

For twenty-five years, the genome was a complete instruction manual that nobody could fully read. The 2% that codes for proteins was legible. The 98% that controls them was not. The interpreter has now arrived — and the biological sciences will not look the same in a decade as they do today.

Lualdi Advisors is a quantitative research firm. We build predictive models, AI systems, and operational ontologies. We publish working notes on the topics that intersect with the firm's practice — biological AI, drug discovery economics, decision engineering. Open a conversation if you want the firm's view on AlphaGenome's commercial trajectory, the regulatory-genomics value chain, or comparative analysis of the biological AI landscape.

Source notes. Google DeepMind, "AlphaGenome: unified prediction of regulatory variant effects across the human genome," published in Nature (January 2026; doi: 10.1038/s41586-025-10014-0). Accompanying commentary in Science (AAAS), Scientific American, IEEE Spectrum, ScienceBlog, and Euronews. Predecessor model context drawn from published technical reports for AlphaFold and AlphaProteo. Performance figures reflect published benchmarks. Investment and strategic implications are Lualdi Advisors' analytical framework, illustrative rather than prescriptive. Actual outcomes may differ materially. This material does not constitute medical, clinical, scientific, investment, legal, or financial advice.
