Why Gene Signatures Fail in Another Cohort

You have probably seen some version of this before.

A gene signature looks convincing in one cohort and then starts to lose its power in another. The prognostic signal becomes weaker. The diagnostic classification becomes less reliable. Patient groups that separated nicely before no longer separate so clearly.

And after some time, you begin to ask yourself: was the original result really as solid as it looked?

I have worked in bioinformatics and gene regulation research for nearly four decades, and during that time I have watched the field change many times.

We moved from early sequence-analysis methods to deep learning. From candidate-gene studies to whole-genome analysis. From pathway diagrams drawn by hand to models integrating several omics layers simultaneously.

The tools have changed enormously.

But one problem has never really gone away.

You build a gene signature and, at first, it looks solid.

The statistics are clean. The signal is strong. You run the analysis, do the cross-validation, check the assumptions, and everything appears robust.

Then you apply the gene signature to another case. To another biological system.

A different cell type. A slightly different inflammatory state. Another patient cohort. A disease sample instead of a healthy one.

And suddenly the signature begins to drift.

Not always dramatically. Sometimes it fails cleanly. More often, it partially holds, partially changes, and leaves you with an uncomfortable feeling that something is going on, but you do not yet see what exactly.

So you do what all of us do.

You go through the usual checklist. Batch effects. Normalisation. Data quality. Sample composition. Pipeline settings.

Everything looks reasonable.

And this is usually the moment when another thought begins to surface.

What if the data is actually fine?

What if the signature was genuinely real in the first biological context, but biology solved the same problem differently in the second one?

In my experience, this point is very easy to underestimate.

In machine learning, everybody understands the importance of independent test sets, validation data, cross-validation, and overfitting. The problem is not that researchers have forgotten the basic rules of model evaluation.

The deeper issue is different.

Biological systems are not fixed engineering systems.

The same biological output can be produced through different regulatory mechanisms depending on the cell type, tissue, stimulus, disease state, or treatment condition.

And here even a very powerful model such as AlphaGenome becomes an interesting example.

Imagine using AlphaGenome to analyse a non-coding regulatory region near an inflammatory gene. The model predicts changes in transcription factor binding, chromatin accessibility, or gene expression in a particular biological context.

Such a prediction may be very useful.

But now imagine taking the same regulatory region and placing it into another biological condition.

In one immune cell type, regulation may be driven mainly by NF-κB. In another, the same local sequence may participate in NFAT-dependent regulation. In a myeloid context, C/EBP may become more important. Under glucocorticoid treatment, repression may involve interference with AP-1 or NF-κB activity.

The DNA sequence may remain identical.

The predicted feature may even look similar.

But the active regulatory grammar has changed.

So what does this regulatory element actually mean?

The answer is not universal.

It depends on which regulatory programme is active in this particular biological context.

This is why replication in biology is not only a technical question.

It is also a mechanistic question.

A gene signature can replicate statistically and still mean something slightly different. And sometimes it fails to replicate, not because the original result was wrong, but because the second biological system came to a superficially similar state through another route.

Two cell types may activate inflammation through overlapping genes but through different upstream regulators. Two tumours may express the same marker while depending on different master regulators. Two patients may show a similar pathway signature but arrive there through different combinations of mutations, epigenetic states, and microenvironmental signals.

In my experience, this happens much more often than we openly admit.

You present a result and the familiar question appears almost immediately:

Does it hold in another dataset?

It is a fair question.

But I think the more important question is one level deeper.

Not only whether the statistics transfer.

But whether the mechanism transfers.

The model learned a pattern. But did it learn what is behind this pattern?

Most AI models in biology are built on correlations.

This is not criticism. This is what they are designed to do, and many of them do it remarkably well.

Give them enough RNA-seq data, enough omics layers, enough samples, enough genomic tracks, and they will detect signals.

Often real signals. Often very useful ones.

But correlation is still not a mechanism.

A pattern that holds in one cell type may stop holding in another because the underlying regulatory logic has changed.

Different transcription factors. Different enhancer usage. Different chromatin states. Different signalling activity. Different network rewiring. Different disease constraints.

The model recognised the pattern.

But it may not have learned why this pattern exists.

And when the biological context changes, the pattern may begin to drift. Unless a model is grounded in mechanism, it has limited ability to know whether the same signature should still mean the same thing.
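To make this drift concrete, here is a minimal toy simulation in Python. It is only a sketch under strong assumptions, not a model of any real dataset: the regulators "TF1" and "TF2", the ten synthetic genes, the noise levels, and all function names are hypothetical and chosen purely for illustration. Genes 0-4 follow TF1 and genes 5-9 follow TF2 in both cohorts. The only thing that differs is which regulator drives the outcome. A signature is selected by correlation in cohort A and then scored, unchanged, in cohort B.

```python
import random
from math import sqrt


def pearson(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def simulate_cohort(n_samples, driver, rng):
    """Simulate one cohort (hypothetical toy model).

    Genes 0-4 track TF1 and genes 5-9 track TF2 in every cohort.
    `driver` decides which regulator controls the outcome.
    """
    genes, outcome = [], []
    for _ in range(n_samples):
        tf1, tf2 = rng.gauss(0, 1), rng.gauss(0, 1)
        row = [tf1 + rng.gauss(0, 0.5) for _ in range(5)]
        row += [tf2 + rng.gauss(0, 0.5) for _ in range(5)]
        upstream = tf1 if driver == "TF1" else tf2
        genes.append(row)
        outcome.append(upstream + rng.gauss(0, 0.5))
    return genes, outcome


def signature_score(genes, gene_idx):
    """Per-sample mean expression over the signature genes."""
    return [sum(row[j] for j in gene_idx) / len(gene_idx) for row in genes]


def run_demo(seed=0, n_samples=500):
    rng = random.Random(seed)
    genes_a, y_a = simulate_cohort(n_samples, "TF1", rng)  # context A
    genes_b, y_b = simulate_cohort(n_samples, "TF2", rng)  # context B

    # "Discover" the signature in cohort A only: the five genes most
    # strongly correlated with the outcome.
    per_gene = [abs(pearson([row[j] for row in genes_a], y_a)) for j in range(10)]
    signature = sorted(sorted(range(10), key=lambda j: -per_gene[j])[:5])

    # Apply the same signature, unchanged, to both cohorts.
    r_a = pearson(signature_score(genes_a, signature), y_a)
    r_b = pearson(signature_score(genes_b, signature), y_b)
    return signature, r_a, r_b


if __name__ == "__main__":
    signature, r_a, r_b = run_demo()
    print("signature genes:", signature)
    print("correlation with outcome in cohort A:", round(r_a, 2))
    print("correlation with outcome in cohort B:", round(r_b, 2))
```

In this toy setting, the signature genes are expressed and co-regulated in exactly the same way in both cohorts, and neither cohort has a batch effect or a data-quality problem. The signature tracks the outcome strongly in cohort A and barely at all in cohort B, simply because the outcome is reached through a different regulator. Real biology is far messier, but the failure mode has the same shape.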

This is not only a noise problem.

It is not only a sample-size problem.

It is not only a preprocessing problem.

It is, to a large extent, a structural property of biology.

Living systems are degenerate. Similar outputs can emerge through different mechanisms.

They are adaptive. Under disease, treatment, or stress, they rewire.

They are context-dependent. The same molecule can play very different roles depending on cellular state.

And they are combinatorial. Transcription factors, enhancers, and signalling pathways rarely act alone.

This difference matters enormously in drug discovery.

A signature that does not transfer across biological contexts is not simply an academic inconvenience.

It can become a candidate-selection decision made on unstable ground.

A biomarker that appears valid until the underlying mechanism changes.

And then months of work may follow a signal that was correct in one context, but misleading in the next.

So perhaps the question worth asking is not only:

Did this replicate?

Perhaps the better question is:

Do we understand why it should replicate — and under which biological conditions it should not?

Answering this question requires a different kind of knowledge than expression patterns alone can provide.

If AI is grounded in causal biological structure — transcription factor binding, promoter and enhancer logic, signalling pathways, master regulators, and disease mechanisms — it gains access not only to patterns, but also to possible explanations.

This is important because explanations can travel across biological contexts better than correlations.

They help us move from a gene signature to the possible causal mechanism behind it.

And this is exactly what we need in biology: not only a list of genes, but a mechanistic explanation.

If this failure mode sounds familiar – if you have seen a result hold in one context and quietly shift meaning or fall apart in another – I would genuinely love to hear your thoughts in the comments.

Have you seen this happen in your own work? Or do you think we are looking at the problem the wrong way?

We have been exploring questions at the intersection of biology and AI across recent posts on this page – from mechanistic interpretation to why models fail across contexts. 

If this topic interests you, you may find the earlier discussions useful as well.