Taking the Mask Off: What Protein Binder RL Is Actually Learning
Part 1: An investigation into the collapse
This post builds on ideas from Nick Boyd’s essay on RL and PDB-derived reward signals, particularly his analysis of structural diversity and reward bias.
The models have been getting better at generating first-pass binders. Brian Naughton’s guide walks you through designing a VHH from scratch: pick a target, run your generative model of choice, filter by your in-silico structure metric of choice, and send it away to the lab for in vitro testing.
As Nick Boyd (founder of Escalante) mentions, there are currently two main approaches to computational protein binder design: optimization (exemplified by BindCraft) and generative models (like BoltzGen). BoltzGen is faster, but its per-binder design quality is much lower; BindCraft has the opposite profile, being slower but producing higher-quality designs. So the net computational cost is roughly the same. In his essay, Boyd shows how you can improve upon BoltzGen by borrowing the LLM post-training playbook: finetune on high-quality hallucinated binders from optimization-based methods, apply GRPO, and watch your binder quality improve.
It would seem as if the binder generation problem has been solved, but that’s far from the case.
Leash Bio showed that when we analyze what ChEMBL-trained models actually learn, it’s less about molecular binding and more about which medicinal chemists tend to make which kinds of molecules. Their model Hermes, a lightweight 50M-parameter sequence-only transformer trained entirely on combinatorially synthesized molecules with no human design intent, outperforms Boltz2 on out-of-distribution chemistry despite being several orders of magnitude smaller.
There are uncomfortable parallels to be observed. ChEMBL is biased by what medicinal chemists chose to synthesize: their preferences, their intuitions, their career-long optimization toward molecules that look drug-like to a human expert. PDB has the same problem. It’s biased by what structural biologists chose to crystallize: proteins that are stable enough to survive purification, well-behaved enough to form crystals, interesting enough to justify the grant money. Both are human-curated snapshots of a tiny corner of an astronomically vast space.
As Nick Boyd emphasizes in his essay, BoltzGen, Protenix, and ProteinMPNN are all downstream of PDB: structures of what structural biologists chose to crystallize. When you apply GRPO with a reward signal that’s just PDB all the way down, can you confidently say you’re truly learning what makes a good binder in a general sense? Or are you optimizing for something a structural biologist would have crystallized?
In this article, we explore whether applying reinforcement learning causes a collapse in generated binders. We generated 200 structures per condition across three model checkpoints — base BoltzGen, an SFT checkpoint finetuned on high-quality Mosaic hallucinated binders, and an RL checkpoint trained on top of that — and evaluated structural diversity using Foldseek across five targets spanning in-distribution (ACE2, CCL2, PDL1), near-OOD (EGFR), and far-OOD (KRAS) regimes.
Methods
Trimming the targets
For PDL1, CCL2, and KRAS, full protein sequences were used. For EGFR and ACE2, truncations were necessary to fit within GPU memory constraints on an RTX 5090. For EGFR, residues 190–505 were retained, preserving domains II and III — the core dimerization arm and primary ligand-binding domain. For ACE2, the full M2 peptidase domain was kept, covering the entire SARS-CoV-2 binding interface. Both truncated structures were refolded with AlphaFold3 to confirm iPTM did not drop meaningfully.
Evaluating Structural Diversity amongst the Binders
Once we had binders for all targets, we used Foldseek to perform three steps: database construction, intra-set structural clustering at TM-score threshold 0.5, and easy-search against PDB100. We report Shannon entropy of the cluster distribution as our primary diversity metric, cluster diversity (unique clusters / N) as a secondary metric, and mean TM-score to nearest PDB hit as our PDB sociology signal.
Measuring Collapse: Two Lenses on the Same Problem
To measure structural diversity within each generated set, we compute the Shannon entropy of the Foldseek cluster distribution:
$$H = -\sum_{i} p_i \log p_i$$
where p_i is the fraction of structures belonging to cluster i. Higher entropy means the generated set spans many distinct structural solutions. Lower entropy means the set has collapsed toward a small number of dominant folds; in the extreme case of the PDL1 baseline, every structure falls into a single cluster, giving H = 0.
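The two within-set diversity metrics described above can be sketched in a few lines. This is a minimal illustration rather than our exact pipeline: it assumes cluster assignments arrive as (representative, member) pairs, uses the natural logarithm (the choice of base only rescales H), and the function names are hypothetical.

```python
import math
from collections import Counter


def cluster_sizes(pairs):
    """Count members per cluster from (representative, member) pairs.

    The pair layout is an assumption about how cluster assignments are stored.
    """
    return list(Counter(rep for rep, _member in pairs).values())


def shannon_entropy(sizes):
    """Shannon entropy of the cluster-size distribution, in nats."""
    total = sum(sizes)
    probs = [n / total for n in sizes if n > 0]
    return sum(-p * math.log(p) for p in probs)


def cluster_diversity(sizes):
    """Unique clusters divided by the number of structures."""
    return len(sizes) / sum(sizes)


# Toy data: four structures, three of them in one cluster.
toy_pairs = [("a", "a"), ("a", "b"), ("a", "c"), ("d", "d")]
toy_sizes = cluster_sizes(toy_pairs)
toy_entropy = shannon_entropy(toy_sizes)
toy_diversity = cluster_diversity(toy_sizes)
```

A set where all 200 structures land in one cluster gives an entropy of zero, and a set spread evenly over K clusters gives the maximum, log K.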
To measure how PDB-like the generated structures are, we run each binder against PDB100 using Foldseek’s structural search and record the TM-score of the nearest hit. TM-score ranges from 0 to 1, where scores above 0.5 indicate the same overall fold and scores above 0.8 indicate near-identical structures.
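As a rough sketch of how the nearest-hit statistic can be aggregated: the code below assumes the search results have already been reduced to (query, target, TM-score) rows. That column layout and the function names are assumptions for illustration; the actual columns depend on how the search output was configured.

```python
def nearest_hit_tm(rows):
    """Return the best TM-score per query from (query, target, tm) rows."""
    best = {}
    for query, _target, tm in rows:
        tm = float(tm)
        if tm > best.get(query, float("-inf")):
            best[query] = tm
    return best


def mean_nearest_tm(rows):
    """Mean over queries of the TM-score to the closest hit.

    Queries with no hits at all do not appear in `rows` and are therefore
    excluded from the mean; that is a simplification of this sketch.
    """
    best = nearest_hit_tm(rows)
    return sum(best.values()) / len(best)


# Toy rows: binder_1 has two hits, binder_2 has one.
toy_rows = [
    ("binder_1", "pdb_A", "0.91"),
    ("binder_1", "pdb_B", "0.62"),
    ("binder_2", "pdb_C", "0.45"),
]
toy_best = nearest_hit_tm(toy_rows)
toy_mean = mean_nearest_tm(toy_rows)
```

Taking the maximum per binder before averaging matters: averaging over all hits would mix in distant matches and understate how close each design sits to its nearest PDB neighbor.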
Results
PDL1 - Clones all the way down
On PDL1, base BoltzGen produced 200 binders that Foldseek grouped into a single cluster, giving a Shannon entropy of 0.0. The mean TM-score to the nearest PDB hit was 0.946. The model had already memorized the answer before any RL ever happened.
One plausible explanation is overrepresentation: PDL1 is one of the most studied drug targets in structural biology, and drugs such as atezolizumab and durvalumab have crystal structures in PDB. This is consistent with our hypothesis that when the training data is saturated with examples for a target, the model struggles to generate anything else.
For less PDB-saturated targets, RL induces measurable collapse.
On KRAS, an oncology target with far fewer known binders in PDB, base BoltzGen generates a wide range of binders: 29 clusters from 200 structures. After RL finetuning, this collapses to 10 clusters, a 68% reduction in structural entropy. We see the same pattern on CCL2, where the RL checkpoint generates a less diverse set of structures.
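A quick sanity check on the KRAS numbers, using only the reported cluster counts and the 68% figure. The per-cluster sizes are not reproduced here, so this sketch bounds the entropies rather than reconstructing them; it relies on the fact that entropy over K clusters is at most log K, reached when the clusters are equally sized.

```python
import math

base_clusters = 29          # base BoltzGen on KRAS
rl_clusters = 10            # after RL finetuning
reported_reduction = 0.68   # reported relative drop in entropy

# Upper bounds: entropy over K clusters cannot exceed log(K).
base_max = math.log(base_clusters)
rl_max = math.log(rl_clusters)

# Drop in entropy if both sets were spread evenly over their clusters.
count_only_reduction = 1 - rl_max / base_max

# Ceiling on the RL entropy implied by the reported reduction.
rl_entropy_ceiling = (1 - reported_reduction) * base_max

# Fraction of the even-spread maximum that the RL set can reach at most.
rl_fraction_of_max = rl_entropy_ceiling / rl_max
```

Going from 29 to 10 evenly populated clusters would lower entropy by only about 32%. A 68% reduction therefore implies that the RL set is also unevenly distributed across its 10 clusters, reaching less than half of the entropy an even spread would give. Both ratios are independent of the logarithm base.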
On EGFR, finetuning increases structural diversity relative to base BoltzGen (entropy 1.06 → 2.14), before RL collapses it back down. The mean TM-score to the nearest PDB hit actually decreases after RL on EGFR, as it does on PDL1. Finetuning breaks the base model’s memorized mode and recovers diversity, before RL imposes a new, different form of collapse. Whether this recovered diversity is meaningful or simply a different memorized mode is an open question we can’t resolve without experimental validation.
Conclusion
For targets heavily represented in PDB, the base model is already collapsed: RL can’t push it much further and may, at most, slightly break the memorized mode. It is possible that running RL over longer horizons would change this. For less-represented targets (KRAS, CCL2), the base model does seem to have diversity, and RL destroys it. This runs counter to what we would want RL to do. The reward signal, Protenix iPTM, is itself PDB-trained, so it reinforces structures that resemble PDB proteins, and the severity of collapse tracks how well the target is represented in that training data.
This may have a grim therapeutic implication for hard-to-drug targets. KRAS is one of the most clinically important undruggable oncology targets because its shallow binding pocket requires non-helical contacts: beta-strand mimetics, cyclic peptides, and designed loops that can engage the flat RAS surface [1]. An RL-finetuned model that has collapsed toward helical bundle solutions is less likely to design binders specific to KRAS and more likely to generate its favorite motif and hope it sticks.
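Whether RL is worth using likely depends on how well your target is represented in the data your model was trained on. For heavily represented targets, the base model has little diversity left for RL to lose; for poorly represented targets, RL comes at a real cost in structural diversity.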
What comes next
It seems evident, then, that the next task is either to harness reward signals that are qualitatively different from yet another PDB-derived term in the loss function, or to look at entropy regularizers. As Nick Boyd discusses, one alternative is to explore physics-based reward signals (e.g., Rosetta energy). Whether integrating such a signal into the RL loop resolves the mode collapse is what I want to explore next.
Thanks to Brian Naughton and Dr Aaron Ring for feedback on an early version of this post.




