Reinforce Adjoint Matching

Scaling RL Post-Training of Diffusion and Flow-Matching Models

Andreas Bergmeister1 Stefanie Jegelka1,2 Nikolas Nüsken3 Carles Domingo-Enrich4,† Jakiw Pidstrigach5,†
1TU Munich, MCML    2MIT CSAIL    3King's College London    4Microsoft Research New England    5University of Oxford
† Joint last authors
Stable Diffusion 3.5M base (left) vs the same model post-trained with RAM (right). Prompt: a photo of a red zebra.
0 reward gradients
0 SDE rollouts
0 reward hacking
50× fewer training steps

Two ways to learn

There are two ways to acquire a skill. The first is learning from examples: collect samples of others doing the thing and train a model to match the distribution of those samples. The second is learning from outcomes: try something, score the result, and adjust based on the score.

The two regimes have very different ingredients. Learning from examples needs a dataset of good examples. Learning from outcomes needs only a way to score candidate outputs. But it has a sharp catch. If the learner’s attempts are random to begin with, nearly every output is bad and scores carry no useful signal. Learning from outcomes is hopeless without a strong prior that focuses exploration on plausible candidates.

This is how humans learn most skills. We start by copying others, then refine through practice and feedback. Pure imitation never gets you past your teachers; pure trial-and-error never gets off the ground. The same recipe powers today’s reasoning LLMs. They are pretrained on the web, then post-trained with RL against verifier and preference rewards.

Continuous generative modeling

The same recipe should apply to continuous generative modeling. The scope is even broader than in the discrete case. Beyond images, video, and 3D, continuous generative models now drive protein structure prediction and design, trajectory generation in robotics, and world models for simulation and planning.

Diffusion and flow-matching models reduce continuous generative modeling to the simplest supervised learning problem: regression. This mirrors how autoregressive models reduce discrete sequence modeling to classification.

Pretraining a diffusion or flow-matching model is extremely well understood and scales effortlessly: a clean sample is corrupted by Gaussian noise, and a model regresses against a closed-form target. That is the entire training procedure. Post-training, however, has lacked a comparably clean algorithm. Existing methods rely on costly SDE rollouts, reward gradients, or surrogate losses, sacrificing pretraining’s regression structure.
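For reference, that pretraining regression fits in a few lines. The sketch below uses the linear interpolation $X_t = (1-t)\,X_0 + t\,\epsilon$ that also appears later in this post; the placeholder model and the uniform draw of $t$ are illustrative choices, not a prescription.

import torch

def pretraining_loss(model, x0):                   # x0: a batch of clean samples
    t   = torch.rand(x0.shape[0], *([1] * (x0.dim() - 1)))  # random t in (0, 1), broadcastable
    eps = torch.randn_like(x0)                     # Gaussian noise
    xt  = (1 - t) * x0 + t * eps                   # corrupt the clean sample
    target = eps - x0                              # closed-form regression target
    return ((model(xt, t) - target) ** 2).mean()   # plain regression loss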

We show that the regression structure of pretraining extends to RL post-training.

Setting up post-training

We have a pretrained diffusion or flow-matching model. In the simplest and now standard case, it is parameterized by a velocity field $v^{\mathrm{ref}}$. We produce samples $X_0 \sim p^{\mathrm{ref}}$ by numerically integrating $v^{\mathrm{ref}}$, starting from Gaussian noise. We also have a scalar reward $r$ that scores samples.
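Concretely, that integration can be as simple as an Euler scheme run from $t = 1$ (pure noise) down to $t = 0$ (data), matching the time convention used below; the step count and the Euler discretization here are illustrative, not the specific sampler we use.

import torch

def euler_ode_sample(v_ref, shape, n_steps=50):
    x  = torch.randn(shape)                        # start from pure noise at t = 1
    ts = torch.linspace(1.0, 0.0, n_steps + 1)     # descending time grid
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        x = x + (t_next - t_cur) * v_ref(x, t_cur) # Euler step toward t = 0
    return x                                       # approximate sample X_0 ~ p_ref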

The canonical post-training objective is to maximize expected reward while staying close to the pretrained model in a KL sense,

\[\mathbb E_{x \sim p}[r(x)] \;-\; D_{\mathrm{KL}}(p \,\|\, p^{\mathrm{ref}}).\]

The optimum is proportional to $p^{\mathrm{ref}}(x)\exp(r(x))$ and tilts the pretrained distribution toward high-reward samples. We have the pretrained model and the reward, but no samples from this tilted distribution.
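To see why the optimum takes this form, set $Z = \int p^{\mathrm{ref}}(x)\, e^{r(x)}\, dx$ and complete the objective to a KL divergence,

\[\mathbb E_{x \sim p}[r(x)] \;-\; D_{\mathrm{KL}}(p \,\|\, p^{\mathrm{ref}}) \;=\; \log Z \;-\; D_{\mathrm{KL}}\!\Bigl(p \,\Big\|\, \tfrac{1}{Z}\, p^{\mathrm{ref}}\, e^{r}\Bigr),\]

which is maximized exactly when $p(x) = \tfrac{1}{Z}\, p^{\mathrm{ref}}(x)\, e^{r(x)}$.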

Stochastic optimal control casts post-training as a problem on the velocity field: find $v^\ast$ that generates the tilted distribution.

The RAM regression target

From the stochastic optimal control problem we derive a fixed-point equation for the optimal velocity field $v^\ast$. The equation expresses $v^\ast$ at $(t, X_t)$ as a conditional expectation of a target built from an on-policy endpoint $X_0$, its reward $r(X_0)$, and the noise $\epsilon$ that connects $X_0$ to $X_t$. The optimum has a structural property that makes this expectation cheap to estimate: the conditional law $X_t \mid X_0 \sim \mathcal N\bigl((1-t)\,X_0,\, t^2 I\bigr)$ is the same as in pretraining; only the marginal of $X_0$ changes.
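Read off from the loss below, the fixed point takes the form

\[v^\ast_t(x) \;=\; \mathbb E\Bigl[\, v^{\mathrm{ref}}_t(X_t) \;+\; r(X_0)\,\bigl((\epsilon - X_0) - v^\ast_t(X_t)\bigr) \,\Big|\, X_t = x \,\Bigr].\]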

So we sample $X_0$ from the current model with any off-the-shelf sampler, sample $\epsilon \sim \mathcal N(0, I)$ and $t \in (0, 1)$ independently, and form $X_t = (1-t)\,X_0 + t\,\epsilon$. RAM minimizes the consistency loss

\[\mathcal L_{\mathrm{RAM}}(\theta) \;=\; \mathbb E\left[\,\Bigl\|v^\theta_t(X_t) \;-\; \mathrm{sg}\left(v^{\mathrm{ref}}_t(X_t) \;+\; r(X_0)\,\bigl((\epsilon - X_0) - v^\theta_t(X_t)\bigr)\right)\Bigr\|^2\,\right],\]

where $\mathrm{sg}(\cdot)$ is stop-gradient. The residual $(\epsilon - X_0) - v^\theta_t(X_t)$ is the difference between the pretraining target and the model’s current prediction at $X_t$. High reward pulls the model toward the pretraining target; zero reward leaves it at the reference; low reward pushes it away. The reward enters only as a scalar, so $\nabla r$ is never used. And because $v^{\mathrm{ref}}$ stays in the target throughout training, the model is anchored to the pretrained reference, which is what prevents reward hacking.

Each on-policy endpoint is reused across many $(\epsilon, t)$ draws, amortizing the cost of ODE sampling and reward evaluation. The resulting training states are conditionally independent given the endpoint. That gives more gradient signal per step than SDE-based methods, which produce correlated states from a single rollout.

A training step in code

x0 = ode_sample(model)                     # on-policy endpoint
r  = reward_fn(x0)                         # scalar reward (no grad)

for _ in range(K):                         # K draws per endpoint; runs in parallel in practice
    t   = sample_t()                       # random t in (0, 1)
    eps = randn_like(x0)                   # Gaussian noise
    xt  = (1 - t) * x0 + t * eps           # analytic noising, as in pretraining
    v   = model(xt, t)                     # current velocity prediction (one forward pass)
    v_hat = model_ref(xt, t) + r * ((eps - x0) - v)
    loss  = (v - v_hat.detach()) ** 2      # stop-gradient on the target
    loss.mean().backward()                 # gradients accumulate over the K draws

Post-training Stable Diffusion 3.5M

We post-train Stable Diffusion 3.5M on three tasks: compositional generation (GenEval), visual text rendering (OCR), and human-preference alignment (PickScore). RAM achieves the highest reward on each task, without reward hacking or visible quality degradation. It matches Flow-GRPO’s peak reward in 50×, 48×, and 34× fewer training steps, at lower per-step compute cost.

Compositional Image Generation
Generate images that satisfy complex prompts: the right objects, with the right attributes, in the right counts and spatial relations. Scored by the GenEval verification pipeline.
SD3.5M vs RAM. Prompts: "a photo of three fire hydrants"; "a photo of a truck left of a refrigerator".
Visual Text Rendering
Render text into the scene that is legible and matches the prompt. The reward is how accurately an OCR model reads the requested string from the generated image.
SD3.5M vs RAM. Prompts: zoo sign reading "Lion Habitat Zone"; medieval shield reading "Defender Of The Realm".
Human Preference Alignment
Produce images that humans prefer for a given prompt. Scored by PickScore, a preference model trained on a large dataset of human pairwise comparisons.
SD3.5M vs RAM. Prompts: "a Ferrari car made out of wood"; "cyberpunk Casablanca, futuristic street".

The simple regression that scales pretraining now scales post-training too.