HY-SOAR (Self-Correction for Optimal Alignment and Refinement) is a reward-free post-training method for rectified-flow diffusion models. It targets exposure bias in the denoising trajectory: standard SFT trains the denoiser on ideal forward-noising states from real data, while inference conditions on states produced by the model's own earlier predictions. Once an early denoising step drifts, later steps must recover from states that were not directly optimized, so errors can compound across the trajectory.
Instead of waiting for a terminal reward after a full rollout, SOAR teaches the model to correct its own trajectory errors at the timestep where they occur. Given a clean latent, a noise endpoint, and a condition, SOAR: (1) samples an on-trajectory noisy state and performs one stop-gradient CFG rollout step with the current model; (2) re-noises the resulting off-trajectory state toward the same noise endpoint to create auxiliary states; (3) supervises the denoiser with the analytical correction target. This gives SOAR an on-policy, dense, and reward-free training signal.
SOAR provides a principled, reward-free approach to trajectory-level correction
Directly addresses the mismatch between ground-truth training states and model-induced inference states — the root cause of compounding denoising failures.
Off-trajectory states are produced by the current model's own rollout, so the training distribution co-evolves with the model instead of staying fixed.
Requires no reward model, preference labels, or negative samples. Provides per-timestep correction supervision and avoids terminal-reward credit assignment.
Re-noising uses the same noise endpoint as the base flow-matching pair, keeping auxiliary states near the original transport ray.
The SOAR loss extends the standard flow-matching objective and can replace SFT as a stronger first stage, remaining compatible with later RL alignment.
Visual comparisons of SOAR vs Flow-GRPO vs SFT across different reward objectives
Comparison across training steps, optimizing for aesthetic quality on diverse prompts — historical scenes, fantasy art, and character portraits.
Comparison on design and poster generation prompts, optimizing for text-image alignment. SOAR demonstrates stronger text rendering and compositional fidelity.
SOAR results on web UI and graphic design generation, showing accurate layout, typography, and visual hierarchy.
Main results on DrawBench and GenEval/OCR test sets
Following Flow-GRPO, we evaluate image quality and human preference scores on DrawBench prompts, and task-specific metrics on the GenEval/OCR test sets. All models are trained at 512×512 with cfg=4.5.
| Model | #Iter | GenEval | OCR | PickScore | ClipScore | HPSv2.1 | Aesthetic | ImgRwd |
|---|---|---|---|---|---|---|---|---|
| SD-XL (1024²) | – | 0.55 | 0.14 | 22.42 | 0.287 | 0.280 | 5.60 | 0.76 |
| SD3.5-L (1024²) | – | 0.71 | 0.68 | 22.91 | 0.289 | 0.288 | 5.50 | 0.96 |
| FLUX.1-Dev | – | 0.66 | 0.59 | 22.84 | 0.295 | 0.274 | 5.71 | 0.96 |
| SD3.5-M | – | 0.63 | 0.59 | 22.34 | 0.285 | 0.279 | 5.36 | 0.85 |
| + SFT | 10k | 0.70 | 0.64 | 22.71 | 0.295 | 0.284 | 5.35 | 1.04 |
| + SOAR (Ours) | 10k | 0.78 | 0.67 | 22.86 | 0.295 | 0.289 | 5.46 | 1.09 |
SOAR raises SD3.5-Medium's GenEval score from 0.70 to 0.78 (+11% relative) and OCR accuracy from 0.64 to 0.67, while simultaneously improving every DrawBench quality and preference metric — all without any reward model during training.
In head-to-head comparisons on DrawBench Aesthetic Score and ClipScore, SOAR's final scores not only surpass SFT but also outperform Flow-GRPO, which explicitly uses these metrics as its reward signal (Aesthetic: 5.94 vs. SFT 5.74 / Flow-GRPO 5.87; ClipScore: 0.300 vs. SFT 0.297 / Flow-GRPO 0.296).
If you find SOAR useful in your research, please cite our work