HY-SOAR: Self-Correction for Optimal Alignment
and Refinement in Diffusion Models

Tencent HY Team

Code is now available on GitHub.

📖Abstract

HY-SOAR (Self-Correction for Optimal Alignment and Refinement) is a reward-free post-training method for rectified-flow diffusion models. It targets exposure bias in the denoising trajectory: standard SFT trains the denoiser on ideal forward-noising states from real data, while inference conditions on states produced by the model's own earlier predictions. Once an early denoising step drifts, later steps must recover from states that were not directly optimized, so errors can compound across the trajectory.
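The training/inference mismatch can be made concrete with a toy 1-D rectified flow. The sketch below is purely illustrative (it assumes the standard linear interpolant x_t = (1−t)·x0 + t·ε and Euler sampling; all function names are ours, not from the SOAR codebase): training sees states from the ideal forward interpolant, while inference builds each state from the model's own previous prediction, so a small per-step prediction error accumulates into a visible offset at the end of the trajectory.

```python
import torch

def forward_noise(x0, eps, t):
    # Training: states come from the IDEAL forward interpolant
    # x_t = (1 - t) * x0 + t * eps (teacher forcing).
    return (1 - t) * x0 + t * eps

def rollout(model, eps, steps=10):
    # Inference: each state is built from the model's OWN prior output,
    # so per-step prediction errors accumulate along the trajectory.
    x, dt = eps.clone(), 1.0 / steps
    for i in range(steps, 0, -1):
        t = torch.full((x.shape[0],), i * dt)
        x = x - dt * model(x, t)  # Euler step with the predicted velocity
    return x
```

With the exact straight-line velocity ε − x0 the rollout recovers x0; adding a small constant bias to every prediction shifts the final sample by the sum of the per-step errors, which SFT alone never trains the model to undo.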


Instead of waiting for a terminal reward after a full rollout, SOAR teaches the model to correct its own trajectory errors at the timestep where they occur. Given a clean latent, a noise endpoint, and a condition, SOAR: (1) samples an on-trajectory noisy state and performs one stop-gradient CFG rollout step with the current model; (2) re-noises the resulting off-trajectory state toward the same noise endpoint to create auxiliary states; (3) supervises the denoiser with the analytical correction target. This gives SOAR an on-policy, dense, and reward-free training signal.
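The three steps above can be sketched as a single loss function. This is a minimal reconstruction from the description, not the official implementation: we assume the linear rectified-flow interpolant x_t = (1−t)·x0 + t·ε, a velocity-prediction model called as `model(x, t, cond)` with a zero tensor standing in for the null condition, and we take "analytical correction target" to mean the straight-line velocity (x_u − x0)/u that transports the auxiliary state back to the true clean latent (at on-trajectory states this reduces to the usual ε − x0 target). All names and the choice of auxiliary-time sampling are ours.

```python
import torch
import torch.nn.functional as F

def soar_loss(model, x0, eps, cond, t, dt=0.05, cfg_scale=4.5):
    """Sketch of a SOAR-style per-timestep objective (illustrative only)."""
    t_ = t.view(-1, 1)
    # (1) on-trajectory noisy state from the forward interpolant
    x_t = (1 - t_) * x0 + t_ * eps

    # one stop-gradient CFG rollout step with the current model
    with torch.no_grad():
        v_c = model(x_t, t, cond)
        v_u = model(x_t, t, torch.zeros_like(cond))  # null condition
        v_cfg = v_u + cfg_scale * (v_c - v_u)
        s_ = t_ - dt
        x_s = x_t - dt * v_cfg  # off-trajectory state at time s = t - dt

    # (2) re-noise toward the SAME noise endpoint eps: recover the clean
    # latent implied by x_s, then form an auxiliary state at a time u > s
    x0_hat = (x_s - s_ * eps) / (1 - s_)
    u = s_ + torch.rand_like(s_) * (1 - s_)
    x_u = (1 - u) * x0_hat + u * eps

    # (3) supervise with the analytical correction target: the velocity
    # that carries x_u back to the TRUE clean latent x0
    v_target = (x_u - x0) / u.clamp(min=1e-4)
    v_pred = model(x_u, u.squeeze(-1), cond)
    return F.mse_loss(v_pred, v_target)
```

Because the rollout step is wrapped in `torch.no_grad()`, gradients flow only through the correction prediction at the auxiliary state, matching the stop-gradient described above.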

Key Features

SOAR provides a principled, reward-free approach to trajectory-level correction

🧭

Exposure-Bias Correction

Directly addresses the mismatch between ground-truth training states and model-induced inference states — the root cause of compounding denoising failures.

🔁

On-Policy Off-Trajectory Supervision

Off-trajectory states are produced by the current model's own rollout, so the training distribution co-evolves with the model instead of staying fixed.

🎯

Reward-Free Dense Objective

Requires no reward model, preference labels, or negative samples. Provides per-timestep correction supervision and avoids terminal-reward credit assignment.

📐

Geometric Correction Target

Re-noising uses the same noise endpoint as the base flow-matching pair, keeping auxiliary states near the original transport ray.

🔧

Compatible Post-Training Stage

The SOAR loss extends the standard flow-matching objective and can replace SFT as a stronger first stage, remaining compatible with later RL alignment.

🖼Showcases

Visual comparisons of SOAR vs Flow-GRPO vs SFT across different reward objectives

Showcase 1: Aesthetic Reward Optimization

Comparison across training steps, optimizing for aesthetic quality on diverse prompts — historical scenes, fantasy art, and character portraits.

Aesthetic Showcase

Showcase 2: CLIPScore Reward Optimization

Comparison on design and poster generation prompts, optimizing for text-image alignment. SOAR demonstrates stronger text rendering and compositional fidelity.

CLIPScore Showcase

Showcase 3: WebUI / Design Generation

SOAR results on web UI and graphic design generation, showing accurate layout, typography, and visual hierarchy.

WebUI Showcase

📊Evaluation

Main results on DrawBench and GenEval/OCR test sets

Following Flow-GRPO, we evaluate image quality and human preference scores on DrawBench prompts, and task-specific metrics on the GenEval/OCR test sets. All models are trained at 512×512 with cfg=4.5.

| Model | #Iter | GenEval | OCR | PickScore | ClipScore | HPSv2.1 | Aesthetic | ImgRwd |
|---|---|---|---|---|---|---|---|---|
| SD-XL (1024²) | – | 0.55 | 0.14 | 22.42 | 0.287 | 0.280 | 5.60 | 0.76 |
| SD3.5-L (1024²) | – | 0.71 | 0.68 | 22.91 | 0.289 | 0.288 | 5.50 | 0.96 |
| FLUX.1-Dev | – | 0.66 | 0.59 | 22.84 | 0.295 | 0.274 | 5.71 | 0.96 |
| SD3.5-M | – | 0.63 | 0.59 | 22.34 | 0.285 | 0.279 | 5.36 | 0.85 |
| + SFT | 10k | 0.70 | 0.64 | 22.71 | 0.295 | 0.284 | 5.35 | 1.04 |
| + SOAR (Ours) | 10k | **0.78** | **0.67** | 22.86 | 0.295 | **0.289** | 5.46 | **1.09** |

SOAR raises SD3.5-Medium's GenEval score from 0.70 to 0.78 (+11% relative) and OCR accuracy from 0.64 to 0.67, while simultaneously improving every DrawBench quality and preference metric — all without any reward model during training.

Reward-Specific Training Dynamics

In head-to-head comparisons on DrawBench Aesthetic Score and ClipScore, SOAR's final scores not only surpass SFT but also outperform Flow-GRPO, which explicitly uses these metrics as its reward signal (Aesthetic: 5.94 vs. SFT 5.74 / Flow-GRPO 5.87; ClipScore: 0.300 vs. SFT 0.297 / Flow-GRPO 0.296).

Reward Curves

📚Citation

If you find SOAR useful in your research, please cite our work:

@article{hy-soar,
  title={SOAR: Self-Correction for Optimal Alignment and Refinement in Diffusion Models},
  author={Qin, You and Wang, Linqing and Fei, Hao and Zimmermann, Roger and Bo, Liefeng and Lu, Qinglin and Wang, Chunyu},
  journal={arXiv preprint arXiv:2604.12617},
  year={2026},
  eprint={2604.12617},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}