HY-SOAR: Self-Correction for Optimal Alignment
and Refinement in Diffusion Models

Tencent HY Team
Figure: HY-SOAR method overview

We propose HY-SOAR, a scalable, reward-free post-training framework for trajectory-level self-correction in rectified-flow diffusion models. It targets exposure bias in the denoising trajectory: it samples on-trajectory noisy states, performs one stop-gradient CFG rollout with the current model, re-noises the resulting off-trajectory states toward the same noise endpoint, and supervises the denoiser with analytical correction targets. The result is an on-policy, dense, and reward-free training signal.
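The procedure described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the rectified-flow convention x_t = (1 - t) x0 + t eps, a single Euler step for the rollout, linear interpolation for re-noising, and a correction target equal to the straight-line velocity from the re-noised state back to the clean sample. The function names, the step size `dt`, and the re-noising weight `lam` are hypothetical. Plain Python lists stand in for tensors so the sketch runs without dependencies.

```python
def interpolate(x0, eps, t):
    """Rectified-flow state x_t = (1 - t) * x0 + t * eps (t=0 is data, t=1 is noise)."""
    return [(1.0 - t) * a + t * e for a, e in zip(x0, eps)]


def cfg_velocity(model, x, t, cond, scale):
    """Classifier-free guidance: v_uncond + scale * (v_cond - v_uncond)."""
    v_cond = model(x, t, cond)
    v_uncond = model(x, t, None)
    return [u + scale * (c - u) for c, u in zip(v_cond, v_uncond)]


def soar_training_pair(model, x0, eps, cond, t, dt, lam, cfg_scale=4.5):
    """Build one (state, time, target) triple; all formulas here are assumptions."""
    # 1. On-trajectory noisy state from the base flow-matching pair (x0, eps).
    x_t = interpolate(x0, eps, t)

    # 2. One CFG Euler step toward the data with the current model.
    #    In an autograd framework this step would run under stop-gradient.
    v = cfg_velocity(model, x_t, t, cond, cfg_scale)
    t_next = t - dt
    x_hat = [x - dt * vi for x, vi in zip(x_t, v)]  # off-trajectory state

    # 3. Re-noise toward the SAME noise endpoint eps (assumed linear interpolation).
    s = t_next + lam * (1.0 - t_next)
    x_tilde = [(1.0 - lam) * xh + lam * e for xh, e in zip(x_hat, eps)]

    # 4. Assumed analytical target: the constant velocity that carries
    #    x_tilde at time s back to x0 at time 0.
    target = [(xt - a) / s for xt, a in zip(x_tilde, x0)]
    return x_tilde, s, target


def mse(pred, target):
    """Mean squared error used to supervise the denoiser at (x_tilde, s)."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)
```

Under these assumptions, a model whose rollout stays exactly on the transport ray gets the ordinary flow-matching target eps - x0, so the correction term only differs from standard training when the rollout drifts.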

Key Features

SOAR provides a principled, reward-free approach to trajectory-level correction

🧭

Exposure-Bias Correction

Directly addresses the mismatch between ground-truth training states and model-induced inference states, which is the root cause of compounding denoising failures.

🔁

On-Policy Off-Trajectory Supervision

Off-trajectory states are produced by the current model's own rollout, so the training distribution co-evolves with the model instead of staying fixed.

🎯

Reward-Free Dense Objective

Requires no reward model, preference labels, or negative samples. Provides per-timestep correction supervision and avoids terminal-reward credit assignment.

📐

Geometric Correction Target

Re-noising uses the same noise endpoint as the base flow-matching pair, keeping auxiliary states near the original transport ray.
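A short derivation illustrates why sharing the noise endpoint keeps auxiliary states close to the ray. It is a sketch under assumptions not stated on this page: the interpolation convention x_t = (1 - t) x_0 + t ε, a rollout state written as the on-ray state plus an error δ, and re-noising as linear interpolation toward ε with a hypothetical weight λ in [0, 1].

```latex
\begin{aligned}
x_{t'} &= (1-t')\,x_0 + t'\,\epsilon, \qquad \hat{x}_{t'} = x_{t'} + \delta,\\
\tilde{x}_s &= (1-\lambda)\,\hat{x}_{t'} + \lambda\,\epsilon, \qquad s = t' + \lambda\,(1-t'),\\
\tilde{x}_s &= (1-s)\,x_0 + s\,\epsilon + (1-\lambda)\,\delta \;=\; x_s + \frac{1-s}{1-t'}\,\delta .
\end{aligned}
```

Under these assumptions the re-noised state sits at the on-ray state x_s plus the rollout error shrunk by the factor (1 - s)/(1 - t'), so the deviation from the ray never grows and vanishes as s approaches 1.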

🔧

Compatible Post-Training Stage

The SOAR loss extends the standard flow-matching objective and can replace SFT as a stronger first stage, remaining compatible with later RL alignment.

🖼 Showcases

Visual comparisons of SOAR vs Flow-GRPO vs SFT across different reward objectives

Showcase 1: Aesthetic Reward Optimization

Comparison across training steps, optimizing for aesthetic quality on diverse prompts — historical scenes, fantasy art, and character portraits.

Figure: Aesthetic Showcase

Showcase 2: CLIPScore Reward Optimization

Comparison on design and poster generation prompts, optimizing for text-image alignment. SOAR demonstrates stronger text rendering and compositional fidelity.

Figure: CLIPScore Showcase

Showcase 3: WebUI / Design Generation

SOAR results on web UI and graphic design generation, showing accurate layout, typography, and visual hierarchy.

Figure: WebUI Showcase

📊 Evaluation

Main results on DrawBench and GenEval/OCR test sets

Following Flow-GRPO, we evaluate image quality and human preference scores on DrawBench prompts, and task-specific metrics on the GenEval/OCR test sets. All models are trained at 512×512 with a classifier-free guidance (CFG) scale of 4.5.

Model               #Iter  GenEval  OCR   PickScore  ClipScore  HPSv2.1  Aesthetic  ImgRwd
SD-XL (1024²)       -      0.55     0.14  22.42      0.287      0.280    5.60       0.76
SD3.5-L (1024²)     -      0.71     0.68  22.91      0.289      0.288    5.50       0.96
FLUX.1-Dev          -      0.66     0.59  22.84      0.295      0.274    5.71       0.96
SD3.5-M             -      0.63     0.59  22.34      0.285      0.279    5.36       0.85
+ SFT               10k    0.70     0.64  22.71      0.295      0.284    5.35       1.04
+ SOAR (Ours)       10k    0.78     0.67  22.86      0.295      0.289    5.46       1.09

Compared with SFT on SD3.5-Medium at the same 10k iterations, SOAR raises the GenEval score from 0.70 to 0.78 (+11% relative) and OCR accuracy from 0.64 to 0.67. It matches SFT on ClipScore (0.295) and improves every other DrawBench quality and preference metric, all without any reward model during training.
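The relative gains quoted above can be checked directly from the table. The snippet below is a small arithmetic sketch; the helper name `relative_gain` is illustrative, and the baseline is taken to be the "+ SFT" row, since that is where the 0.70 and 0.64 figures appear.

```python
def relative_gain(baseline, new):
    """Relative improvement of `new` over `baseline`, as a fraction."""
    return (new - baseline) / baseline


# Values copied from the "+ SFT" and "+ SOAR (Ours)" rows of the results table.
sft = {"GenEval": 0.70, "OCR": 0.64}
soar = {"GenEval": 0.78, "OCR": 0.67}

for metric in sft:
    gain = relative_gain(sft[metric], soar[metric])
    print(f"{metric}: {sft[metric]:.2f} -> {soar[metric]:.2f} ({gain:+.1%})")
```

The GenEval gain works out to roughly 11.4%, which matches the rounded +11% figure.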

Reward-Specific Training Dynamics

In head-to-head comparisons on DrawBench Aesthetic Score and ClipScore, SOAR's final scores not only surpass SFT but also outperform Flow-GRPO, which explicitly uses these metrics as its reward signal (Aesthetic: 5.94 vs. SFT 5.74 / Flow-GRPO 5.87; ClipScore: 0.300 vs. SFT 0.297 / Flow-GRPO 0.296).

Figure: Reward Curves

📚 Citation

If you find SOAR useful in your research, please cite our work

@article{hy-soar,
  title={SOAR: Self-Correction for Optimal Alignment and Refinement in Diffusion Models},
  author={Qin, You and Wang, Linqing and Fei, Hao and Zimmermann, Roger and Bo, Liefeng and Lu, Qinglin and Wang, Chunyu},
  journal={arXiv preprint arXiv:2604.12617},
  year={2026},
  eprint={2604.12617},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}