Training Diffusion Transformers with Muon

Muon¹ allows us to train LLMs faster, requiring up to 50% less compute to reach a given loss². By now, Muon has gone beyond a research experiment and established itself in large-scale production training runs, such as DeepSeek-V4³.

For diffusion transformers (DiT), the picture is less clear. Although several papers this year have started using Muon instead of Adam⁴,⁵, systematic comparisons between the two optimizers are lacking. This post fills that gap with experiments on ImageNet 256$\times$256: I first tune the Adam baseline learning rate, then evaluate two strategies for tuning Muon, and finally scale the best configurations for each optimizer to a 465M parameter model.

To keep the experiments self-contained without relying on external pretrained models, I will not use a VAE but instead directly predict in pixel space⁶.

Setup & Baseline

For all experiments, I use a modernized version of the DiT architecture⁷ with AdaLN for timestep and class conditioning, SwiGLU activations, RoPE, QK-Norm, no bias parameters, and a patch size of 16. I train the baseline models using Adam with $\beta_1=0.9$, $\beta_2=0.95$, and no weight decay. All models are trained with a batch size of 256. All FID scores are computed without classifier-free guidance.

The target model configuration is a 465M-parameter DiT-L with 24 layers and a width of 1024. Sweeping the learning rate on this model would be too computationally expensive. Instead, I sweep on a smaller 31M-parameter proxy model with a width of 256 and the same number of layers, then use the maximal update parametrization (muP)⁸ to transfer the best-performing learning rate to the large target model.

I sweep the learning rate over 2e-4, 4e-4, and 8e-4. The model achieves a significantly lower FID at 4e-4 than at 2e-4, but increasing the learning rate further to 8e-4 makes training unstable.

Adam learning rate sweep on the small proxy model.
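As a rough illustration of the width transfer, the sketch below applies the muP rule for Adam under which the learning rate of hidden weight matrices shrinks in proportion to 1/width. The function name and the restriction to hidden matrices are assumptions made for this example; the actual training code may treat individual parameter groups differently.

```python
def mup_adam_hidden_lr(proxy_lr: float, proxy_width: int, target_width: int) -> float:
    """Scale an Adam learning rate tuned at proxy_width to target_width.

    Under muP, the Adam learning rate of hidden weight matrices is
    proportional to 1 / width. (Illustrative helper, not the post's code.)
    """
    return proxy_lr * proxy_width / target_width


# Best proxy learning rate from the sweep (width 256), moved to width 1024.
target_lr = mup_adam_hidden_lr(4e-4, proxy_width=256, target_width=1024)
print(f"{target_lr:.1e}")  # 1.0e-04
```

Under this assumption, the 4x increase in width corresponds to a 4x smaller learning rate for the hidden matrices.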

Introducing Muon

Let’s start with a quick recap of Muon. The key difference between Muon and Adam is that Muon orthogonalizes each update matrix: given the singular value decomposition $U \Sigma V^\top$ of the update, it keeps $U V^\top$, which sets all singular values to 1. Computing a full singular value decomposition at every step would massively slow down training, so Muon approximates the result with the much faster iterative Newton-Schulz algorithm. A common theory for why orthogonalization helps is that Adam’s updates tend to be dominated by a few large singular directions, while orthogonalization equalizes the singular values and thus spreads the update energy more evenly across all directions.
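The iteration can be sketched in a few lines of NumPy. This is a minimal illustration, not the optimized kernel used in the experiments: the quintic coefficients are the ones given in the Muon post¹, while the function name, the default of 5 steps, and the use of NumPy instead of a GPU framework are choices made for this example.

```python
import numpy as np


def newton_schulz_orthogonalize(G: np.ndarray, steps: int = 5, eps: float = 1e-7) -> np.ndarray:
    """Approximate U @ V.T for G = U @ diag(S) @ V.T without computing an SVD."""
    a, b, c = 3.4445, -4.7750, 2.0315  # quintic coefficients from the Muon post
    # Dividing by the Frobenius norm bounds every singular value by 1.
    X = G / (np.linalg.norm(G) + eps)
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T  # work on the wide orientation so X @ X.T is the smaller Gram matrix
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X  # applies a*s + b*s^3 + c*s^5 to each singular value s
    return X.T if transposed else X


rng = np.random.default_rng(0)
G = rng.standard_normal((8, 16))
O = newton_schulz_orthogonalize(G)
singular_values = np.linalg.svd(O, compute_uv=False)
print(singular_values.min(), singular_values.max())
```

With these coefficients the singular values do not converge to exactly 1 but settle in a band around it, which trades precision for fewer iterations.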

I apply Muon only to the hidden layers in the transformer blocks. The patch embedding, output layer, timestep MLP and AdaLN modulation layers are still optimized using Adam.
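One way to express this split is a small routing function over parameter names and tensor ranks. The naming scheme below (a `blocks.` prefix, `adaln` in modulation layers) and all parameter names are hypothetical and serve only to illustrate the rule; they are not taken from the training code.

```python
def assign_optimizer(name: str, ndim: int) -> str:
    """Return "muon" for hidden 2D weights inside transformer blocks, else "adam"."""
    # Hypothetical naming scheme: block parameters start with "blocks.",
    # and AdaLN modulation layers contain "adaln" in their name.
    in_block = name.startswith("blocks.")
    is_modulation = "adaln" in name
    return "muon" if in_block and ndim == 2 and not is_modulation else "adam"


# Hypothetical parameter names mapped to tensor ranks, for illustration only.
params = {
    "patch_embed.weight": 2,
    "timestep_mlp.0.weight": 2,
    "blocks.0.attn.qkv.weight": 2,
    "blocks.0.mlp.up.weight": 2,
    "blocks.0.adaln.weight": 2,
    "blocks.0.attn.q_norm.weight": 1,
    "final_layer.weight": 2,
}
groups = {name: assign_optimizer(name, ndim) for name, ndim in params.items()}
for name, optimizer in groups.items():
    print(f"{name:32s} -> {optimizer}")
```

Non-matrix parameters such as normalization gains fall through to Adam as well, since orthogonalization is only defined for matrices.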

Using Muon, we can choose between two methods for scaling the learning rate $\gamma$. The method originally proposed by Muon’s authors scales the update such that its RMS-to-RMS operator norm is bounded by the learning rate:

$$ \gamma \leftarrow \gamma \sqrt{\frac{\text{fan-out}}{\text{fan-in}}} $$

Although this requires separately tuning the learning rate for Muon, it enables learning rate transfer across model widths without any additional muP learning rate scaling⁹. This means the two optimizers transfer their learning rate to larger models differently: the Adam learning rate tuned on the proxy is transferred via muP, whereas Muon’s original scaling already handles the width transfer intrinsically, so no separate muP scaling is applied. With this original scaling method, Muon requires a larger learning rate than Adam; a common rule of thumb is to start with a rate roughly 10 times higher.
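To see where the factor $\sqrt{\text{fan-out}/\text{fan-in}}$ comes from, here is a short derivation sketch. It assumes the convention $W \in \mathbb{R}^{\text{fan-out} \times \text{fan-in}}$ and treats the Newton-Schulz output $O$ as exactly semi-orthogonal, which holds only approximately in practice. The notation $\|x\|_{\text{RMS}} = \|x\|_2 / \sqrt{d}$ for $x \in \mathbb{R}^d$ is introduced here for the example.

$$ \|W\|_{\text{RMS}\to\text{RMS}} = \max_{x \neq 0} \frac{\|Wx\|_{\text{RMS}}}{\|x\|_{\text{RMS}}} = \sqrt{\frac{\text{fan-in}}{\text{fan-out}}} \, \|W\|_2 $$

$$ \left\| \gamma \sqrt{\frac{\text{fan-out}}{\text{fan-in}}} \, O \right\|_{\text{RMS}\to\text{RMS}} = \gamma \, \|O\|_2 = \gamma $$

Under these assumptions, a given $\gamma$ changes a layer's output by the same RMS amount relative to its input regardless of the layer's shape, which is the intuition behind the width transfer.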

At small scale, Muon’s FID curves were noticeably noisier than Adam’s, so I report Muon results averaged over 3 random seeds; the Adam curves were clean enough that a single seed was sufficient. Averaged over these seeds, Muon outperforms Adam across the entire learning rate sweep. After 400k steps, the best learning rate of 4e-3 yields a 10% lower FID than Adam.

Muon learning rate sweep on the small proxy model using Muon’s original learning rate scaling. The black line shows the FID curve from the best Adam training run.

The alternative method, proposed by Moonshot AI², scales the update so its element-wise RMS matches the RMS of Adam’s update:

$$ \gamma \leftarrow 0.2\gamma \sqrt{\max(\text{fan-in}, \text{fan-out})} $$

This scaling allows us to reuse the learning rate tuned for Adam. The FID after 400k steps is almost on par with the original scaling method. Since the original scaling achieves a slightly better FID, I use it for all further experiments.

Muon with its update RMS matched to Adam’s, evaluated on the small proxy model.
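The RMS-matching rule can be checked numerically. The sketch below uses an exactly semi-orthogonal matrix from a QR decomposition as a stand-in for the Newton-Schulz output (an idealization) and confirms that the scaled update has an element-wise RMS of $0.2\gamma$ for square, tall, and wide layers. Both helper names are made up for this example.

```python
import numpy as np


def moonshot_scaled_lr(lr: float, fan_in: int, fan_out: int) -> float:
    """Effective step size under the RMS-matching rule: 0.2 * lr * sqrt(max dim)."""
    return 0.2 * lr * max(fan_in, fan_out) ** 0.5


def random_semi_orthogonal(fan_out: int, fan_in: int, seed: int = 0) -> np.ndarray:
    """Exactly semi-orthogonal stand-in for an orthogonalized update."""
    rng = np.random.default_rng(seed)
    tall = rng.standard_normal((max(fan_out, fan_in), min(fan_out, fan_in)))
    Q, _ = np.linalg.qr(tall)  # reduced QR: Q has orthonormal columns
    return Q if fan_out >= fan_in else Q.T


lr = 4e-4  # Adam learning rate reused for Muon
for fan_out, fan_in in [(256, 256), (1024, 256), (256, 1024)]:
    O = random_semi_orthogonal(fan_out, fan_in)
    update = moonshot_scaled_lr(lr, fan_in, fan_out) * O
    rms = np.sqrt(np.mean(update**2))
    print(fan_out, fan_in, rms / lr)  # ratio is 0.2 up to floating-point error
```

The check works because a semi-orthogonal matrix has a squared Frobenius norm equal to its smaller dimension, so its element-wise RMS is $1/\sqrt{\max(\text{fan-in}, \text{fan-out})}$.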

Comparison at scale

The advantage of Muon holds at larger scales. With the model width doubled to 512, the Muon-trained model still reaches an FID about 10% lower than the Adam-trained model after 400k steps. On the target 465M-parameter model with a width of 1024, Muon reaches a 15% lower FID than Adam after 200k steps. I did not train the full-sized model further due to a lack of compute budget.

Comparing Muon to Adam using the model with a width of 512.

Comparing Muon to Adam using the DiT-L model.

Training time

Muon reaches a lower FID than Adam in the same number of training steps in these experiments. However, each Muon step is slower due to the orthogonalization, making a step-count comparison unfair. To minimize the overhead, I use Tri Dao’s optimized Gram Newton-Schulz kernel¹⁰.

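At full model size, the per-step overhead of Muon is ~5%. Muon still reaches a lower FID than Adam in the same wall-clock time. The overhead would also shrink further with larger batch sizes, because the cost of the orthogonalization depends only on the weight shapes, while the cost of the forward and backward passes grows with the batch size.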

Training time comparison using a model with a width of 512 trained for 400k steps.

Training time comparison using the DiT-L model trained for 200k steps.

Conclusion

We have seen that Muon significantly outperforms Adam for training diffusion transformers in these experiments. What remains is to evaluate how Muon compares to Adam at billion-parameter scale, and whether Adam eventually catches up with enough training steps.

The training code can be found here: https://github.com/sven-luepke/pixel-dit-muon/tree/main

References

1. Jordan, K. et al. Muon: An optimizer for hidden layers in neural networks. https://kellerjordan.github.io/posts/muon/
2. Liu, J. et al. (Kimi Team). Muon is Scalable for LLM Training. https://arxiv.org/abs/2502.16982
3. DeepSeek AI. DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence. https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro
4. Lu, Y. et al. One-step Latent-free Image Generation with Pixel Mean Flows. https://arxiv.org/abs/2601.22158
5. Akiti, C. et al. Nucleus-Image: Sparse MoE for Image Generation. https://arxiv.org/abs/2604.12163
6. Li, T. & He, K. Back to Basics: Let Denoising Generative Models Denoise. https://arxiv.org/abs/2511.13720
7. Peebles, W. & Xie, S. Scalable Diffusion Models with Transformers. https://arxiv.org/abs/2212.09748
8. Yang, G. et al. Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer. https://arxiv.org/abs/2203.03466
9. Bernstein, J. Deriving Muon. https://jeremybernste.in/writing/deriving-muon
10. Dao, T. Gram Newton-Schulz: A Fast, Hardware-Aware Newton-Schulz Algorithm for Muon. https://tridao.me/blog/2026/gram-newton-schulz/
