[{"content":"Muon [1] allows us to train LLMs faster, requiring up to 50% less compute to reach a given loss [2]. By now, Muon has gone beyond a research experiment and established itself in large-scale production training runs, such as DeepSeek-V4 [3].\nFor diffusion transformers (DiT), the picture is less clear. Although several papers this year have started using Muon instead of Adam [4, 5], systematic comparisons between the two optimizers are lacking. This post fills that gap with experiments on ImageNet 256$\\times$256: I first tune the Adam baseline learning rate, then evaluate two strategies for tuning Muon, and finally scale the best configurations for each optimizer to a 465M-parameter model.\nTo keep the experiments self-contained without relying on external pretrained models, I do not use a VAE and instead predict directly in pixel space [6].\nSetup \u0026amp; Baseline\nFor all experiments, I use a modernized version of the DiT architecture [7] with AdaLN for timestep and class conditioning, SwiGLU activations, RoPE, QK-Norm, no bias parameters, and a patch size of 16. I train the baseline models using Adam without weight decay with $\\beta_1=0.9$ and $\\beta_2=0.95$. All models are trained with a batch size of 256. All FID scores are computed without classifier-free guidance.\nThe target model configuration is a 465M-parameter DiT-L with 24 layers and a width of 1024. Sweeping the learning rate on this model would be too computationally expensive. Instead, I sweep on a smaller 31M-parameter proxy model with a width of 256 and the same number of layers, then use the maximal update parametrization (muP) [8] to transfer the best-performing learning rate to the large target model.\nIn a sweep over learning rates of 2e-4, 4e-4, and 8e-4, the model achieves a significantly lower FID at 4e-4 than at 2e-4. 
However, increasing the learning rate further to 8e-4 leads to unstable training.\nAdam learning rate sweep on the small proxy model.\nIntroducing Muon\nLet\u0026rsquo;s start with a quick recap of Muon. The key difference between Muon and Adam is the orthogonalization of gradient update matrices. However, a full singular value decomposition performed at every step would massively slow down training. Instead, Muon approximates the orthogonalization with the much faster iterative Newton-Schulz algorithm. A common theory for why orthogonalization helps is that Adam\u0026rsquo;s updates tend to be dominated by a few large singular directions, while orthogonalization equalizes the singular values and thus spreads the update energy more evenly across all directions.\nI apply Muon only to the hidden layers in the transformer blocks. The patch embedding, output layer, timestep MLP, and AdaLN modulation layers are still optimized using Adam.\nUsing Muon, we can choose between two methods for scaling the learning rate $\\gamma$. The method originally proposed by Muon\u0026rsquo;s authors scales the update such that its RMS-to-RMS operator norm is bounded by the learning rate: $$ \\gamma \\leftarrow \\gamma \\sqrt{\\frac{\\text{fan-out}}{\\text{fan-in}}} $$ Although this requires separately tuning the learning rate for Muon, it enables learning rate transfer across model widths without any additional muP learning rate scaling [9]. This means the two optimizers transfer their learning rate to larger models differently: the Adam learning rate tuned on the proxy is transferred via muP, whereas Muon\u0026rsquo;s original scaling already handles the width transfer intrinsically, so no separate muP scaling is applied. 
With this original scaling method, Muon requires a larger learning rate than Adam; a common rule of thumb is to start with a rate roughly 10 times higher.\nAt small scale, Muon\u0026rsquo;s FID curves were noticeably noisier than Adam\u0026rsquo;s, so I report Muon results averaged over 3 random seeds; the Adam curves were clean enough that a single seed was sufficient. Averaged over these 3 seeds, Muon consistently outperforms Adam across the entire learning rate sweep. After 400k steps, a learning rate of 4e-3 achieves the best FID, 10% lower than Adam\u0026rsquo;s.\nMuon learning rate sweep on the small proxy model using Muon\u0026rsquo;s original learning rate scaling. The black line shows the FID curve from the best Adam training run.\nThe alternative method, proposed by Moonshot AI [2], scales the update so its element-wise RMS matches the RMS of Adam\u0026rsquo;s update: $$ \\gamma \\leftarrow 0.2\\gamma \\sqrt{\\max(\\text{fan-in}, \\text{fan-out})} $$ This scaling allows us to reuse the learning rate tuned for Adam. The FID after 400k steps is almost on par with the original scaling method. Since the original scaling achieves a slightly better FID, I use it for all further experiments.\nMuon with its update RMS matched to Adam evaluated on the small proxy model.\nComparison at scale\nThe advantage of Muon holds at larger scales. With the model width doubled to 512, the Muon-trained model still reaches an FID about 10% lower than the Adam-trained model after 400k steps. Moving even further to the target 465M-parameter model with a width of 1024, Muon reaches a 15% lower FID than Adam after 200k steps. I did not train the full-sized model further due to a lack of compute budget.\nComparing Muon to Adam using the model with a width of 512.\nComparing Muon to Adam using the DiT-L model.\nTraining time\nMuon reaches a lower FID than Adam in the same number of training steps in these experiments. 
However, each Muon step is slower due to the orthogonalization, making a step-count comparison unfair. To minimize the overhead, I use Tri Dao\u0026rsquo;s optimized Gram Newton-Schulz kernel [10].\nAt full model size, the per-step overhead of Muon is ~5%. It still reaches a lower FID than Adam in the same wall-clock time. The overhead would also shrink further with larger batch sizes.\nTraining time comparison using a model with a width of 512 trained for 400k steps.\nTraining time comparison using the DiT-L model trained for 200k steps.\nConclusion\nWe have seen that Muon significantly outperforms Adam for training diffusion transformers in these experiments. What remains is to evaluate how Muon compares to Adam at billion-parameter scale, and whether Adam eventually catches up with enough training steps.\nThe training code can be found here: https://github.com/sven-luepke/pixel-dit-muon/tree/main\nReferences\nJordan, K. et al. Muon: An optimizer for hidden layers in neural networks. https://kellerjordan.github.io/posts/muon/\nLiu, J. et al. (Kimi Team). Muon is Scalable for LLM Training. https://arxiv.org/abs/2502.16982\nDeepSeek AI. DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence. https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro\nLu, Y. et al. One-step Latent-free Image Generation with Pixel Mean Flows. https://arxiv.org/abs/2601.22158\nAkiti, C. et al. Nucleus-Image: Sparse MoE for Image Generation. https://arxiv.org/abs/2604.12163\nLi, T. \u0026amp; He, K. Back to Basics: Let Denoising Generative Models Denoise. https://arxiv.org/abs/2511.13720\nPeebles, W. \u0026amp; Xie, S. Scalable Diffusion Models with Transformers. 
https://arxiv.org/abs/2212.09748\nYang, G. et al. Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer. https://arxiv.org/abs/2203.03466\nBernstein, J. Deriving Muon. https://jeremybernste.in/writing/deriving-muon\nDao, T. Gram Newton-Schulz: A Fast, Hardware-Aware Newton-Schulz Algorithm for Muon. https://tridao.me/blog/2026/gram-newton-schulz/\n","permalink":"https://sven-luepke.github.io/blog/2026-05-31-dit-muon/","summary":"How fast can we train DiTs with the Muon optimizer?","title":"Training Diffusion Transformers with Muon"},{"content":"This was my guided research project at the Chair for Computer Aided Medical Procedures at Technical University of Munich. The resulting paper was presented at the 5th International Workshop on Multiscale Multimodal Medical Imaging at MICCAI 2024.\nProject page Code Paper ","permalink":"https://sven-luepke.github.io/projects/physics-informed-latent-diffusion-mri/","summary":"Paper presented at the 5th International Workshop on Multiscale Multimodal Medical Imaging at MICCAI 2024","title":"Physics-Informed Latent Diffusion for Multimodal Brain MRI Synthesis"},{"content":"Interdisciplinary project at the Chair for Data-driven Materials Modeling at Technical University of Munich.\nCode Project report ","permalink":"https://sven-luepke.github.io/projects/generative-modeling-inverse-molecular-design/","summary":"Interdisciplinary project at the Chair for Data-driven Materials Modeling at Technical University of Munich","title":"Generative Modeling for Inverse Molecular Design"},{"content":"Advanced graphics effect injector for The Witcher 3.\nCode Nexus Mods ","permalink":"https://sven-luepke.github.io/projects/blitz-fx/","summary":"Graphics effect injector for The Witcher 3","title":"BlitzFX"}]