Text-to-image generative models have achieved remarkable breakthroughs in recent years. However, their application to medical image generation still faces significant challenges, including small dataset sizes and the scarcity of medical textual data. To address these challenges, we propose Med-Art, a framework specifically designed for medical image generation with limited data. Med-Art leverages vision-language models to generate visual descriptions of medical images, which overcomes the scarcity of applicable medical textual data. Med-Art adapts a large-scale pre-trained text-to-image model, PixArt-$\alpha$, based on the Diffusion Transformer (DiT), achieving high performance under limited data. Furthermore, we propose an innovative Hybrid-Level Diffusion Fine-tuning (HLDF) method, which enables pixel-level losses, effectively addressing issues such as overly saturated colors. We achieve state-of-the-art performance on two medical image datasets, measured by FID, KID, and downstream classification performance.
To address unnaturally saturated colors, we penalize deviations of the color distribution in pixel space. We minimize the difference in the mean $\mu_c$ and standard deviation $\sigma_c$ of each color channel $c$ between the input image $x$ and the generated image $\tilde{x}$. The loss is defined as:
\[ \mathcal{L}_{\text{color}} := \mathbb{E}_x\left[\sum_{c} \left( \|\mu_c(\tilde{x}) - \mu_c(x)\|_2^2 + \|\sigma_c(\tilde{x}) - \sigma_c(x)\|_2^2 \right)\right] \]
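As a minimal illustration, the loss can be computed directly from per-channel statistics, as in the following PyTorch sketch. The function name and the assumption of $(B, C, H, W)$ tensors are ours; the batch mean plays the role of the expectation $\mathbb{E}_x$.

```python
import torch

def color_loss(x_gen: torch.Tensor, x_real: torch.Tensor) -> torch.Tensor:
    """Penalize per-channel mean/std deviations between generated and real images.

    Both tensors are assumed to have shape (B, C, H, W); the average over the
    batch dimension approximates the expectation in the paper's formula.
    """
    # Per-channel statistics over the spatial dimensions.
    mu_gen = x_gen.mean(dim=(2, 3))     # (B, C)
    mu_real = x_real.mean(dim=(2, 3))   # (B, C)
    std_gen = x_gen.std(dim=(2, 3))     # (B, C)
    std_real = x_real.std(dim=(2, 3))   # (B, C)

    # Squared differences, summed over channels, averaged over the batch.
    return ((mu_gen - mu_real) ** 2 + (std_gen - std_real) ** 2).sum(dim=1).mean()
```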
To compute this loss, we need to generate images during training. For this, we use classifier-free guidance with a guidance weight of 4.5, and for sampling we use DPM-Solver++, which enables the generation of high-quality images in 15–20 steps.
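For concreteness, the sketch below shows this sampling configuration using the diffusers library; the model identifier and prompt are illustrative assumptions. Note that the off-the-shelf pipeline call is inference-only, whereas computing the color loss requires differentiating through the sampling loop.

```python
import torch
from diffusers import PixArtAlphaPipeline, DPMSolverMultistepScheduler

# Load a pre-trained PixArt-alpha pipeline (model id is an assumption;
# Med-Art fine-tunes this backbone with LoRA).
pipe = PixArtAlphaPipeline.from_pretrained(
    "PixArt-alpha/PixArt-XL-2-512x512", torch_dtype=torch.float16
).to("cuda")

# DPM-Solver++ scheduler for fast, high-quality sampling.
pipe.scheduler = DPMSolverMultistepScheduler.from_config(
    pipe.scheduler.config, algorithm_type="dpmsolver++"
)

# Classifier-free guidance weight 4.5; 20 sampling steps (M in the paper).
image = pipe(
    prompt="a dermoscopic image of a melanocytic nevus",  # hypothetical prompt
    guidance_scale=4.5,
    num_inference_steps=20,
).images[0]
```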
Even with this relatively small number of steps, differentiating through the sampling process remains computationally and memory intensive. To lower the required memory, at the cost of additional compute, we use gradient checkpointing. We further reduce the required computation and training time drastically by adopting an interval optimization strategy, in which we only compute this loss every $N$ steps. The loss is then:
\[ \mathcal{L}_{\text{HLDF}} = \mathcal{L}_{\text{Pixart-}\alpha\text{+LoRA}} + \begin{cases} \frac{1}{M} \cdot \mathcal{L}_{\text{color}}, & \text{if step} \equiv 0 \mod N \\ 0, & \text{otherwise} \end{cases} \]
Here, \( M \) denotes the number of steps used by DPM-Solver++ for image generation during training (20 by default), and \( \frac{1}{M} \) weights the color loss.
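A training step under this strategy might look as follows. This is a sketch under stated assumptions: the function names, the value of $N$, and the batch layout are hypothetical, and the sampling callable is assumed to be differentiable and wrapped in gradient checkpointing.

```python
# Hypothetical hyperparameters: M follows the paper's default; the interval N
# is an assumption for illustration.
M = 20
N = 10

def hldf_step(step, batch, base_loss_fn, generate_fn, color_loss_fn):
    """One HLDF training step (sketch; all callables are hypothetical).

    base_loss_fn  -- standard PixArt-alpha + LoRA diffusion loss (latent level)
    generate_fn   -- differentiable DPM-Solver++ sampling (M steps, CFG 4.5),
                     wrapped in gradient checkpointing to bound memory
    color_loss_fn -- the pixel-level color statistics loss defined above
    """
    loss = base_loss_fn(batch)
    if step % N == 0:
        # Only every N steps: generate images and add the pixel-level term,
        # weighted by 1/M.
        x_gen = generate_fn(batch["prompts"])
        loss = loss + color_loss_fn(x_gen, batch["images"]) / M
    return loss
```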
This work was financially supported by the Innovation Fund Denmark (IFD).