Attention Sinks in Diffusion Transformers: A Causal Analysis

Attention sinks in diffusion transformers. (a) In autoregressive LMs, attention sinks often act as stable anchors that attract dominant attention mass. (b) In diffusion transformers, dominant recipients vary across denoising timesteps; we perform a causal test by dynamically identifying sink tokens per step and suppressing them during inference. (c) Sink suppression preserves semantic alignment and preference scores (CLIP-T / ImageReward / HPS-v2), yet can induce perceptual and distributional shifts relative to baseline outputs (LPIPS / FID-shift), consistent with moving samples within the model's output manifold.
Fangzheng Wu
Brian Summa
Attention sinks—tokens that receive disproportionate attention mass—are assumed to be functionally important in autoregressive language models, but their role in diffusion transformers remains unclear. We present a causal analysis in text-to-image diffusion, dynamically identifying dominant attention recipients per timestep and suppressing them via paired, training-free interventions on the score and value paths. Across 553 GenEval prompts on Stable Diffusion 3 (with SDXL corroboration), removing these sinks does not degrade text-image alignment (CLIP-T) or preference proxies (ImageReward, HPS-v2) at k=1; only under stronger interventions (k≥10) does HPS-v2 exhibit a metric-dependent boundary, while CLIP-T remains robust throughout. The perceptual shifts induced by suppression are nonetheless sink-specific—∼6× larger than equal-budget random masking—revealing an empirical dissociation between trajectory-level perturbation and semantic alignment in diffusion transformers.
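To make the intervention concrete, below is a minimal PyTorch sketch of one way such a per-step sink suppression could be implemented. This is not the paper's code: the function name suppress_sinks, the tensor shapes, and the exact pairing of score masking (setting sink logits to -inf) with value zeroing are illustrative assumptions. Sinks are identified at each step as the k key tokens receiving the largest total post-softmax attention mass.

import torch

# Hypothetical sketch of a training-free, per-step sink-suppression intervention.
# attn_scores: pre-softmax logits of shape (batch, heads, queries, keys);
# values: value vectors of shape (batch, heads, keys, head_dim).
def suppress_sinks(attn_scores, values, k=1):
    # Identify dominant recipients: total post-softmax mass per key column.
    probs = attn_scores.softmax(dim=-1)          # (B, H, Q, K)
    mass = probs.sum(dim=(1, 2))                 # (B, K)
    sink_idx = mass.topk(k, dim=-1).indices      # (B, k)

    # Score-path intervention: mask sink columns before the softmax.
    B, H, Q, K = attn_scores.shape
    col_mask = torch.zeros(B, 1, 1, K, dtype=torch.bool, device=attn_scores.device)
    col_mask.scatter_(-1, sink_idx[:, None, None, :], True)
    scores = attn_scores.masked_fill(col_mask, float("-inf"))

    # Value-path intervention: zero the sink tokens' value vectors.
    values = values.clone()
    values.scatter_(2, sink_idx[:, None, :, None].expand(B, H, k, values.size(-1)), 0.0)

    # Recompute attention with both interventions applied.
    return scores.softmax(dim=-1) @ values

The equal-budget random-masking control referenced in the abstract would replace sink_idx with k uniformly sampled key indices per example; the ∼6× figure compares the perceptual shift (LPIPS) of sink suppression against that baseline.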
TBD
@inproceedings{wu2026attention,
  title     = {Attention Sinks in Diffusion Transformers: A Causal Analysis},
  author    = {Wu, Fangzheng and Summa, Brian},
  booktitle = {Proceedings of the 43rd International Conference on Machine Learning},
  year      = {2026}
}