A diffusion multimodal LLM that captions arbitrary regions of an image in parallel. Upload an image and one or more binary masks, then run inference. Hover over a region to highlight it, and replay the diffusion decoding to watch each caption emerge token by token.
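
If you do not already have masks, the sketch below shows one way to build them from bounding boxes with NumPy. It is only an illustration: the helper name `box_to_mask` is hypothetical, and it assumes each mask is a single-channel array with the same height and width as the image, where 1 marks the region and 0 marks everything else. The exact mask format the demo expects is not stated here, so check it before uploading.

```python
import numpy as np


def box_to_mask(height, width, box):
    """Return a binary mask (uint8, values 0 or 1) for one rectangular region.

    `box` is (x0, y0, x1, y1) in pixel coordinates, with x1 and y1 exclusive.
    Hypothetical helper; the 0/1 uint8 layout is an assumption.
    """
    x0, y0, x1, y1 = box
    mask = np.zeros((height, width), dtype=np.uint8)
    mask[y0:y1, x0:x1] = 1  # rows index y, columns index x
    return mask


# One mask per region to caption, all sized to match a 4x6 (height x width) image.
masks = [
    box_to_mask(4, 6, (1, 1, 3, 3)),
    box_to_mask(4, 6, (3, 0, 6, 2)),
]
```

Each entry in `masks` corresponds to one region, so a list of several masks would yield several captions for the same image.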