🎯 PerceptionDLM Region Captioning

A diffusion multimodal LLM that captions any region of an image in parallel. Upload an image and one or more binary masks, then run inference. Hover over a region to highlight it, and replay the diffusion decoding to watch each caption emerge token by token.

Model: MSALab/PerceptionDLM · Paper: arXiv:2606.19534 · Code: GitHub

Regions to caption

How to provide masks

Click on the image below to place points, then press Generate mask via SAM 3. Add the resulting region and repeat to caption several regions.
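
If you prepare masks offline rather than clicking, the following is a minimal sketch of turning a grayscale mask into a binary one and checking that at least one non-empty mask is supplied. The helper names (`binarize_mask`, `validate_masks`), the threshold of 128, and the requirement that each mask match the image size are assumptions for illustration, not part of the demo's documented interface.

```python
from typing import List, Tuple

Mask = List[List[int]]  # rows of 0/1 values


def binarize_mask(gray: List[List[int]], threshold: int = 128) -> Mask:
    """Map each grayscale pixel (0-255) to 1 (region) or 0 (background).

    The threshold is an assumed default, not a value taken from the demo.
    """
    if not gray or not gray[0]:
        raise ValueError("mask is empty")
    width = len(gray[0])
    if any(len(row) != width for row in gray):
        raise ValueError("mask rows have different lengths")
    return [[1 if px >= threshold else 0 for px in row] for row in gray]


def validate_masks(masks: List[Mask], image_size: Tuple[int, int]) -> None:
    """Check that there is at least one mask and each one is usable.

    image_size is (height, width). Matching the image size is an assumed
    requirement for this sketch.
    """
    if not masks:
        raise ValueError("at least one mask is required")
    height, width = image_size
    for i, mask in enumerate(masks):
        if len(mask) != height or any(len(row) != width for row in mask):
            raise ValueError(f"mask {i} does not match the image size")
        if not any(any(row) for row in mask):
            raise ValueError(f"mask {i} selects no pixels")


# Usage: a 2x3 grayscale mask becomes a 2x3 binary mask.
gray = [[0, 200, 255],
        [0, 130, 10]]
mask = binarize_mask(gray)
validate_masks([mask], image_size=(2, 3))
print(mask)  # [[0, 1, 1], [0, 1, 0]]
```

Each region to caption gets its own mask, so several regions are passed as a list of masks of the same size.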


