🎯 PerceptionDLM Region Captioning

A diffusion multimodal LLM that captions any region of an image in parallel. Upload an image and one or more binary masks, then run inference — hover over a region to highlight it, and replay the diffusion decoding to watch each caption emerge token by token.
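For the upload path, a binary mask can be prepared ahead of time. The following is a minimal sketch using only the Python standard library: it builds a rectangular mask and encodes it as an 8-bit grayscale PNG. The helper names (`rect_mask`, `encode_mask_png`) are hypothetical, and the 0/255 encoding and PNG container are assumptions; the page only states that masks must be binary.

```python
import struct
import zlib


def _chunk(tag: bytes, data: bytes) -> bytes:
    """Build one PNG chunk: length, tag, payload, CRC over tag + payload."""
    crc = zlib.crc32(tag + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + tag + data + struct.pack(">I", crc)


def rect_mask(width: int, height: int, box: tuple) -> list:
    """Return a height x width grid of 0/1 with 1 inside box = (x0, y0, x1, y1).

    The box is half-open: x1 and y1 are excluded.
    """
    x0, y0, x1, y1 = box
    return [
        [1 if (x0 <= x < x1 and y0 <= y < y1) else 0 for x in range(width)]
        for y in range(height)
    ]


def encode_mask_png(mask: list) -> bytes:
    """Encode a 0/1 grid as an 8-bit grayscale PNG (0 = background, 255 = region).

    Assumption: the demo accepts masks in this encoding.
    """
    height, width = len(mask), len(mask[0])
    # Each scanline starts with filter type 0 (no filtering).
    raw = b"".join(
        b"\x00" + bytes(255 if v else 0 for v in row) for row in mask
    )
    # IHDR: width, height, bit depth 8, color type 0 (grayscale),
    # then compression, filter, and interlace methods, all 0.
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 0, 0, 0, 0)
    return (
        b"\x89PNG\r\n\x1a\n"
        + _chunk(b"IHDR", ihdr)
        + _chunk(b"IDAT", zlib.compress(raw))
        + _chunk(b"IEND", b"")
    )
```

Writing the returned bytes to a file opened in binary mode gives a mask file; the mask should have the same width and height as the image it describes.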

Model: MSALab/PerceptionDLM · Paper: arXiv:2606.19534 · Code: GitHub

Image

Regions to caption

How to provide masks

Generate mask via SAM 3 (click on image) · Upload a mask file

Click on the image below to place points, then press Generate mask via SAM 3. Add the resulting region and repeat to caption several regions.

Click type

include · exclude
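One way to picture the click workflow is as a list of points, each tagged with its click type. The sketch below is illustrative only: `ClickPrompt` is a hypothetical container, and the label convention (1 for include, 0 for exclude) is borrowed from the point prompts of the original Segment Anything predictor. Whether this demo or SAM 3 uses the same convention is an assumption.

```python
from dataclasses import dataclass, field

# Assumed label convention: 1 marks a point inside the region,
# 0 marks a point that should be left out of it.
INCLUDE, EXCLUDE = 1, 0


@dataclass
class ClickPrompt:
    """Accumulates clicks for one region before a mask is generated."""

    points: list = field(default_factory=list)  # (x, y) pixel coordinates
    labels: list = field(default_factory=list)  # INCLUDE or EXCLUDE per point

    def add(self, x: int, y: int, click_type: str = "include") -> None:
        """Record one click; click_type mirrors the include/exclude toggle."""
        if click_type not in ("include", "exclude"):
            raise ValueError(f"unknown click type: {click_type!r}")
        self.points.append((x, y))
        self.labels.append(INCLUDE if click_type == "include" else EXCLUDE)

    def as_arrays(self) -> tuple:
        """Return (coordinates, labels) as parallel plain lists."""
        return [list(p) for p in self.points], list(self.labels)


prompt = ClickPrompt()
prompt.add(120, 80)              # a point on the object
prompt.add(200, 90, "exclude")   # a point on the background next to it
coords, labels = prompt.as_arrays()
```

Keeping coordinates and labels in parallel lists makes it straightforward to hand them to a point-prompted segmenter, with one prompt object per region to caption.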

Click to place points

Generated mask (preview)

No regions added yet.

Prompt

Output

Regions (hover to highlight)

Decoding step

Examples

Image	How to provide masks	Mask images (binary, ≥1)	Prompt