Urban-WORM (Workflow Of Reproducible Multimodal Inference) is a user-friendly, high-level interface for adding rich, meaningful captions to geotagged crowdsourced data using multimodal models.