ModernVBERT is a 250M-parameter vision–language encoder that aligns a text encoder (Ettin-150M) with a vision encoder (SigLIP2-B) through a masked language modeling (MLM) objective. When fine-tuned for document retrieval, ModernVBERT sets a new state of the art for sub-1B models on ViDoRe tasks.
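
To make the MLM objective concrete, here is a minimal sketch in plain Python of how such a loss is typically computed: some text tokens are replaced by a mask token, and cross-entropy is taken only at those masked positions. It is an illustration under stated assumptions, not ModernVBERT's training code. `MASK_ID`, `IGNORE`, `mask_tokens`, and `mlm_loss` are hypothetical names; the masking probability is a free parameter here; and the sketch assumes that only text positions are masked while image features would serve as extra context for the encoder that produces `logits`.

```python
import math
import random

MASK_ID = 0      # hypothetical id of the [MASK] token (assumption)
IGNORE = -100    # label marking positions that do not contribute to the loss


def mask_tokens(token_ids, mask_prob, rng):
    """Replace a random subset of text tokens with MASK_ID.

    Returns (inputs, labels): labels hold the original token at masked
    positions and IGNORE everywhere else.
    """
    inputs, labels = [], []
    for tok in token_ids:
        if rng.random() < mask_prob:
            inputs.append(MASK_ID)
            labels.append(tok)       # the model must recover this token
        else:
            inputs.append(tok)
            labels.append(IGNORE)    # unmasked positions are skipped
    return inputs, labels


def mlm_loss(logits, labels):
    """Mean cross-entropy over masked positions only.

    logits: one list of vocabulary scores per sequence position.
    labels: one target token id (or IGNORE) per sequence position.
    """
    total, count = 0.0, 0
    for row, label in zip(logits, labels):
        if label == IGNORE:
            continue
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[label]
        count += 1
    return total / count if count else 0.0
```

In a full model the `logits` would come from the encoder applied to the concatenated image and masked text inputs; here they are passed in directly so that the loss itself can be checked in isolation.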