This is the official repository, which contains the training and inference code for ThinkMorph.
We present ThinkMorph, a unified model fine-tuned on ~24K high-quality interleaved reasoning traces across tasks, learning to generate progressive text–image reasoning steps that concretely manipulate visual content while maintaining coherent verbal logic.
Beyond strong performance on vision-centric benchmarks and robust out-of-domain generalization, ThinkMorph demonstrates emergent multimodal intelligence, including novel visual manipulation skills. These findings suggest promising directions for characterizing the emergent capabilities of unified models for multimodal reasoning.
1️⃣ Set up environment
git clone https://github.com/ThinkMorph/ThinkMorph.git
cd ThinkMorph
conda create -n thinkmorph python=3.10 -y
conda activate thinkmorph
pip install -r requirements.txt
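After installing the requirements, you can optionally confirm that PyTorch was installed and can see a GPU before moving on. This is just a sanity check and assumes the requirements file pulls in PyTorch:

import torch

# Verify the PyTorch install and GPU visibility before downloading checkpoints.
print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())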
2️⃣ Download checkpoint
from huggingface_hub import snapshot_download

save_dir = "models/ThinkMorph-7B"
repo_id = "ThinkMorph/ThinkMorph-7B"
cache_dir = save_dir + "/cache"

snapshot_download(
    repo_id=repo_id,
    local_dir=save_dir,
    cache_dir=cache_dir,
    local_dir_use_symlinks=False,
    resume_download=True,
    allow_patterns=["*.json", "*.safetensors", "*.bin", "*.py", "*.md", "*.txt"],
)
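Once the download completes, you can optionally list the contents of save_dir to confirm that the config and weight files are present. This is a small standard-library sketch; the exact file names depend on the contents of the Hugging Face repo:

import os

save_dir = "models/ThinkMorph-7B"

# Print each downloaded file with its size so missing or truncated shards are easy to spot.
for name in sorted(os.listdir(save_dir)):
    path = os.path.join(save_dir, name)
    if os.path.isfile(path):
        print(f"{name}\t{os.path.getsize(path) / 1e6:.1f} MB")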
3️⃣ Use inference.ipynb to play with ThinkMorph!
We open-source the training data described in our paper, covering four tasks: Jigsaw Assembly, Spatial Navigation, Visual Search, and Chart Refocus. Typical examples of the four tasks are shown below. The training data can be downloaded from Hugging Face.
Download the training dataset
from datasets import load_dataset
# Jigsaw Assembly
dataset = load_dataset("ThinkMorph/Jigsaw_Assembly", split="train")
# Spatial Navigation
dataset = load_dataset("ThinkMorph/Spatial_Navigation", split="train")
# Visual Search
dataset = load_dataset("ThinkMorph/Visual_Search", split="train")
# Chart Refocus
dataset = load_dataset("ThinkMorph/Chart_Refocus", split="train")
Convert the downloaded dataset into the data format required for model training. We provide a format-processing script here.
Building on Bagel's implementation, we modified the training code to support our interleaved data format. A simplified example of a parquet record is shown below:
{
    "image_list": [problem_image_0, reasoning_image_0],
    "instruction_list": [question],
    "output_text_list": [
        f"<think>{reasoning_thought_0}</think><image_start>",
        f"<image_end><think>{reasoning_thought_1}</think><answer>{answer}</answer>",
    ],
}
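As a rough illustration of the conversion step, the sketch below assembles one record in this layout and writes it to a parquet file with pandas. It is not the provided processing script: the image encoding (raw PNG bytes here) and the exact columns expected by the training code are assumptions, so defer to the script linked above for the real format.

import io

import pandas as pd
from PIL import Image


def encode_image(image: Image.Image) -> bytes:
    # Serialize an image as PNG bytes so it can be stored inside a parquet cell
    # (an assumption made only for this illustration).
    buffer = io.BytesIO()
    image.save(buffer, format="PNG")
    return buffer.getvalue()


# Hypothetical stand-ins for one training example.
problem_image_0 = Image.new("RGB", (256, 256))
reasoning_image_0 = Image.new("RGB", (256, 256))
question = "Which piece completes the jigsaw?"
reasoning_thought_0 = "First, reassemble the pieces to form the full picture."
reasoning_thought_1 = "The assembled image shows that option B fits."
answer = "B"

record = {
    "image_list": [encode_image(problem_image_0), encode_image(reasoning_image_0)],
    "instruction_list": [question],
    "output_text_list": [
        f"<think>{reasoning_thought_0}</think><image_start>",
        f"<image_end><think>{reasoning_thought_1}</think><answer>{answer}</answer>",
    ],
}

pd.DataFrame([record]).to_parquet("thinkmorph_example.parquet")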
Edit data/dataset_info.py with your own data path (an illustrative entry is sketched after these steps).
Edit configs/example.yaml. Additionally, we provide example configuration files corresponding to the different training settings in data/configs.
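For orientation, Bagel-style training registers each dataset in a Python dictionary inside data/dataset_info.py. The entry below is only a hedged illustration of what a record for the converted interleaved data might look like; the actual keys, group names, and counts are defined by this repo, so copy the structure from the existing entries rather than from this sketch.

# data/dataset_info.py (illustrative entry only; mirror the keys used in this repo)
DATASET_INFO = {
    "interleaved_reasoning": {            # dataset group referenced from the YAML config (name assumed)
        "jigsaw_assembly": {
            "data_dir": "/path/to/converted/jigsaw_assembly",  # directory holding the converted parquet shards
            "num_files": 16,              # number of parquet shards (placeholder value)
            "num_total_samples": 6000,    # total training examples (placeholder value)
        },
    },
}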
We provide example scripts for the three training settings in our paper (interleaved reasoning, text reasoning, and ThinkMorph) in ./script. Below is the training script for interleaved reasoning:
torchrun \
--nnodes=$num_nodes \
--node_rank=$node_rank \
--nproc_per_node=8 \
--master_addr=$master_addr \
--master_port=$master_port \
train/pretrain_unified_navit.py \
--dataset_config_file ./data/configs/interleaved_reasoning.yaml \
--model_path $model_path \
--layer_module Qwen2MoTDecoderLayer \
--finetune_from_hf True \
--auto_resume True \
--finetune-from-ema True \
--resume-from $model_path \
--results_dir $output_path \
--checkpoint_dir $ckpt_path \
--lr 1e-5 \
--num_worker 4 \
--max_latent_size 64 \
--max_num_tokens 32768 \
--vit_cond_dropout_prob 0 \
--text_cond_dropout_prob 0 \
--mse_weight 1 \
--ce_weight 1 \
--total_steps 8000
Replace the variables in the script with your own values before running; see Bagel's TRAIN for more details. Note that --vit_cond_dropout_prob is set to 0 (see https://github.com/ByteDance-Seed/Bagel/issues/69 for the reasoning).
Our evaluation code is open-sourced in VLMEvalKit_Thinkmorph, which provides evaluation support for the ThinkMorph model on top of VLMEvalKit. It also supports all the benchmarks evaluated in our paper: VSP, VisPuzzle, ChartQA, VStar, BLINK-J, MMVP, SAT, BLINK, and CV-Bench.
| Model | Size | VSP | VisPuzzle | ChartQA | VStar | BLINK-J | MMVP | SAT | BLINK | CV-Bench |
| --- | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| GPT-4o | – | 33.50 | 43.75 | 76.34 | 61.78 | 72.67 | 84.67 | 28.00 | 60.28 | 75.61 |
| GPT-5 | – | 57.33 | 78.00 | 80.85 | 71.73 | 77.33 | 86.33 | 73.30 | 69.86 | 85.46 |
| Gemini 2.5 Flash | – | 59.33 | 47.00 | 83.79 | 70.68 | 66.00 | 80.33 | 56.00 | 67.49 | 85.07 |
| InternVL3.5 | 8B | 8.17 | 34.75 | 76.26 | 68.59 | 71.33 | 76.33 | 45.33 | 59.60 | 81.99 |
| InternVL3.5 | 38B | 20.16 | 36.50 | 80.44 | 76.96 | 80.67 | 80.33 | 49.33 | 62.65 | 85.96 |
| Qwen2.5-VL | 7B | 2.16 | 34.75 | 78.12 | 76.44 | 59.33 | 77.33 | 51.33 | 55.92 | 75.20 |
| Qwen2.5-VL | 72B | 41.83 | 40.00 | 82.03 | 85.86 | 61.33 | 82.00 | 64.67 | 61.91 | 82.54 |
| Janus-pro | 7B | 0.00 | 33.50 | 43.08 | 38.22 | 50.67 | 63.33 | 22.00 | 38.51 | 67.83 |
| Chameleon | 7B | 0.83 | 30.50 | 5.74 | 28.27 | 0.67 | 47.67 | 10.67 | 16.52 | 36.52 |
| Bagel | 7B | 0.83* | 35.00* | 61.82 | 55.49 | 67.33 | 70.33 | 44.67 | 47.66 | 76.03 |
| ThinkMorph | 7B | 75.83 | 79.00 | 78.10 | 67.02 | 72.00 | 80.33 | 52.67 | 60.07 | 80.82 |
| Δ (vs Bagel) | | +75.00 | +44.00 | +16.28 | +11.53 | +4.67 | +10.00 | +8.00 | +12.41 | +4.79 |
@article{gu2025thinkmorph,
title={ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning},
author={Gu, Jiawei and Hao, Yunzhuo and Wang, Huichen Will and Li, Linjie and Shieh, Michael Qizhe and Choi, Yejin and Krishna, Ranjay and Cheng, Yu},
journal={arXiv preprint arXiv:2510.27492},
year={2025}
}