This is the official repository, which contains the training and inference code for ThinkMorph.
We present ThinkMorph, a unified model fine-tuned on ~24K high-quality interleaved reasoning traces across tasks, learning to generate progressive text–image reasoning steps that concretely manipulate visual content while maintaining coherent verbal logic.
Beyond strong performance on vision-centric benchmarks and robust out-of-domain generalization, ThinkMorph demonstrates emergent multimodal intelligence, including novel visual manipulation skills. These findings suggest promising directions for characterizing the emergent capabilities of unified models for multimodal reasoning.
1️⃣ Set up environment
git clone https://github.com/ThinkMorph/ThinkMorph.git
cd ThinkMorph
conda create -n thinkmorph python=3.10 -y
conda activate thinkmorph
pip install -r requirements.txt
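After installing the requirements, you can optionally confirm that PyTorch was installed and can see a GPU before moving on. This is just a sanity check and assumes the requirements file pulls in PyTorch:

import torch

# Verify the PyTorch install and GPU visibility before downloading checkpoints.
print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())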
2️⃣ Download checkpoint
from huggingface_hub import snapshot_download

save_dir = "models/ThinkMorph-7B"
repo_id = "ThinkMorph/ThinkMorph-7B"
cache_dir = save_dir + "/cache"

snapshot_download(
    repo_id=repo_id,
    local_dir=save_dir,
    cache_dir=cache_dir,
    local_dir_use_symlinks=False,
    resume_download=True,
    allow_patterns=["*.json", "*.safetensors", "*.bin", "*.py", "*.md", "*.txt"],
)
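Once the download completes, you can optionally list the contents of save_dir to confirm that the config and weight files are present. This is a small standard-library sketch; the exact file names depend on the contents of the Hugging Face repo:

import os

save_dir = "models/ThinkMorph-7B"

# Print each downloaded file with its size so missing or truncated shards are easy to spot.
for name in sorted(os.listdir(save_dir)):
    path = os.path.join(save_dir, name)
    if os.path.isfile(path):
        print(f"{name}\t{os.path.getsize(path) / 1e6:.1f} MB")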
3️⃣ Use inference.ipynb to play with ThinkMorph!
We open-source the training data described in our paper, covering four tasks: Jigsaw Assembly, Spatial Navigation, Visual Search, and Chart Refocus. Typical examples of the four tasks are shown below. The training data can be downloaded from Hugging Face.
Download the training dataset
from datasets import load_dataset
# Jigsaw Assembly
dataset = load_dataset("ThinkMorph/Jigsaw_Assembly", split="train")
# Spatial Navigation
dataset = load_dataset("ThinkMorph/Spatial_Navigation", split="train")
# Visual Search
dataset = load_dataset("ThinkMorph/Visual_Search", split="train")
# Chart Refocus
dataset = load_dataset("ThinkMorph/Chart_Refocus", split="train")
Convert the downloaded dataset into the data format required for model training. We provide a format-processing script here.
Building on Bagel's implementation, we modified the training code to support our interleaved data format. A simplified example of a parquet record is shown below:
{
    "image_list": [problem_image_0, reasoning_image_0],
    "instruction_list": [question],
    "output_text_list": [
        f"<think>{reasoning_thought_0}</think><image_start>",
        f"<image_end><think>{reasoning_thought_1}</think><answer>{answer}</answer>",
    ],
}
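As a rough illustration of the conversion step, the sketch below assembles one record in this layout and writes it to a parquet file with pandas. It is not the provided processing script: the image encoding (raw PNG bytes here) and the exact columns expected by the training code are assumptions, so defer to the script linked above for the real format.

import io

import pandas as pd
from PIL import Image


def encode_image(image: Image.Image) -> bytes:
    # Serialize an image as PNG bytes so it can be stored inside a parquet cell
    # (an assumption made only for this illustration).
    buffer = io.BytesIO()
    image.save(buffer, format="PNG")
    return buffer.getvalue()


# Hypothetical stand-ins for one training example.
problem_image_0 = Image.new("RGB", (256, 256))
reasoning_image_0 = Image.new("RGB", (256, 256))
question = "Which piece completes the jigsaw?"
reasoning_thought_0 = "First, reassemble the pieces to form the full picture."
reasoning_thought_1 = "The assembled image shows that option B fits."
answer = "B"

record = {
    "image_list": [encode_image(problem_image_0), encode_image(reasoning_image_0)],
    "instruction_list": [question],
    "output_text_list": [
        f"<think>{reasoning_thought_0}</think><image_start>",
        f"<image_end><think>{reasoning_thought_1}</think><answer>{answer}</answer>",
    ],
}

pd.DataFrame([record]).to_parquet("thinkmorph_example.parquet")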
Edit data/dataset_info.py with your own data path (an illustrative entry is sketched after these steps).
Edit configs/example.yaml. Additionally, we provide example configuration files corresponding to the different training settings in data/configs.
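For orientation, Bagel-style training registers each dataset in a Python dictionary inside data/dataset_info.py. The entry below is only a hedged illustration of what a record for the converted interleaved data might look like; the actual keys, group names, and counts are defined by this repo, so copy the structure from the existing entries rather than from this sketch.

# data/dataset_info.py (illustrative entry only; mirror the keys used in this repo)
DATASET_INFO = {
    "interleaved_reasoning": {            # dataset group referenced from the YAML config (name assumed)
        "jigsaw_assembly": {
            "data_dir": "/path/to/converted/jigsaw_assembly",  # directory holding the converted parquet shards
            "num_files": 16,              # number of parquet shards (placeholder value)
            "num_total_samples": 6000,    # total training examples (placeholder value)
        },
    },
}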
We provide example scripts for the three training settings in our paper (interleaved reasoning, text reasoning, and ThinkMorph) in ./script. Below is the training script for interleaved reasoning:
torchrun \
--nnodes=$num_nodes \
--node_rank=$node_rank \
--nproc_per_node=8 \
--master_addr=$master_addr \
--master_port=$master_port \
train/pretrain_unified_navit.py \
--dataset_config_file ./data/configs/interleaved_reasoning.yaml \
--model_path $model_path \
--layer_module Qwen2MoTDecoderLayer \
--finetune_from_hf True \
--auto_resume True \
--finetune-from-ema True \
--resume-from $model_path \
--results_dir $output_path \
--checkpoint_dir $ckpt_path \
--lr 1e-5 \
--num_worker 4 \
--max_latent_size 64 \
--max_num_tokens 32768 \
--vit_cond_dropout_prob 0 \
--text_cond_dropout_prob 0 \
--mse_weight 1 \
--ce_weight 1 \
--total_steps 8000
Replace the variables in the script with your own values before running; see Bagel's TRAIN for more details. Note that --vit_cond_dropout_prob is set to 0 (see https://github.com/ByteDance-Seed/Bagel/issues/69 for the reasoning).
Our evaluation code is open-sourced in VLMEvalKit_Thinkmorph, which provides evaluation support for the ThinkMorph model on top of VLMEvalKit. It also supports all the benchmarks evaluated in our paper: VSP, VisPuzzle, ChartQA, VStar, BLINK-J, MMVP, SAT, BLINK, and CV-Bench.
| Model | Size | VSP | VisPuzzle | ChartQA | VStar | BLINK-J | MMVP | SAT | BLINK | CV-Bench |
| --- | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| GPT-4o | – | 33.50 | 43.75 | 76.34 | 61.78 | 72.67 | 84.67 | 28.00 | 60.28 | 75.61 |
| GPT-5 | – | 57.33 | 78.00 | 80.85 | 71.73 | 77.33 | 86.33 | 73.30 | 69.86 | 85.46 |
| Gemini 2.5 Flash | – | 59.33 | 47.00 | 83.79 | 70.68 | 66.00 | 80.33 | 56.00 | 67.49 | 85.07 |
| InternVL3.5 | 8B | 8.17 | 34.75 | 76.26 | 68.59 | 71.33 | 76.33 | 45.33 | 59.60 | 81.99 |
| InternVL3.5 | 38B | 20.16 | 36.50 | 80.44 | 76.96 | 80.67 | 80.33 | 49.33 | 62.65 | 85.96 |
| Qwen2.5-VL | 7B | 2.16 | 34.75 | 78.12 | 76.44 | 59.33 | 77.33 | 51.33 | 55.92 | 75.20 |
| Qwen2.5-VL | 72B | 41.83 | 40.00 | 82.03 | 85.86 | 61.33 | 82.00 | 64.67 | 61.91 | 82.54 |
| Janus-pro | 7B | 0.00 | 33.50 | 43.08 | 38.22 | 50.67 | 63.33 | 22.00 | 38.51 | 67.83 |
| Chameleon | 7B | 0.83 | 30.50 | 5.74 | 28.27 | 0.67 | 47.67 | 10.67 | 16.52 | 36.52 |
| Bagel | 7B | 0.83* | 35.00* | 61.82 | 55.49 | 67.33 | 70.33 | 44.67 | 47.66 | 76.03 |
| ThinkMorph | 7B | 75.83 | 79.00 | 78.10 | 67.02 | 72.00 | 80.33 | 52.67 | 60.07 | 80.82 |
| Δ (vs Bagel) | | +75.00 | +44.00 | +16.28 | +11.53 | +4.67 | +10.00 | +8.00 | +12.41 | +4.79 |
@article{gu2025thinkmorph,
title={ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning},
author={Gu, Jiawei and Hao, Yunzhuo and Wang, Huichen Will and Li, Linjie and Shieh, Michael Qizhe and Choi, Yejin and Krishna, Ranjay and Cheng, Yu},
journal={arXiv preprint arXiv:2510.27492},
year={2025}
}