SAT: Spatial Aptitude Training for Multimodal Language Models

Abstract

Spatial perception is a fundamental component of intelligence. While many studies highlight that large multimodal language models (MLMs) struggle to reason about space, they only test for static spatial reasoning, such as categorizing the relative positions of objects. Meanwhile, real-world deployment requires dynamic capabilities like perspective-taking and egocentric action recognition. As a roadmap to improving spatial intelligence, we introduce SAT, Spatial Aptitude Training, which goes beyond static relative object position questions to more dynamic tasks. SAT contains 218K question-answer pairs for 22K synthetic scenes, split into training and test sets. Generated using a photo-realistic physics engine, our dataset can be arbitrarily scaled and easily extended to new actions, scenes, and 3D assets. We find that even MLMs that perform relatively well on static questions struggle to accurately answer dynamic spatial questions. Further, we show that SAT instruction-tuning data improves not only dynamic spatial reasoning on SAT, but also zero-shot performance on existing real-image spatial benchmarks, with gains of 23% on CV-Bench, 9% on the harder BLINK benchmark, and 18% on VSR. When instruction-tuned on SAT, LLaVA-13B matches larger proprietary MLMs like GPT4-V and Gemini-3-1.0 in spatial reasoning.

Approach

We take actions in a 3D simulator and record the 3D locations of assets before and after each action. Using natural language descriptions of the assets, we generate QA pairs from how the 3D layout of the scene changes with the actions taken.
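To make the idea concrete, here is a minimal sketch of how one QA pair could be derived from 3D positions and an action. It is an illustration under stated assumptions, not the SAT generation pipeline: the `Pose` class, the action names, the 90-degree turn, the 0.25 m step, and the question template are all hypothetical, and the coordinate convention (x to the right, z forward, positive yaw turning right) is assumed.

```python
import math
from dataclasses import dataclass


@dataclass
class Pose:
    """Hypothetical agent pose on the ground plane (assumed convention)."""
    x: float
    z: float
    yaw_deg: float  # 0 = facing +z; positive yaw = turning right


def to_camera_frame(pose, obj_xz):
    """Return (right, forward) offsets of a world point relative to the agent."""
    dx = obj_xz[0] - pose.x
    dz = obj_xz[1] - pose.z
    yaw = math.radians(pose.yaw_deg)
    # Forward axis is (sin yaw, cos yaw); right axis is (cos yaw, -sin yaw).
    right = dx * math.cos(yaw) - dz * math.sin(yaw)
    forward = dx * math.sin(yaw) + dz * math.cos(yaw)
    return right, forward


def side_of(pose, obj_xz):
    """Classify an object as left or right of the agent's viewing direction."""
    right, _ = to_camera_frame(pose, obj_xz)
    return "right" if right > 0 else "left"


def apply_action(pose, action):
    """Apply an illustrative action; step size and turn angle are assumptions."""
    if action == "rotate_right":
        return Pose(pose.x, pose.z, pose.yaw_deg + 90.0)
    if action == "rotate_left":
        return Pose(pose.x, pose.z, pose.yaw_deg - 90.0)
    if action == "move_ahead":
        yaw = math.radians(pose.yaw_deg)
        return Pose(pose.x + 0.25 * math.sin(yaw),
                    pose.z + 0.25 * math.cos(yaw),
                    pose.yaw_deg)
    raise ValueError(f"unknown action: {action}")


def make_qa(description, obj_xz, pose, action):
    """Build one multiple-choice QA pair from the scene state after the action."""
    before = side_of(pose, obj_xz)
    after = side_of(apply_action(pose, action), obj_xz)
    question = (f"If I {action.replace('_', ' ')}, will the {description} "
                f"be to my left or to my right?")
    return {"question": question, "choices": ["left", "right"],
            "answer": after, "before": before}


if __name__ == "__main__":
    qa = make_qa("red chair", (1.0, 2.0), Pose(0.0, 0.0, 0.0), "rotate_right")
    print(qa["question"], "->", qa["answer"])
```

Because the answer is computed from simulator state rather than annotated by hand, the same template can be reapplied to any new scene, asset, or action, which is what makes this style of data easy to scale.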

Results

Existing MLMs underperform on SAT Dynamic Tasks

Fine-tuning on SAT lets a 13B MLM match larger models

CV-Bench

BLINK

Our SAT-tuned model remembers pre-training commonsense

Tuning with SAT Dynamic data outperforms other types of spatial tuning

Some examples

Qualitative results on spatial question answering. Our model improves over the LLaVA baseline on spatial reasoning over real images, and on dynamic capabilities on our test set. Multiview reasoning and egocentric movement remain challenging on real images. Although our evaluation uses multiple-choice questions, we also show some longer conversational responses.

BibTeX

@misc{ray2024satspatialaptitudetraining,
title={SAT: Spatial Aptitude Training for Multimodal Language Models}, 
author={Arijit Ray and Jiafei Duan and Reuben Tan and Dina Bashkirova and Rose Hendrix and Kiana Ehsani and Aniruddha Kembhavi and Bryan A. Plummer and Ranjay Krishna and Kuo-Hao Zeng and Kate Saenko},
year={2024},
eprint={2412.07755},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2412.07755}, 
}