Test Task Introduction
This platform evaluates multimodal large language models as embodied agents on eight composite daily household tasks, measuring capabilities including object understanding, spatial intelligence, and social activities.
Counting Objects
Evaluate the model's ability to identify and count specific objects in a scene.
Preparing Baggage
Test the model's ability to select and organize appropriate items based on travel needs.
Building Blocks
Evaluate the model's spatial reasoning and manipulation ability in understanding and executing block-building instructions.
Jigsaw Puzzle
Test the model's visual reasoning ability to recognize patterns and complete puzzle tasks.
Understanding Buttons
Evaluate the model's ability to identify button functions and predict operation results.
Setting Tables
Test the model's ability to arrange items sensibly based on their categories and spatial relationships.
Tidying Up Rooms
Evaluate the model's ability to plan cleaning tasks and execute reasonable operation sequences.
Selecting Gifts
Test the model's ability to select appropriate gifts based on personal relationships and scenario requirements.
Model Performance Comparison
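The comparison table reports an Equal Weighted Average alongside the eight per-task scores. The platform's exact aggregation is not documented here, so the following is only a minimal sketch, assuming the column is the plain arithmetic mean of the eight task scores with each task weighted 1/8. The function name `equal_weighted_average` and the score dictionary layout are hypothetical, chosen for illustration.

```python
from statistics import fmean

# The eight task names, taken from the table header.
TASKS = [
    "Counting Objects",
    "Preparing Baggage",
    "Building Blocks",
    "Jigsaw Puzzle",
    "Understanding Buttons",
    "Setting Tables",
    "Tidying Up Rooms",
    "Selecting Gifts",
]


def equal_weighted_average(scores: dict[str, float]) -> float:
    """Return the unweighted mean of the eight per-task scores.

    Assumption: every task contributes equally (weight 1/8).
    A missing task raises an error instead of being silently skipped,
    so averages stay comparable across models.
    """
    missing = [task for task in TASKS if task not in scores]
    if missing:
        raise ValueError(f"missing task scores: {missing}")
    return fmean(scores[task] for task in TASKS)


if __name__ == "__main__":
    # Placeholder numbers for illustration only, not real benchmark results.
    placeholder = {task: 10.0 * (i + 1) for i, task in enumerate(TASKS)}
    print(equal_weighted_average(placeholder))
```

Requiring all eight scores is a design choice: averaging over whichever tasks happen to be present would let a model with missing results appear stronger or weaker than it is.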
| Model | Equal Weighted Average | Counting Objects | Preparing Baggage | Building Blocks | Jigsaw Puzzle | Understanding Buttons | Setting Tables | Tidying Up Rooms | Selecting Gifts |
|---|---|---|---|---|---|---|---|---|---|