TongTest

Test Task Introduction

This platform evaluates multimodal large language models as embodied agents on 8 composite daily household tasks, measuring their capabilities in object understanding, spatial intelligence, and social activity.

Counting Objects

Evaluate the model's ability to identify and count specific objects in a scene.

Preparing Baggage

Test the model's ability to select and organize appropriate items based on travel needs.

Building Blocks

Evaluate the model's spatial reasoning and manipulation ability in understanding and executing block-building instructions.

Jigsaw Puzzle

Test the model's visual reasoning ability to recognize patterns and complete puzzle tasks.

Understanding Buttons

Evaluate the model's ability to identify button functions and predict operation results.

Setting Tables

Test the model's ability to arrange items reasonably based on categories and spatial relationships.

Tidying Up Rooms

Evaluate the model's ability to plan cleaning tasks and execute reasonable operation sequences.

Selecting Gifts

Test the model's ability to select appropriate gifts based on personal relationships and scenario requirements.

Model Performance Comparison

Model	Equal Weighted Average	Counting Objects	Preparing Baggage	Building Blocks	Jigsaw Puzzle	Understanding Buttons	Setting Tables	Tidying Up Rooms	Selecting Gifts

Test Dimension Introduction

This platform categorizes the 8 tasks into 3 core dimensions, evaluating the comprehensive capabilities of multimodal large language models from different perspectives.

Object Understanding Dimension

Evaluate the model's abilities in object recognition, counting, value assessment, and selection. Tasks include "Counting Objects" and "Selecting Gifts".

Spatial Intelligence Dimension

Evaluate the model's spatial reasoning and embodied task planning capabilities. Tasks include "Building Blocks", "Jigsaw Puzzle", and "Understanding Buttons".

Social Activity Dimension

Evaluate the model's planning and execution capabilities in social scenarios. Tasks include "Setting Tables", "Tidying Up Rooms", and "Preparing Baggage".
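The grouping above can be sketched in code. This is a minimal illustration, not the platform's actual scoring implementation. It assumes that each task yields a single numeric score, that the "Equal Weighted Average" is the plain mean over all 8 tasks, and that the "Dimension Weighted Average" is the mean of the three per-dimension means. The function names and the example scores are hypothetical.

```python
from statistics import mean

# Task-to-dimension grouping as described in the text above.
DIMENSIONS = {
    "Object Understanding": ["Counting Objects", "Selecting Gifts"],
    "Spatial Intelligence": ["Building Blocks", "Jigsaw Puzzle", "Understanding Buttons"],
    "Social Activity": ["Setting Tables", "Tidying Up Rooms", "Preparing Baggage"],
}


def equal_weighted_average(task_scores):
    """Assumed definition: plain mean over all task scores."""
    return mean(list(task_scores.values()))


def dimension_scores(task_scores):
    """Mean score of the tasks assigned to each dimension."""
    return {
        dim: mean([task_scores[task] for task in tasks])
        for dim, tasks in DIMENSIONS.items()
    }


def dimension_weighted_average(task_scores):
    """Assumed definition: mean of the three per-dimension means."""
    return mean(list(dimension_scores(task_scores).values()))


# Made-up placeholder scores for illustration only, not real results.
example_scores = {
    "Counting Objects": 80.0,
    "Selecting Gifts": 60.0,
    "Building Blocks": 50.0,
    "Jigsaw Puzzle": 40.0,
    "Understanding Buttons": 60.0,
    "Setting Tables": 90.0,
    "Tidying Up Rooms": 70.0,
    "Preparing Baggage": 80.0,
}

per_dimension = dimension_scores(example_scores)
task_average = equal_weighted_average(example_scores)
dimension_average = dimension_weighted_average(example_scores)
```

Under these assumptions the two averages generally differ, because Object Understanding contains two tasks while the other dimensions contain three, so each of its tasks carries more weight in the dimension-level average.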

Model Dimension Capability Comparison

Model	Dimension Weighted Average	Object Understanding	Spatial Intelligence	Social Activity

About Us

TongTest, a general embodied interaction testing platform, is committed to providing researchers and developers with comprehensive and objective evaluation data on AI models.

Our Mission

Promote the development and application of multimodal large language model technology through systematic testing and evaluation.

Testing Methodology

We adopt standardized testing processes and evaluation metrics to ensure the reliability and comparability of test results.

Contact Us

If you have any questions or suggestions, please contact us at research@bigai.ai.

TongTest - The North Star in the New Era of AGI