MME

A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

Introduction

A Multimodal Large Language Model (MLLM) relies on a powerful LLM to perform multimodal tasks, and recent studies have shown amazing emergent abilities, such as writing poems based on an image. However, such case studies can hardly reflect the overall performance of an MLLM, and a comprehensive evaluation has been lacking. In this paper, we fill in this blank, presenting the first comprehensive MLLM evaluation benchmark, MME. It measures both perception and cognition abilities across a total of 14 subtasks. To avoid the data leakage that may arise from directly using public datasets for evaluation, all instruction-answer pairs are manually designed. The concise instruction design allows us to compare MLLMs fairly, instead of struggling with prompt engineering, and also makes it easy to carry out quantitative statistics. A total of 50+ advanced MLLMs are comprehensively evaluated on MME. The results not only suggest that existing MLLMs still have large room for improvement, but also reveal potential directions for subsequent model optimization.
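As an illustration of how the concise yes/no instruction design makes quantitative statistics straightforward, the sketch below shows one way a per-subtask score (full score 200) could be computed. The two-metric scheme, overall accuracy plus a stricter per-image accuracy that requires both questions about an image to be answered correctly, follows the MME paper; the class and function names (QAPair, score_subtask) are illustrative and not part of any official evaluation code.

```python
# Illustrative sketch of MME-style scoring for a single subtask.
# Each image has two yes/no questions; "accuracy" counts correct answers,
# while the stricter "accuracy+" counts an image only if BOTH answers are right.
# Names here (QAPair, score_subtask) are ours, not an official MME API.
from dataclasses import dataclass

@dataclass
class QAPair:
    image_id: str
    prediction: str  # model output, normalized to "yes" or "no"
    answer: str      # ground-truth "yes" or "no"

def score_subtask(pairs: list[QAPair]) -> float:
    """Return a subtask score in [0, 200]: accuracy*100 + accuracy_plus*100."""
    correct = sum(p.prediction == p.answer for p in pairs)
    accuracy = correct / len(pairs)

    # Group the two questions belonging to each image.
    by_image: dict[str, list[bool]] = {}
    for p in pairs:
        by_image.setdefault(p.image_id, []).append(p.prediction == p.answer)
    accuracy_plus = sum(all(v) for v in by_image.values()) / len(by_image)

    return 100 * accuracy + 100 * accuracy_plus
```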

Leaderboard

Perception is the sum of the scores of all perception subtasks: existence, count, position, color, poster, celebrity, scene, landmark, artwork, and OCR. Each subtask has a full score of 200, so the full perception score is 2000.

Cognition is the sum of the scores of all cognition subtasks: commonsense reasoning, numerical calculation, text translation, and code reasoning. Each subtask has a full score of 200, so the full cognition score is 800.
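The aggregation described above can be stated in a few lines of code. The sketch below assumes per-subtask scores have already been computed (e.g., as in the snippet in the Introduction); the subtask-name strings and the aggregate function are illustrative only.

```python
# Sketch of the leaderboard aggregation: each subtask score lies in [0, 200],
# so perception totals at most 2000 (10 subtasks) and cognition at most 800 (4 subtasks).
PERCEPTION = ["existence", "count", "position", "color", "poster",
              "celebrity", "scene", "landmark", "artwork", "OCR"]
COGNITION = ["commonsense reasoning", "numerical calculation",
             "text translation", "code reasoning"]

def aggregate(subtask_scores: dict[str, float]) -> dict[str, float]:
    perception = sum(subtask_scores[t] for t in PERCEPTION)
    cognition = sum(subtask_scores[t] for t in COGNITION)
    return {"perception": perception, "cognition": cognition,
            "score": perception + cognition}

# Example: a model with the full 200 on every subtask would get
# aggregate({t: 200.0 for t in PERCEPTION + COGNITION})
# -> {"perception": 2000.0, "cognition": 800.0, "score": 2800.0}
```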

| # | Model | LLM Params | Score | Perception | Cognition | Existence | Count | Position | Color | Poster | Celebrity | Scene | Landmark | Artwork | OCR | Commonsense Reasoning | Numerical Calculation | Text Translation | Code Reasoning |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Qwen-VL-Max | - | 2433.61 | 1790.04 | 643.57 | 183.33 | 166.67 | 176.67 | 180.00 | 187.76 | 184.12 | 173.00 | 187.50 | 166.00 | 185.00 | 148.57 | 155.00 | 170.00 | 170.00 |
| 2 | TeleMM | - | 2387.64 | 1706.21 | 681.43 | 195.00 | 165.00 | 160.00 | 185.00 | 187.07 | 165.88 | 161.50 | 182.00 | 157.25 | 147.50 | 171.43 | 140.00 | 200.00 | 170.00 |
| 3 | MiniCPM-V 2.6 | 8B | 2365.88 | 1668.03 | 697.86 | 195.00 | 160.00 | 136.67 | 175.00 | 179.93 | 151.18 | 153.25 | 177.50 | 147.00 | 192.50 | 157.86 | 185.00 | 185.00 | 170.00 |
| 4 | RBDash v1.2 | 72B | 2333.57 | 1769.05 | 564.52 | 200.00 | 170.00 | 158.33 | 190.00 | 185.71 | 175.00 | 169.50 | 183.75 | 144.25 | 192.50 | 163.57 | 147.50 | 80.00 | 173.45 |
| 5 | InternLM-XComposer2-VL | 7B | 2242.71 | 1712.00 | 530.71 | 195.00 | 160.00 | 163.33 | 195.00 | 171.09 | 153.82 | 164.75 | 176.00 | 185.50 | 147.50 | 145.71 | 137.50 | 147.50 | 100.00 |
| 6 | JT-VL-Chat-V2.0 | - | 2204.54 | 1743.11 | 461.43 | 200.00 | 170.00 | 173.33 | 190.00 | 184.35 | 176.18 | 161.50 | 176.75 | 148.50 | 162.50 | 161.43 | 117.50 | 72.50 | 110.00 |
| 7 | InternVL-Chat-V1.5 | 20B | 2187.84 | 1637.84 | 550.00 | 190.00 | 175.00 | 166.67 | 178.33 | 173.81 | 138.53 | 154.75 | 177.75 | 143.00 | 140.00 | 135.00 | 125.00 | 185.00 | 105.00 |
| 8 | Qwen-VL-Plus | - | 2183.39 | 1681.25 | 502.14 | 175.00 | 153.33 | 161.67 | 180.00 | 181.63 | 184.12 | 151.00 | 191.00 | 156.00 | 147.50 | 142.14 | 85.00 | 185.00 | 90.00 |
| 9 | LLaVA-1.6 | 34B | 2028.61 | 1631.47 | 397.14 | 190.00 | 170.00 | 138.33 | 195.00 | 169.39 | 160.00 | 164.50 | 165.25 | 139.00 | 140.00 | 152.14 | 72.50 | 72.50 | 100.00 |
| 10 | Bunny-8B | 8B | 2011.64 | 1644.14 | 367.50 | 195.00 | 165.00 | 135.00 | 195.00 | 167.35 | 175.29 | 153.75 | 170.25 | 132.50 | 155.00 | 140.00 | 75.00 | 95.00 | 57.50 |
| 11 | JT-VL-Chat | - | 1982.15 | 1642.51 | 339.64 | 185.00 | 173.33 | 145.00 | 175.00 | 168.71 | 161.47 | 157.75 | 173.50 | 132.75 | 170.00 | 122.14 | 67.50 | 105.00 | 45.00 |
| 12 | Monkey-Chat | 7B | 1923.82 | 1522.39 | 401.43 | 185.00 | 150.00 | 118.33 | 185.00 | 178.91 | 142.65 | 161.75 | 176.50 | 144.25 | 80.00 | 131.43 | 42.50 | 137.50 | 90.00 |
| 13 | ShareGPT4V | 13B | 1921.92 | 1618.70 | 303.21 | 190.00 | 165.00 | 153.33 | 185.00 | 169.05 | 153.82 | 168.00 | 174.00 | 128.00 | 132.50 | 125.71 | 45.00 | 80.00 | 52.50 |
| 14 | InternLM-XComposer-VL | 7B | 1919.51 | 1528.44 | 391.07 | 190.00 | 158.33 | 126.67 | 165.00 | 161.90 | 150.29 | 159.75 | 165.25 | 126.25 | 125.00 | 138.57 | 55.00 | 112.50 | 85.00 |
| 15 | SPHINX | 13B | 1870.15 | 1560.15 | 310.00 | 195.00 | 160.00 | 153.33 | 160.00 | 164.29 | 177.94 | 160.00 | 168.09 | 134.00 | 87.50 | 130.00 | 55.00 | 75.00 | 50.00 |
| 16 | Qwen-VL-Chat | 7B | 1848.28 | 1487.57 | 360.71 | 158.33 | 150.00 | 128.33 | 170.00 | 178.57 | 120.59 | 152.25 | 164.00 | 125.50 | 140.00 | 130.71 | 40.00 | 147.50 | 42.50 |
| 17 | LLaVA | 13B | 1826.67 | 1531.31 | 295.36 | 185.00 | 155.00 | 133.33 | 170.00 | 160.54 | 152.94 | 161.25 | 170.50 | 117.75 | 125.00 | 127.86 | 42.50 | 77.50 | 47.50 |
| 18 | mPLUG-Owl2 | 7B | 1763.40 | 1450.19 | 313.21 | 185.00 | 155.00 | 88.33 | 150.00 | 160.20 | 164.41 | 153.25 | 157.25 | 134.25 | 102.50 | 115.71 | 35.00 | 102.50 | 60.00 |
| 19 | CogVLM | 7B | 1752.28 | 1439.07 | 313.21 | 195.00 | 165.00 | 103.33 | 160.00 | 146.94 | 115.29 | 159.25 | 158.00 | 88.75 | 147.50 | 125.71 | 60.00 | 75.00 | 52.50 |

"-" indicates an unknown number of parameters.

Citation

@article{fu2023mme,
    title={MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models},
    author={Fu, Chaoyou and Chen, Peixian and Shen, Yunhang and Qin, Yulei and Zhang, Mengdan and Lin, Xu and Yang, Jinrui and Zheng, Xiawu and Li, Ke and Sun, Xing and others},
    journal={arXiv preprint arXiv:2306.13394},
    year={2023}
}