MME

Leaderboard

Sum of the scores of all perception subtasks, including existence, count, position, color, poster, celebrity, scene, landmark, artwork, and OCR. The full score of each subtask is 200, and that of all perception is 2000.

Sum of the scores of all cognition subtasks, including commonsense reasoning, numerical calculation, text translation, and code reasoning. The full score of each subtask is 200, and that of all cognition is 800.

#	Model	LLM Params	Score	Perception	Cognition	Existence	Count	Position	Color	Poster	Celebrity	Scene	Landmark	Artwork	OCR	Commonsense Reasoning	Numerical Calculation	Text Translation	Code Reasoning
#	Model	LLM Params	Score	Perception	Cognition	Existence	Count	Position	Color	Poster	Celebrity	Scene	Landmark	Artwork	OCR	Commonsense Reasoning	Numerical Calculation	Text Translation	Code Reasoning	Qwen-VL-Max	-	2433.61	1790.04	643.57	183.33	166.67	176.67	180.00	187.76	184.12	173.00	187.50	166.00	185.00	148.57	155.00	170.00	170.00
	TeleMM	-	2387.64	1706.21	681.43	195.00	165.00	160.00	185.00	187.07	165.88	161.50	182.00	157.25	147.50	171.43	140.00	200.00	170.000
	MiniCPM-V 2.6	8B	2365.88	1668.03	697.86	195.00	160.00	136.67	175.00	179.93	151.18	153.25	177.50	147.00	192.50	157.86	185.00	185.00	170.00
	RBDash v1.2	72B	2333.57	1769.05	564.52	200.00	170.00	158.33	190.00	185.71	175.00	169.50	183.75	144.25	192.50	163.57	147.50	80.00	173.45
	InternLM-XComposer2-VL 2.6	7B	2242.71	1712.00	530.71	195.00	160.00	163.33	195.00	171.09	153.82	164.75	176.00	185.50	147.50	145.71	137.50	147.50	100.00
	JT-VL-Chat-V2.0	-	2204.54	1743.11	461.43	200.00	170.00	173.33	190.00	184.35	176.18	161.50	176.75	148.50	162.50	161.43	117.50	72.50	110.00
	InternVL-Chat-V1.5	20B	2187.84	1637.84	550.00	190.00	175.00	166.67	178.33	173.81	138.53	154.75	177.75	143.00	140.00	135.00	125.00	185.00	105.00
	Qwen-VL-Plus	-	2183.39	1681.25	502.14	175.00	153.33	161.67	180.00	181.63	184.12	151.00	191.00	156.00	147.50	142.14	85.00	185.00	90.00
	LLaVA-1.6	34B	2028.61	1631.47	397.14	190.00	170.00	138.33	195.00	169.39	160.00	164.50	165.25	139.00	140.00	152.14	72.50	72.50	100.00
	Bunny-8B	8B	2011.64	1644.14	367.50	195.00	165.00	135.00	195.00	167.35	175.29	153.75	170.25	132.50	155.00	140.00	75.00	95.00	57.50
	JT-VL-Chat	-	1982.15	1642.51	339.64	185.00	173.33	145.00	175.00	168.71	161.47	157.75	173.50	132.75	170.00	122.14	67.50	105.00	45.00
	Monkey-Chat	7B	1923.82	1522.39	401.43	185.00	150.00	118.33	185.00	178.91	142.65	161.75	176.50	144.25	80.00	131.43	42.50	137.50	90.00
	ShareGPT4V	13B	1921.92	1618.70	303.21	190.00	165.00	153.33	185.00	169.05	153.82	168.00	174.00	128.00	132.50	125.71	45.00	80.00	52.50
	InternLM-XComposer-VL	7B	1919.51	1528.44	391.07	190.00	158.33	126.67	165.00	161.90	150.29	159.75	165.25	126.25	125.00	138.57	55.00	112.50	85.00
	SPHINX	13B	1870.15	1560.15	310.00	195.00	160.00	153.33	160.00	164.29	177.94	160.00	168.09	134.00	87.50	130.00	55.00	75.00	50.00
	Qwen-VL-Chat	7B	1848.28	1487.57	360.71	158.33	150.00	128.33	170.00	178.57	120.59	152.25	164.00	125.50	140.00	130.71	40.00	147.50	42.50
	LLaVA	13B	1826.67	1531.31	295.36	185.00	155.00	133.33	170.00	160.54	152.94	161.25	170.50	117.75	125.00	127.86	42.50	77.50	47.50
	mPLUG-Owl2	7B	1763.40	1450.19	313.21	185.00	155.00	88.33	150.00	160.20	164.41	153.25	157.25	134.25	102.50	115.71	35.00	102.50	60.00
	CogVLM	7B	1752.28	1439.07	313.21	195.00	165.00	103.33	160.00	146.94	115.29	159.25	158.00	88.75	147.50	125.71	60.00	75.00	52.50

Green date indicates the newly added/updated models - indicates the unknown quantity of parameters

MME

A Comprehensive Evaluation Benchmark for
Multimodal Large Language Models

Introduction

Leaderboard

Citation

MME

A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

Introduction

Leaderboard

Citation

A Comprehensive Evaluation Benchmark for
Multimodal Large Language Models