File Download
There are no files associated with this item.
Links for fulltext (may require subscription)
- Publisher Website: https://doi.org/10.1109/TBDATA.2025.3536930
- Scopus: eid_2-s2.0-85217029506

Citations:
- Scopus: 0
Article: TinyLVLM-eHub: Towards Comprehensive and Efficient Evaluation for Large Vision-Language Models
| Title | TinyLVLM-eHub: Towards Comprehensive and Efficient Evaluation for Large Vision-Language Models |
|---|---|
| Authors | Shao, Wenqi; Lei, Meng; Hu, Yutao; Gao, Peng; Xu, Peng; Zhang, Kaipeng; Meng, Fanqing; Huang, Siyuan; Li, Hongsheng; Qiao, Yu; Luo, Ping |
| Keywords | evaluation method; Large vision-language models; multimodal evaluation benchmark |
| Issue Date | 1-Jan-2025 |
| Publisher | Institute of Electrical and Electronics Engineers |
| Citation | IEEE Transactions on Big Data, 2025, v. 11, n. 3, p. 933-947 |
| Abstract | Large Vision-Language Models (LVLMs) have made significant strides in various multimodal tasks. Notably, GPT4V, Claude, Gemini, and others showcase exceptional multimodal capabilities, marked by profound comprehension and reasoning skills. This study introduces a comprehensive and efficient evaluation framework, TinyLVLM-eHub, to assess the performance of LVLMs, including proprietary models. TinyLVLM-eHub covers six key multimodal capabilities: visual perception, knowledge acquisition, reasoning, commonsense understanding, object hallucination, and embodied intelligence. The benchmark, built on 2.1K image-text pairs, provides a user-friendly and accessible platform for LVLM evaluation. The evaluation employs the ChatGPT Ensemble Evaluation (CEE) method, which aligns better with human judgment than word-matching approaches. Results reveal that closed-source API models such as GPT4V and GeminiPro-V outperform previous open-source LVLMs in most capabilities, though they remain somewhat vulnerable to object hallucination. This evaluation underscores areas for LVLM improvement in real-world applications and serves as a foundational assessment for future multimodal advancements. |
| Persistent Identifier | http://hdl.handle.net/10722/362631 |
| ISSN | 2332-7790 (2023 Impact Factor: 7.5; 2023 SCImago Journal Rankings: 1.821) |
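The CEE method named in the abstract is described here only at a high level. As a minimal, purely illustrative sketch (not the paper's implementation), the snippet below contrasts the word-matching baseline with a majority vote over an ensemble of judge functions; in CEE the judges are prompted ChatGPT calls, whereas the callables here are hypothetical offline stand-ins.

```python
from collections import Counter
from typing import Callable, List

# A judge maps (model prediction, ground-truth answer) to a correctness verdict.
Judge = Callable[[str, str], bool]

def word_match(prediction: str, answer: str) -> bool:
    """The naive baseline CEE is compared against: substring matching."""
    return answer.lower() in prediction.lower()

def ensemble_verdict(prediction: str, answer: str, judges: List[Judge]) -> bool:
    """Schematic CEE: poll every judge and return the majority vote."""
    votes = Counter(judge(prediction, answer) for judge in judges)
    return votes[True] > votes[False]

if __name__ == "__main__":
    # Hypothetical stand-in judges; the real method uses diverse ChatGPT prompts.
    judges: List[Judge] = [
        word_match,
        lambda p, a: a.lower() in p.lower().replace("-", " "),
        lambda p, a: set(a.lower().split()) <= set(p.lower().split()),
    ]
    print(ensemble_verdict("The image shows a red fire truck.", "fire truck", judges))
```

Majority voting makes the verdict robust to any single judge's quirks, which is the intuition behind ensembling judge prompts rather than relying on one matching rule.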
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.author | Shao, Wenqi | - |
| dc.contributor.author | Lei, Meng | - |
| dc.contributor.author | Hu, Yutao | - |
| dc.contributor.author | Gao, Peng | - |
| dc.contributor.author | Xu, Peng | - |
| dc.contributor.author | Zhang, Kaipeng | - |
| dc.contributor.author | Meng, Fanqing | - |
| dc.contributor.author | Huang, Siyuan | - |
| dc.contributor.author | Li, Hongsheng | - |
| dc.contributor.author | Qiao, Yu | - |
| dc.contributor.author | Luo, Ping | - |
| dc.date.accessioned | 2025-09-26T00:36:33Z | - |
| dc.date.available | 2025-09-26T00:36:33Z | - |
| dc.date.issued | 2025-01-01 | - |
| dc.identifier.citation | IEEE Transactions on Big Data, 2025, v. 11, n. 3, p. 933-947 | - |
| dc.identifier.issn | 2332-7790 | - |
| dc.identifier.uri | http://hdl.handle.net/10722/362631 | - |
| dc.description.abstract | Large Vision-Language Models (LVLMs) have made significant strides in various multimodal tasks. Notably, GPT4V, Claude, Gemini, and others showcase exceptional multimodal capabilities, marked by profound comprehension and reasoning skills. This study introduces a comprehensive and efficient evaluation framework, TinyLVLM-eHub, to assess the performance of LVLMs, including proprietary models. TinyLVLM-eHub covers six key multimodal capabilities: visual perception, knowledge acquisition, reasoning, commonsense understanding, object hallucination, and embodied intelligence. The benchmark, built on 2.1K image-text pairs, provides a user-friendly and accessible platform for LVLM evaluation. The evaluation employs the ChatGPT Ensemble Evaluation (CEE) method, which aligns better with human judgment than word-matching approaches. Results reveal that closed-source API models such as GPT4V and GeminiPro-V outperform previous open-source LVLMs in most capabilities, though they remain somewhat vulnerable to object hallucination. This evaluation underscores areas for LVLM improvement in real-world applications and serves as a foundational assessment for future multimodal advancements. | - |
| dc.language | eng | - |
| dc.publisher | Institute of Electrical and Electronics Engineers | - |
| dc.relation.ispartof | IEEE Transactions on Big Data | - |
| dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
| dc.subject | evaluation method | - |
| dc.subject | Large vision-language models | - |
| dc.subject | multimodal evaluation benchmark | - |
| dc.title | TinyLVLM-eHub: Towards Comprehensive and Efficient Evaluation for Large Vision-Language Models | - |
| dc.type | Article | - |
| dc.identifier.doi | 10.1109/TBDATA.2025.3536930 | - |
| dc.identifier.scopus | eid_2-s2.0-85217029506 | - |
| dc.identifier.volume | 11 | - |
| dc.identifier.issue | 3 | - |
| dc.identifier.spage | 933 | - |
| dc.identifier.epage | 947 | - |
| dc.identifier.eissn | 2332-7790 | - |
| dc.identifier.issnl | 2332-7790 | - |
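For readers who want a formatted reference, the DOI recorded in dc.identifier.doi can be resolved through standard DOI content negotiation. A minimal stdlib-only sketch, assuming network access and that the registrar (Crossref, for IEEE titles) serves the application/x-bibtex media type:

```python
import urllib.request

DOI = "10.1109/TBDATA.2025.3536930"  # value of dc.identifier.doi above

# Content negotiation: ask doi.org for BibTeX rather than following the
# redirect to the publisher landing page.
request = urllib.request.Request(
    f"https://doi.org/{DOI}",
    headers={"Accept": "application/x-bibtex"},
)
with urllib.request.urlopen(request) as response:
    print(response.read().decode("utf-8"))
```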
