Article: TinyLVLM-eHub: Towards Comprehensive and Efficient Evaluation for Large Vision-Language Models

Title: TinyLVLM-eHub: Towards Comprehensive and Efficient Evaluation for Large Vision-Language Models
Authors: Shao, Wenqi; Lei, Meng; Hu, Yutao; Gao, Peng; Xu, Peng; Zhang, Kaipeng; Meng, Fanqing; Huang, Siyuan; Li, Hongsheng; Qiao, Yu; Luo, Ping
Keywords: evaluation method; Large vision-language models; multimodal evaluation benchmark
Issue Date: 1-Jan-2025
Publisher: Institute of Electrical and Electronics Engineers
Citation: IEEE Transactions on Big Data, 2025, v. 11, n. 3, p. 933-947
Abstract: Large Vision-Language Models (LVLMs) have made significant strides in various multimodal tasks. Notably, GPT4V, Claude, Gemini, and others showcase exceptional multimodal capabilities, marked by profound comprehension and reasoning skills. This study introduces a comprehensive and efficient evaluation framework, TinyLVLM-eHub, to assess LVLMs’ performance, including proprietary models. TinyLVLM-eHub covers six key multimodal capabilities: visual perception, knowledge acquisition, reasoning, commonsense understanding, object hallucination, and embodied intelligence. The benchmark, utilizing 2.1K image-text pairs, provides a user-friendly and accessible platform for LVLM evaluation. The evaluation employs the ChatGPT Ensemble Evaluation (CEE) method, which improves alignment with human evaluation compared to word-matching approaches. Results reveal that closed-source API models like GPT4V and GeminiPro-V excel in most capabilities compared to previous open-source LVLMs, though they show some vulnerability to object hallucination. This evaluation underscores areas for LVLM improvement in real-world applications and serves as a foundational assessment for future multimodal advancements.
Persistent Identifier: http://hdl.handle.net/10722/362631
ISSN: 2332-7790
2023 Impact Factor: 7.5
2023 SCImago Journal Rankings: 1.821
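
This record does not include the paper's implementation of the ChatGPT Ensemble Evaluation (CEE) method named in the abstract. As a minimal illustrative sketch only, the Python below contrasts a word-matching baseline with a majority vote over several independent judge functions, which is the core idea behind an ensemble evaluator; all names here (word_match, ensemble_vote, the toy judges) are hypothetical, and the paper's actual prompts, judge models, and aggregation rules may differ.

# Hypothetical sketch: majority-vote ensemble judging vs. word matching.
# None of these names come from the paper; CEE itself uses ChatGPT-based judges.
from collections import Counter
from typing import Callable, List

# A judge maps (question, answer, ground_truth) to a correct/incorrect verdict.
Judge = Callable[[str, str, str], bool]

def word_match(question: str, answer: str, ground_truth: str) -> bool:
    """Baseline: mark correct only if the ground truth appears verbatim."""
    return ground_truth.lower() in answer.lower()

def ensemble_vote(judges: List[Judge], question: str, answer: str, gt: str) -> bool:
    """Combine independent judges by majority vote (the ensemble idea)."""
    verdicts = Counter(judge(question, answer, gt) for judge in judges)
    return verdicts[True] > verdicts[False]

if __name__ == "__main__":
    # Toy heuristics standing in for diversely prompted ChatGPT judges.
    judges: List[Judge] = [
        word_match,
        lambda q, a, gt: gt.lower() in a.lower().replace("-", " "),
        lambda q, a, gt: any(t in a.lower().split() for t in gt.lower().split()),
    ]
    question = "What animal is shown in the image?"
    print(ensemble_vote(judges, question, "It looks like a small dog.", "dog"))

In the paper's setting each judge would presumably be a differently prompted ChatGPT call rather than a local heuristic; the voting logic stays the same, and disagreement among judges is what the ensemble smooths over relative to a single word-matching check.
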

 

DC Field: Value
dc.contributor.author: Shao, Wenqi
dc.contributor.author: Lei, Meng
dc.contributor.author: Hu, Yutao
dc.contributor.author: Gao, Peng
dc.contributor.author: Xu, Peng
dc.contributor.author: Zhang, Kaipeng
dc.contributor.author: Meng, Fanqing
dc.contributor.author: Huang, Siyuan
dc.contributor.author: Li, Hongsheng
dc.contributor.author: Qiao, Yu
dc.contributor.author: Luo, Ping
dc.date.accessioned: 2025-09-26T00:36:33Z
dc.date.available: 2025-09-26T00:36:33Z
dc.date.issued: 2025-01-01
dc.identifier.citation: IEEE Transactions on Big Data, 2025, v. 11, n. 3, p. 933-947
dc.identifier.issn: 2332-7790
dc.identifier.uri: http://hdl.handle.net/10722/362631
dc.description.abstract: Large Vision-Language Models (LVLMs) have made significant strides in various multimodal tasks. Notably, GPT4V, Claude, Gemini, and others showcase exceptional multimodal capabilities, marked by profound comprehension and reasoning skills. This study introduces a comprehensive and efficient evaluation framework, TinyLVLM-eHub, to assess LVLMs’ performance, including proprietary models. TinyLVLM-eHub covers six key multimodal capabilities: visual perception, knowledge acquisition, reasoning, commonsense understanding, object hallucination, and embodied intelligence. The benchmark, utilizing 2.1K image-text pairs, provides a user-friendly and accessible platform for LVLM evaluation. The evaluation employs the ChatGPT Ensemble Evaluation (CEE) method, which improves alignment with human evaluation compared to word-matching approaches. Results reveal that closed-source API models like GPT4V and GeminiPro-V excel in most capabilities compared to previous open-source LVLMs, though they show some vulnerability to object hallucination. This evaluation underscores areas for LVLM improvement in real-world applications and serves as a foundational assessment for future multimodal advancements.
dc.language: eng
dc.publisher: Institute of Electrical and Electronics Engineers
dc.relation.ispartof: IEEE Transactions on Big Data
dc.rights: This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
dc.subject: evaluation method
dc.subject: Large vision-language models
dc.subject: multimodal evaluation benchmark
dc.title: TinyLVLM-eHub: Towards Comprehensive and Efficient Evaluation for Large Vision-Language Models
dc.type: Article
dc.identifier.doi: 10.1109/TBDATA.2025.3536930
dc.identifier.scopus: eid_2-s2.0-85217029506
dc.identifier.volume: 11
dc.identifier.issue: 3
dc.identifier.spage: 933
dc.identifier.epage: 947
dc.identifier.eissn: 2332-7790
dc.identifier.issnl: 2332-7790
