Links for fulltext
(May Require Subscription)
- Publisher Website: 10.1109/TCSVT.2025.3566695
- Scopus: eid_2-s2.0-105004593993

Article: Video Understanding with Large Language Models: A Survey
| Title | Video Understanding with Large Language Models: A Survey |
|---|---|
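| Authors | Tang, Yunlong; Bi, Jing; Xu, Siting; Song, Luchuan; Liang, Susan; Wang, Teng; Zhang, Daoan; An, Jie; Lin, Jingyang; Zhu, Rongyi; Vosoughi, Ali; Huang, Chao; Zhang, Zeliang; Liu, Pinxin; Feng, Mingqian; Zheng, Feng; Zhang, Jianguo; Luo, Ping; Luo, Jiebo; Xu, Chenliang |
| Keywords | Large Language Model; Multimodality Learning; Video Understanding; Vision-Language Model |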
| Issue Date | 1-Jan-2025 |
| Publisher | Institute of Electrical and Electronics Engineers |
| Citation | IEEE Transactions on Circuits and Systems for Video Technology, 2025 |
| Abstract | With the rapid growth of online video platforms and the escalating volume of video content, the need for proficient video understanding tools has increased significantly. Given the remarkable capabilities of large language models (LLMs) in language and multimodal tasks, this survey provides a detailed overview of recent advances in video understanding that harness the power of LLMs (Vid-LLMs). The emergent capabilities of Vid-LLMs are surprisingly advanced, particularly their ability for open-ended multi-granularity (abstract, temporal, and spatiotemporal) reasoning combined with common-sense knowledge, suggesting a promising path for future video understanding. We examine the unique characteristics and capabilities of Vid-LLMs, categorizing the approaches into three main types: Video Analyzer × LLM, Video Embedder × LLM, and (Analyzer + Embedder) × LLM. We identify five subtypes based on the functions of LLMs in Vid-LLMs: LLM as Summarizer, LLM as Manager, LLM as Text Decoder, LLM as Regressor, and LLM as Hidden Layer. This survey also presents a comprehensive study of the tasks, datasets, benchmarks, and evaluation methods for Vid-LLMs. Additionally, it explores the extensive applications of Vid-LLMs in various domains, highlighting their remarkable scalability and versatility in real-world video understanding challenges. Additionally, it summarizes the limitations of existing Vid-LLMs and outlines directions for future research. For more information, readers are encouraged to visit the repository at https://github.com/yunlong10/Awesome-LLMs-for-Video-Understanding. |
| Persistent Identifier | http://hdl.handle.net/10722/362643 |
| ISSN | 1051-8215 (2023 Impact Factor: 8.3; 2023 SCImago Journal Rankings: 2.299) |

| DC Field | Value | Language |
|---|---|---|
| dc.contributor.author | Tang, Yunlong | - |
| dc.contributor.author | Bi, Jing | - |
| dc.contributor.author | Xu, Siting | - |
| dc.contributor.author | Song, Luchuan | - |
| dc.contributor.author | Liang, Susan | - |
| dc.contributor.author | Wang, Teng | - |
| dc.contributor.author | Zhang, Daoan | - |
| dc.contributor.author | An, Jie | - |
| dc.contributor.author | Lin, Jingyang | - |
| dc.contributor.author | Zhu, Rongyi | - |
| dc.contributor.author | Vosoughi, Ali | - |
| dc.contributor.author | Huang, Chao | - |
| dc.contributor.author | Zhang, Zeliang | - |
| dc.contributor.author | Liu, Pinxin | - |
| dc.contributor.author | Feng, Mingqian | - |
| dc.contributor.author | Zheng, Feng | - |
| dc.contributor.author | Zhang, Jianguo | - |
| dc.contributor.author | Luo, Ping | - |
| dc.contributor.author | Luo, Jiebo | - |
| dc.contributor.author | Xu, Chenliang | - |
| dc.date.accessioned | 2025-09-26T00:36:40Z | - |
| dc.date.available | 2025-09-26T00:36:40Z | - |
| dc.date.issued | 2025-01-01 | - |
| dc.identifier.citation | IEEE Transactions on Circuits and Systems for Video Technology, 2025 | - |
| dc.identifier.issn | 1051-8215 | - |
| dc.identifier.uri | http://hdl.handle.net/10722/362643 | - |
| dc.description.abstract | With the rapid growth of online video platforms and the escalating volume of video content, the need for proficient video understanding tools has increased significantly. Given the remarkable capabilities of large language models (LLMs) in language and multimodal tasks, this survey provides a detailed overview of recent advances in video understanding that harness the power of LLMs (Vid-LLMs). The emergent capabilities of Vid-LLMs are surprisingly advanced, particularly their ability for open-ended multi-granularity (abstract, temporal, and spatiotemporal) reasoning combined with common-sense knowledge, suggesting a promising path for future video understanding. We examine the unique characteristics and capabilities of Vid-LLMs, categorizing the approaches into three main types: Video Analyzer × LLM, Video Embedder × LLM, and (Analyzer + Embedder) × LLM. We identify five subtypes based on the functions of LLMs in Vid-LLMs: LLM as Summarizer, LLM as Manager, LLM as Text Decoder, LLM as Regressor, and LLM as Hidden Layer. This survey also presents a comprehensive study of the tasks, datasets, benchmarks, and evaluation methods for Vid-LLMs. Additionally, it explores the extensive applications of Vid-LLMs in various domains, highlighting their remarkable scalability and versatility in real-world video understanding challenges. Additionally, it summarizes the limitations of existing Vid-LLMs and outlines directions for future research. For more information, readers are encouraged to visit the repository at https://github.com/yunlong10/Awesome-LLMs-for-Video-Understanding. | - |
| dc.language | eng | - |
| dc.publisher | Institute of Electrical and Electronics Engineers | - |
| dc.relation.ispartof | IEEE Transactions on Circuits and Systems for Video Technology | - |
| dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
| dc.subject | Large Language Model | - |
| dc.subject | Multimodality Learning | - |
| dc.subject | Video Understanding | - |
| dc.subject | Vision-Language Model | - |
| dc.title | Video Understanding with Large Language Models: A Survey | - |
| dc.type | Article | - |
| dc.identifier.doi | 10.1109/TCSVT.2025.3566695 | - |
| dc.identifier.scopus | eid_2-s2.0-105004593993 | - |
| dc.identifier.eissn | 1558-2205 | - |
| dc.identifier.issnl | 1051-8215 | - |
