Conference Paper: End-to-End Video Text Spotting with Transformer

Title: End-to-End Video Text Spotting with Transformer
Authors: Wu, W; Cai, Y; Shen, C; Zhang, D; Fu, Y; Zhou, H; Luo, P
Issue Date: 2022
Publisher: Ortra Ltd.
Citation: European Conference on Computer Vision (Hybrid), Tel Aviv, Israel, October 23-27, 2022. In Proceedings of the European Conference on Computer Vision (ECCV), 2022
Abstract: Recent video text spotting methods usually require a three-stage pipeline, i.e., detecting text in individual images, recognizing the localized text, and tracking text streams with post-processing to generate the final results. These methods typically follow the tracking-by-match paradigm and develop sophisticated pipelines. In this paper, rooted in Transformer sequence modeling, we propose a simple but effective end-to-end video text DEtection, Tracking, and Recognition framework (TransDETR). TransDETR has two main advantages: 1) Unlike the explicit matching paradigm between adjacent frames, TransDETR tracks and recognizes each text instance implicitly through a dedicated query, termed the text query, over a long-range temporal sequence (more than 7 frames). 2) TransDETR is the first end-to-end trainable video text spotting framework, which simultaneously addresses the three sub-tasks (i.e., text detection, tracking, and recognition). Extensive experiments on four video text datasets (i.e., ICDAR2013 Video, ICDAR2015 Video, Minetto, and YouTube Video Text) demonstrate that TransDETR achieves state-of-the-art performance, with up to around 8.0% improvement on video text spotting tasks.
Persistent Identifier: http://hdl.handle.net/10722/315806
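
The abstract above describes tracking and recognition driven by per-instance text queries that are carried across frames of a clip. The following PyTorch sketch only illustrates that query-propagation idea under assumed dimensions, module names, and prediction heads (TextQueryTracker, box_head, rec_head, and all sizes are hypothetical); it is not the authors' released TransDETR implementation.

# Illustrative sketch (assumed, not the paper's code): a fixed set of text
# queries attends to per-frame features with a Transformer decoder, and the
# updated queries are fed to the next frame, so each query implicitly follows
# one text instance through the clip.
import torch
import torch.nn as nn

class TextQueryTracker(nn.Module):
    def __init__(self, d_model=256, num_queries=100, num_chars=97, max_len=25):
        super().__init__()
        self.text_queries = nn.Embedding(num_queries, d_model)   # one query per tracked text instance
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        self.box_head = nn.Linear(d_model, 4)                     # (cx, cy, w, h) per query
        self.rec_head = nn.Linear(d_model, max_len * num_chars)   # character logits per query
        self.max_len, self.num_chars = max_len, num_chars

    def forward(self, frame_features):
        # frame_features: list of T tensors, each (B, HW, d_model), e.g. from a CNN/Transformer encoder
        B = frame_features[0].size(0)
        queries = self.text_queries.weight.unsqueeze(0).expand(B, -1, -1)
        outputs = []
        for feats in frame_features:
            queries = self.decoder(queries, feats)                # update queries with the current frame
            boxes = self.box_head(queries).sigmoid()
            chars = self.rec_head(queries).view(B, -1, self.max_len, self.num_chars)
            outputs.append({"boxes": boxes, "chars": chars})
        return outputs                                            # per-frame predictions, aligned by query index

# Toy usage: an 8-frame clip with random encoder features.
feats = [torch.randn(2, 196, 256) for _ in range(8)]
out = TextQueryTracker()(feats)
print(len(out), out[0]["boxes"].shape, out[0]["chars"].shape)

Because the same query index is reused from frame to frame, associating detections across time reduces to reading off the per-query outputs, which is the implicit tracking the abstract refers to.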

 

DC Field: Value
dc.contributor.author: Wu, W
dc.contributor.author: Cai, Y
dc.contributor.author: Shen, C
dc.contributor.author: Zhang, D
dc.contributor.author: Fu, Y
dc.contributor.author: Zhou, H
dc.contributor.author: Luo, P
dc.date.accessioned: 2022-08-19T09:04:47Z
dc.date.available: 2022-08-19T09:04:47Z
dc.date.issued: 2022
dc.identifier.citation: European Conference on Computer Vision (Hybrid), Tel Aviv, Israel, October 23-27, 2022. In Proceedings of the European Conference on Computer Vision (ECCV), 2022
dc.identifier.uri: http://hdl.handle.net/10722/315806
dc.description.abstract: Recent video text spotting methods usually require a three-stage pipeline, i.e., detecting text in individual images, recognizing the localized text, and tracking text streams with post-processing to generate the final results. These methods typically follow the tracking-by-match paradigm and develop sophisticated pipelines. In this paper, rooted in Transformer sequence modeling, we propose a simple but effective end-to-end video text DEtection, Tracking, and Recognition framework (TransDETR). TransDETR has two main advantages: 1) Unlike the explicit matching paradigm between adjacent frames, TransDETR tracks and recognizes each text instance implicitly through a dedicated query, termed the text query, over a long-range temporal sequence (more than 7 frames). 2) TransDETR is the first end-to-end trainable video text spotting framework, which simultaneously addresses the three sub-tasks (i.e., text detection, tracking, and recognition). Extensive experiments on four video text datasets (i.e., ICDAR2013 Video, ICDAR2015 Video, Minetto, and YouTube Video Text) demonstrate that TransDETR achieves state-of-the-art performance, with up to around 8.0% improvement on video text spotting tasks.
dc.language: eng
dc.publisher: Ortra Ltd.
dc.relation.ispartof: Proceedings of the European Conference on Computer Vision (ECCV), 2022
dc.title: End-to-End Video Text Spotting with Transformer
dc.type: Conference_Paper
dc.identifier.email: Luo, P: pluo@hku.hk
dc.identifier.authority: Luo, P=rp02575
dc.identifier.doi: 10.48550/arXiv.2203.10539
dc.identifier.hkuros: 335609
dc.publisher.place: Israel
