Conference Paper: STVGBert: A Visual-linguistic Transformer based Framework for Spatio-temporal Video Grounding

Title: STVGBert: A Visual-linguistic Transformer based Framework for Spatio-temporal Video Grounding
Authors: Su, Rui; Yu, Qian; Xu, Dong
Issue Date: 2021
Citation: Proceedings of the IEEE International Conference on Computer Vision, 2021, p. 1513-1522
Abstract: Spatio-temporal video grounding (STVG) aims to localize a spatio-temporal tube of a target object in an untrimmed video based on a query sentence. In this work, we propose a one-stage visual-linguistic transformer based framework called STVGBert for the STVG task, which can simultaneously localize the target object in both spatial and temporal domains. Specifically, without resorting to pre-generated object proposals, our STVGBert directly takes a video and a query sentence as the input, and then produces the cross-modal features by using the newly introduced cross-modal feature learning module ST-ViLBert. Based on the cross-modal features, our method then generates bounding boxes and predicts the starting and ending frames to produce the predicted object tube. To the best of our knowledge, our STVGBert is the first one-stage method that can handle the STVG task without relying on any pre-trained object detectors. Comprehensive experiments demonstrate that our newly proposed framework outperforms the state-of-the-art multi-stage methods on two benchmark datasets, VidSTG and HC-STVG.
Persistent Identifier: http://hdl.handle.net/10722/321976
ISSN: 1550-5499
2023 SCImago Journal Rankings: 12.263
ISI Accession Number ID: WOS:000797698901070
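
The abstract describes a one-stage pipeline: a video and a query sentence are fused by a cross-modal feature learning module (ST-ViLBert), and the resulting joint features drive a spatial head that emits one bounding box per frame and a temporal head that predicts the tube's starting and ending frames. The PyTorch sketch below illustrates only that overall shape; this record does not document ST-ViLBert's internals, so a plain transformer encoder stands in for it, and every class name, dimension, and head here is a hypothetical illustration, not the authors' implementation.

import torch
import torch.nn as nn

class STVGSketch(nn.Module):
    """Shape-level sketch of a one-stage STVG model (all details assumed)."""

    def __init__(self, d_model: int = 256, vocab_size: int = 30522):
        super().__init__()
        # Project per-frame visual features (a real system would use a
        # CNN/ViT backbone; 2048-d inputs are an arbitrary assumption).
        self.visual_proj = nn.Linear(2048, d_model)
        # Embed the query sentence tokens.
        self.text_embed = nn.Embedding(vocab_size, d_model)
        # Stand-in for the paper's ST-ViLBert cross-modal module: a plain
        # transformer encoder over the concatenated visual and text tokens.
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.cross_modal = nn.TransformerEncoder(layer, num_layers=2)
        # Spatial head: one normalized (cx, cy, w, h) box per frame.
        self.box_head = nn.Linear(d_model, 4)
        # Temporal head: per-frame start/end logits for the tube boundaries.
        self.boundary_head = nn.Linear(d_model, 2)

    def forward(self, frame_feats: torch.Tensor, token_ids: torch.Tensor):
        # frame_feats: (B, T, 2048) per-frame features; token_ids: (B, L).
        v = self.visual_proj(frame_feats)            # (B, T, d)
        t = self.text_embed(token_ids)               # (B, L, d)
        x = self.cross_modal(torch.cat([v, t], 1))   # joint features
        vis = x[:, : v.size(1)]                      # visual positions only
        boxes = self.box_head(vis).sigmoid()         # (B, T, 4)
        start_end = self.boundary_head(vis)          # (B, T, 2)
        return boxes, start_end

# Toy usage: one clip of 8 frames, a 12-token query.
model = STVGSketch()
boxes, start_end = model(torch.randn(1, 8, 2048),
                         torch.randint(0, 30522, (1, 12)))
print(boxes.shape, start_end.shape)  # (1, 8, 4) and (1, 8, 2)

Reading the outputs in the paper's terms, the per-frame boxes form the spatio-temporal tube and the start/end logits trim it in time; the actual model's backbones, ST-ViLBert design, and training losses are not recoverable from this record.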


DC Field | Value | Language
dc.contributor.author | Su, Rui | -
dc.contributor.author | Yu, Qian | -
dc.contributor.author | Xu, Dong | -
dc.date.accessioned | 2022-11-03T02:22:45Z | -
dc.date.available | 2022-11-03T02:22:45Z | -
dc.date.issued | 2021 | -
dc.identifier.citation | Proceedings of the IEEE International Conference on Computer Vision, 2021, p. 1513-1522 | -
dc.identifier.issn | 1550-5499 | -
dc.identifier.uri | http://hdl.handle.net/10722/321976 | -
dc.description.abstract | Spatio-temporal video grounding (STVG) aims to localize a spatio-temporal tube of a target object in an untrimmed video based on a query sentence. In this work, we propose a one-stage visual-linguistic transformer based framework called STVGBert for the STVG task, which can simultaneously localize the target object in both spatial and temporal domains. Specifically, without resorting to pre-generated object proposals, our STVGBert directly takes a video and a query sentence as the input, and then produces the cross-modal features by using the newly introduced cross-modal feature learning module ST-ViLBert. Based on the cross-modal features, our method then generates bounding boxes and predicts the starting and ending frames to produce the predicted object tube. To the best of our knowledge, our STVGBert is the first one-stage method that can handle the STVG task without relying on any pre-trained object detectors. Comprehensive experiments demonstrate that our newly proposed framework outperforms the state-of-the-art multi-stage methods on two benchmark datasets, VidSTG and HC-STVG. | -
dc.language | eng | -
dc.relation.ispartof | Proceedings of the IEEE International Conference on Computer Vision | -
dc.title | STVGBert: A Visual-linguistic Transformer based Framework for Spatio-temporal Video Grounding | -
dc.type | Conference_Paper | -
dc.description.nature | link_to_subscribed_fulltext | -
dc.identifier.doi | 10.1109/ICCV48922.2021.00156 | -
dc.identifier.scopus | eid_2-s2.0-85122350326 | -
dc.identifier.spage | 1513 | -
dc.identifier.epage | 1522 | -
dc.identifier.isi | WOS:000797698901070 | -
