
Article: Improving Weakly Supervised Temporal Action Localization by Exploiting Multi-Resolution Information in Temporal Domain

Title: Improving Weakly Supervised Temporal Action Localization by Exploiting Multi-Resolution Information in Temporal Domain
Authors: Su, Rui; Xu, Dong; Zhou, Luping; Ouyang, Wanli
Keywords: temporal multi-resolution information; two stream fusion; Weakly supervised temporal action localization
Issue Date: 2021
Citation: IEEE Transactions on Image Processing, 2021, v. 30, p. 6659-6672
Abstract: Weakly supervised temporal action localization is a challenging task as only the video-level annotation is available during the training process. To address this problem, we propose a two-stage approach to generate high-quality frame-level pseudo labels by fully exploiting multi-resolution information in the temporal domain and complementary information between the appearance (i.e., RGB) and motion (i.e., optical flow) streams. In the first stage, we propose an Initial Label Generation (ILG) module to generate reliable initial frame-level pseudo labels. Specifically, in this newly proposed module, we exploit temporal multi-resolution consistency and cross-stream consistency to generate high quality class activation sequences (CASs), which consist of a number of sequences with each sequence measuring how likely each video frame belongs to one specific action class. In the second stage, we propose a Progressive Temporal Label Refinement (PTLR) framework to iteratively refine the pseudo labels, in which we use a set of selected frames with highly confident pseudo labels to progressively train two networks and better predict action class scores at each frame. Specifically, in our newly proposed PTLR framework, two networks called Network-OTS and Network-RTS, which are respectively used to generate CASs for the original temporal scale and the reduced temporal scales, are used as two streams (i.e., the OTS stream and the RTS stream) to refine the pseudo labels in turn. By this way, multi-resolution information in the temporal domain is exchanged at the pseudo label level, and our work can help improve each network/stream by exploiting the refined pseudo labels from another network/stream. Comprehensive experiments on two benchmark datasets THUMOS14 and ActivityNet v1.3 demonstrate the effectiveness of our newly proposed method for weakly supervised temporal action localization.
Persistent Identifier: http://hdl.handle.net/10722/321949
ISSN: 1057-7149
2023 Impact Factor: 10.8
2023 SCImago Journal Rankings: 3.556
ISI Accession Number ID: WOS:000679941200004
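
The abstract describes class activation sequences (CASs) computed at an original and a reduced temporal scale, and the selection of frames with highly confident pseudo labels. The record gives no implementation details, so the following is only a rough sketch of that general idea for a single action class, not the authors' ILG or PTLR method. The function names, the nearest-neighbour upsampling, the simple averaging of the two scales, and the thresholds `hi` and `lo` are all assumptions made for illustration.

```python
from typing import List


def upsample_nearest(scores: List[float], target_len: int) -> List[float]:
    """Stretch a reduced-scale score sequence to target_len frames
    by nearest-neighbour repetition (an assumed, simple alignment)."""
    if not scores:
        raise ValueError("scores must be non-empty")
    n = len(scores)
    return [scores[min(i * n // target_len, n - 1)] for i in range(target_len)]


def confident_pseudo_labels(
    cas_ots: List[float],  # per-frame scores at the original temporal scale
    cas_rts: List[float],  # per-frame scores at a reduced temporal scale
    hi: float = 0.7,       # hypothetical threshold for confident foreground
    lo: float = 0.3,       # hypothetical threshold for confident background
) -> List[int]:
    """Return 1 (action), 0 (background), or -1 (not confident, ignored)
    for each frame of the original temporal scale."""
    cas_rts_up = upsample_nearest(cas_rts, len(cas_ots))
    # Assumed fusion rule: plain average of the two temporal resolutions.
    fused = [(a + b) / 2.0 for a, b in zip(cas_ots, cas_rts_up)]
    labels = []
    for score in fused:
        if score >= hi:
            labels.append(1)
        elif score <= lo:
            labels.append(0)
        else:
            labels.append(-1)
    return labels


if __name__ == "__main__":
    print(confident_pseudo_labels([0.9, 0.8, 0.6, 0.1], [0.9, 0.1]))
```

In a refinement loop of the kind the abstract outlines, only frames labelled 0 or 1 would be used to train the next network, while frames labelled -1 would be left out until a later round assigns them a confident score.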

 

DC Field | Value | Language
dc.contributor.author | Su, Rui | -
dc.contributor.author | Xu, Dong | -
dc.contributor.author | Zhou, Luping | -
dc.contributor.author | Ouyang, Wanli | -
dc.date.accessioned | 2022-11-03T02:22:34Z | -
dc.date.available | 2022-11-03T02:22:34Z | -
dc.date.issued | 2021 | -
dc.identifier.citation | IEEE Transactions on Image Processing, 2021, v. 30, p. 6659-6672 | -
dc.identifier.issn | 1057-7149 | -
dc.identifier.uri | http://hdl.handle.net/10722/321949 | -
dc.description.abstract | Weakly supervised temporal action localization is a challenging task as only the video-level annotation is available during the training process. To address this problem, we propose a two-stage approach to generate high-quality frame-level pseudo labels by fully exploiting multi-resolution information in the temporal domain and complementary information between the appearance (i.e., RGB) and motion (i.e., optical flow) streams. In the first stage, we propose an Initial Label Generation (ILG) module to generate reliable initial frame-level pseudo labels. Specifically, in this newly proposed module, we exploit temporal multi-resolution consistency and cross-stream consistency to generate high quality class activation sequences (CASs), which consist of a number of sequences with each sequence measuring how likely each video frame belongs to one specific action class. In the second stage, we propose a Progressive Temporal Label Refinement (PTLR) framework to iteratively refine the pseudo labels, in which we use a set of selected frames with highly confident pseudo labels to progressively train two networks and better predict action class scores at each frame. Specifically, in our newly proposed PTLR framework, two networks called Network-OTS and Network-RTS, which are respectively used to generate CASs for the original temporal scale and the reduced temporal scales, are used as two streams (i.e., the OTS stream and the RTS stream) to refine the pseudo labels in turn. By this way, multi-resolution information in the temporal domain is exchanged at the pseudo label level, and our work can help improve each network/stream by exploiting the refined pseudo labels from another network/stream. Comprehensive experiments on two benchmark datasets THUMOS14 and ActivityNet v1.3 demonstrate the effectiveness of our newly proposed method for weakly supervised temporal action localization. | -
dc.language | eng | -
dc.relation.ispartof | IEEE Transactions on Image Processing | -
dc.subject | temporal multi-resolution information | -
dc.subject | two stream fusion | -
dc.subject | Weakly supervised temporal action localization | -
dc.title | Improving Weakly Supervised Temporal Action Localization by Exploiting Multi-Resolution Information in Temporal Domain | -
dc.type | Article | -
dc.description.nature | link_to_subscribed_fulltext | -
dc.identifier.doi | 10.1109/TIP.2021.3089355 | -
dc.identifier.pmid | 34166188 | -
dc.identifier.scopus | eid_2-s2.0-85111735059 | -
dc.identifier.volume | 30 | -
dc.identifier.spage | 6659 | -
dc.identifier.epage | 6672 | -
dc.identifier.eissn | 1941-0042 | -
dc.identifier.isi | WOS:000679941200004 | -
