Links for fulltext
(May Require Subscription)
- Publisher Website: 10.1109/TIP.2021.3089355
- Scopus: eid_2-s2.0-85111735059
- PMID: 34166188
- WOS: WOS:000679941200004
Article: Improving Weakly Supervised Temporal Action Localization by Exploiting Multi-Resolution Information in Temporal Domain
Title | Improving Weakly Supervised Temporal Action Localization by Exploiting Multi-Resolution Information in Temporal Domain |
---|---|
Authors | Su, Rui; Xu, Dong; Zhou, Luping; Ouyang, Wanli |
Keywords | temporal multi-resolution information; two stream fusion; Weakly supervised temporal action localization |
Issue Date | 2021 |
Citation | IEEE Transactions on Image Processing, 2021, v. 30, p. 6659-6672 |
Abstract | Weakly supervised temporal action localization is a challenging task because only video-level annotations are available during training. To address this problem, we propose a two-stage approach to generate high-quality frame-level pseudo labels by fully exploiting multi-resolution information in the temporal domain and complementary information between the appearance (i.e., RGB) and motion (i.e., optical flow) streams. In the first stage, we propose an Initial Label Generation (ILG) module to generate reliable initial frame-level pseudo labels. Specifically, in this newly proposed module, we exploit temporal multi-resolution consistency and cross-stream consistency to generate high-quality class activation sequences (CASs), a set of sequences each of which measures how likely each video frame is to belong to one specific action class. In the second stage, we propose a Progressive Temporal Label Refinement (PTLR) framework to iteratively refine the pseudo labels, in which we use a set of selected frames with highly confident pseudo labels to progressively train two networks and better predict action class scores at each frame. Specifically, in our newly proposed PTLR framework, two networks called Network-OTS and Network-RTS, which are respectively used to generate CASs for the original temporal scale and the reduced temporal scales, are used as two streams (i.e., the OTS stream and the RTS stream) to refine the pseudo labels in turn. In this way, multi-resolution information in the temporal domain is exchanged at the pseudo label level, and our work can help improve each network/stream by exploiting the refined pseudo labels from the other network/stream. Comprehensive experiments on two benchmark datasets, THUMOS14 and ActivityNet v1.3, demonstrate the effectiveness of our newly proposed method for weakly supervised temporal action localization. |
Persistent Identifier | http://hdl.handle.net/10722/321949 |
ISSN | 1057-7149 (2023 Impact Factor: 10.8; 2023 SCImago Journal Rankings: 3.556) |
ISI Accession Number ID | WOS:000679941200004 |
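
The abstract describes fusing class activation sequences (CASs) from the original and reduced temporal scales and keeping only highly confident frames as pseudo labels. The following is a minimal sketch of that general idea, not the authors' implementation: the function names, the averaging-based fusion, the nearest-neighbour upsampling, and the thresholds `hi` and `lo` are all assumptions made for illustration.

```python
from typing import List, Optional, Set

Cas = List[List[float]]  # T frames x C classes, scores in [0, 1]


def downsample(cas: Cas, factor: int) -> Cas:
    """Average-pool a CAS over time to obtain a reduced temporal scale."""
    pooled = []
    for start in range(0, len(cas), factor):
        window = cas[start:start + factor]
        pooled.append([sum(col) / len(window) for col in zip(*window)])
    return pooled


def upsample(cas: Cas, factor: int, length: int) -> Cas:
    """Repeat each time step `factor` times, then trim to `length` frames."""
    return [row for row in cas for _ in range(factor)][:length]


def fuse(cas_ots: Cas, cas_rts: Cas, factor: int) -> Cas:
    """Average the original-scale CAS with the upsampled reduced-scale CAS.

    Plain averaging is an assumed stand-in for the paper's consistency scheme.
    """
    up = upsample(cas_rts, factor, len(cas_ots))
    return [[(a + b) / 2 for a, b in zip(r1, r2)] for r1, r2 in zip(cas_ots, up)]


def confident_pseudo_labels(
    cas: Cas, video_classes: Set[int], hi: float = 0.8, lo: float = 0.2
) -> List[List[Optional[int]]]:
    """Return 1 (action), 0 (background), or None (uncertain, ignored) per frame and class."""
    labels = []
    for row in cas:
        frame: List[Optional[int]] = []
        for c, score in enumerate(row):
            if c not in video_classes:
                frame.append(0)      # class absent in the video-level annotation
            elif score >= hi:
                frame.append(1)      # confident action frame
            elif score <= lo:
                frame.append(0)      # confident background frame
            else:
                frame.append(None)   # uncertain: left out of training
        labels.append(frame)
    return labels


# Hypothetical scores for a 4-frame video with 2 classes; only class 0 is annotated.
cas_ots = [[1.0, 0.0], [0.75, 0.25], [0.5, 0.0], [0.0, 0.0]]
cas_rts = [[1.0, 0.0], [0.0, 0.0]]  # reduced scale, temporal factor 2
fused = fuse(cas_ots, cas_rts, factor=2)
labels = confident_pseudo_labels(fused, video_classes={0})
```

In a refinement loop of the kind the abstract outlines, the frames labeled 1 or 0 would supervise one network, whose new CAS would then produce the labels used to train the other; that alternation is omitted here.
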
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Su, Rui | - |
dc.contributor.author | Xu, Dong | - |
dc.contributor.author | Zhou, Luping | - |
dc.contributor.author | Ouyang, Wanli | - |
dc.date.accessioned | 2022-11-03T02:22:34Z | - |
dc.date.available | 2022-11-03T02:22:34Z | - |
dc.date.issued | 2021 | - |
dc.identifier.citation | IEEE Transactions on Image Processing, 2021, v. 30, p. 6659-6672 | - |
dc.identifier.issn | 1057-7149 | - |
dc.identifier.uri | http://hdl.handle.net/10722/321949 | - |
dc.description.abstract | Weakly supervised temporal action localization is a challenging task because only video-level annotations are available during training. To address this problem, we propose a two-stage approach to generate high-quality frame-level pseudo labels by fully exploiting multi-resolution information in the temporal domain and complementary information between the appearance (i.e., RGB) and motion (i.e., optical flow) streams. In the first stage, we propose an Initial Label Generation (ILG) module to generate reliable initial frame-level pseudo labels. Specifically, in this newly proposed module, we exploit temporal multi-resolution consistency and cross-stream consistency to generate high-quality class activation sequences (CASs), a set of sequences each of which measures how likely each video frame is to belong to one specific action class. In the second stage, we propose a Progressive Temporal Label Refinement (PTLR) framework to iteratively refine the pseudo labels, in which we use a set of selected frames with highly confident pseudo labels to progressively train two networks and better predict action class scores at each frame. Specifically, in our newly proposed PTLR framework, two networks called Network-OTS and Network-RTS, which are respectively used to generate CASs for the original temporal scale and the reduced temporal scales, are used as two streams (i.e., the OTS stream and the RTS stream) to refine the pseudo labels in turn. In this way, multi-resolution information in the temporal domain is exchanged at the pseudo label level, and our work can help improve each network/stream by exploiting the refined pseudo labels from the other network/stream. Comprehensive experiments on two benchmark datasets, THUMOS14 and ActivityNet v1.3, demonstrate the effectiveness of our newly proposed method for weakly supervised temporal action localization. | - |
dc.language | eng | - |
dc.relation.ispartof | IEEE Transactions on Image Processing | - |
dc.subject | temporal multi-resolution information | - |
dc.subject | two stream fusion | - |
dc.subject | Weakly supervised temporal action localization | - |
dc.title | Improving Weakly Supervised Temporal Action Localization by Exploiting Multi-Resolution Information in Temporal Domain | - |
dc.type | Article | - |
dc.description.nature | link_to_subscribed_fulltext | - |
dc.identifier.doi | 10.1109/TIP.2021.3089355 | - |
dc.identifier.pmid | 34166188 | - |
dc.identifier.scopus | eid_2-s2.0-85111735059 | - |
dc.identifier.volume | 30 | - |
dc.identifier.spage | 6659 | - |
dc.identifier.epage | 6672 | - |
dc.identifier.eissn | 1941-0042 | - |
dc.identifier.isi | WOS:000679941200004 | - |