Links for fulltext (may require subscription):
- Publisher Website: 10.1007/978-3-030-01228-1_9
- Scopus: eid_2-s2.0-85055104808
- WOS: WOS:000594216400009
Conference Paper: Fine-Grained Video Categorization with Redundancy Reduction Attention
Title | Fine-Grained Video Categorization with Redundancy Reduction Attention |
---|---|
Authors | Zhu, Chen; Tan, Xiao; Zhou, Feng; Liu, Xiao; Yue, Kaiyu; Ding, Errui; Ma, Yi |
Keywords | Attention mechanism; Fine-grained video categorization |
Issue Date | 2018 |
Citation | Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2018, v. 11209 LNCS, p. 139-155 |
Abstract | For fine-grained categorization tasks, videos can serve as a better source than static images, as they are more likely to contain discriminative patterns. Nevertheless, a video sequence may also contain many redundant and irrelevant frames, and locating the critical information of interest is a challenging task. In this paper, we propose a new network structure, known as Redundancy Reduction Attention (RRA), which learns to focus on multiple discriminative patterns by suppressing redundant feature channels. Specifically, it first summarizes the video by weighted-summing all feature vectors in the feature maps of selected frames with a spatio-temporal soft attention, and then predicts which channels to suppress or enhance according to this summary with a learned non-linear transform. Suppression is achieved by modulating the feature maps and thresholding out weak activations. The updated feature maps are then used in the next iteration. Finally, the video is classified based on the multiple summaries. The proposed method achieves outstanding performance on multiple video classification datasets. Furthermore, we have collected two large-scale video datasets, YouTube-Birds and YouTube-Cars, for future research on fine-grained video categorization. The datasets are available at http://www.cs.umd.edu/~chenzhu/fgvc. |
Persistent Identifier | http://hdl.handle.net/10722/327209 |
ISSN | 0302-9743 (2023 SCImago Journal Rankings: 0.606) |
ISI Accession Number ID | WOS:000594216400009 |
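The RRA iteration described in the abstract can be sketched in a few lines: a spatio-temporal soft attention produces a weighted summary of the frame features, a learned non-linear transform maps that summary to per-channel gates, and weak activations are thresholded out before the next iteration. This is a minimal NumPy sketch, not the paper's exact architecture; the parameterization (`w_att`, the two-layer `W1`/`W2` gating MLP, the tanh gate, and the zero threshold) is assumed for illustration.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def rra_step(feats, w_att, W1, W2, thresh=0.0):
    """One Redundancy Reduction Attention iteration (sketch).

    feats : (N, C) feature vectors of the selected frames, flattened
            over spatio-temporal positions (N = T*H*W).
    w_att : (C,) attention projection (hypothetical parameterization).
    W1, W2: weights of the channel-gating MLP (hypothetical shapes
            (H, C) and (C, H)).
    """
    # Spatio-temporal soft attention: one scalar weight per position.
    att = softmax(feats @ w_att)                        # (N,)
    summary = att @ feats                               # (C,) weighted sum
    # Learned non-linear transform predicts per-channel modulation.
    gate = np.tanh(W2 @ np.maximum(W1 @ summary, 0.0))  # (C,) in (-1, 1)
    modulated = feats * (1.0 + gate)                    # enhance/suppress channels
    # Threshold out weak activations before the next iteration.
    updated = np.where(modulated > thresh, modulated, 0.0)
    return updated, summary
```

Per the abstract, several such iterations would be run and the video classified from the resulting multiple summaries.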
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Zhu, Chen | - |
dc.contributor.author | Tan, Xiao | - |
dc.contributor.author | Zhou, Feng | - |
dc.contributor.author | Liu, Xiao | - |
dc.contributor.author | Yue, Kaiyu | - |
dc.contributor.author | Ding, Errui | - |
dc.contributor.author | Ma, Yi | - |
dc.date.accessioned | 2023-03-31T05:29:44Z | - |
dc.date.available | 2023-03-31T05:29:44Z | - |
dc.date.issued | 2018 | - |
dc.identifier.citation | Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2018, v. 11209 LNCS, p. 139-155 | - |
dc.identifier.issn | 0302-9743 | - |
dc.identifier.uri | http://hdl.handle.net/10722/327209 | - |
dc.description.abstract | For fine-grained categorization tasks, videos can serve as a better source than static images, as they are more likely to contain discriminative patterns. Nevertheless, a video sequence may also contain many redundant and irrelevant frames, and locating the critical information of interest is a challenging task. In this paper, we propose a new network structure, known as Redundancy Reduction Attention (RRA), which learns to focus on multiple discriminative patterns by suppressing redundant feature channels. Specifically, it first summarizes the video by weighted-summing all feature vectors in the feature maps of selected frames with a spatio-temporal soft attention, and then predicts which channels to suppress or enhance according to this summary with a learned non-linear transform. Suppression is achieved by modulating the feature maps and thresholding out weak activations. The updated feature maps are then used in the next iteration. Finally, the video is classified based on the multiple summaries. The proposed method achieves outstanding performance on multiple video classification datasets. Furthermore, we have collected two large-scale video datasets, YouTube-Birds and YouTube-Cars, for future research on fine-grained video categorization. The datasets are available at http://www.cs.umd.edu/~chenzhu/fgvc. | -
dc.language | eng | - |
dc.relation.ispartof | Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) | - |
dc.subject | Attention mechanism | - |
dc.subject | Fine-grained video categorization | - |
dc.title | Fine-Grained Video Categorization with Redundancy Reduction Attention | - |
dc.type | Conference_Paper | - |
dc.description.nature | link_to_subscribed_fulltext | - |
dc.identifier.doi | 10.1007/978-3-030-01228-1_9 | - |
dc.identifier.scopus | eid_2-s2.0-85055104808 | - |
dc.identifier.volume | 11209 LNCS | - |
dc.identifier.spage | 139 | - |
dc.identifier.epage | 155 | - |
dc.identifier.eissn | 1611-3349 | - |
dc.identifier.isi | WOS:000594216400009 | - |