Article: Multi-level feature fusion based Locality-Constrained Spatial Transformer network for video crowd counting

Title: Multi-level feature fusion based Locality-Constrained Spatial Transformer network for video crowd counting
Authors: Fang, Yanyan; Gao, Shenghua; Li, Jing; Luo, Weixin; He, Linfang; Hu, Bo
Keywords: Convolutional neural network; Locality-constrained spatial transformer network; Multi-level feature fusion; Video crowd counting
Issue Date: 2020
Citation: Neurocomputing, 2020, v. 392, p. 98-107
Abstract: Video-based crowd counting can leverage the spatio-temporal information between neighboring frames, which improves the robustness of crowd counting and makes it more practical than single-image crowd counting in real applications. It is nevertheless a very challenging task, because severe occlusions and the translation, rotation, and scaling of persons change the head density map between neighboring frames. To alleviate these issues, a Multi-Level Feature Fusion Based Locality-Constrained Spatial Transformer Network (MLSTN) is proposed, which consists of two components: a density map regression module and a Locality-Constrained Spatial Transformer (LST) module. Specifically, we first estimate the density map of each frame by combining the low-level, middle-level, and high-level features of a convolutional neural network, because low-level features are more effective for extracting small heads, while middle- and high-level features are more effective for medium and large heads. Then, to measure the relationship between the density maps of neighboring frames, the LST module is proposed, which estimates the density map of the next frame by concatenating several regressed density maps. To facilitate performance evaluation for video crowd counting, we have collected and labeled a large-scale video crowd counting dataset that includes 100 five-second sequences with 394,081 annotated heads from 13 different scenes; as far as we know, it is the largest video crowd counting dataset. Extensive experiments show the effectiveness of the proposed approach on our dataset and other video-based crowd counting datasets. Our dataset is released online.
Persistent Identifier: http://hdl.handle.net/10722/345110
ISSN: 0925-2312
2023 Impact Factor: 5.5
2023 SCImago Journal Rankings: 1.815
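
As a rough illustration of the multi-level feature fusion described in the abstract, the sketch below regresses a crowd density map from the low-, middle-, and high-level features of a small convolutional network in PyTorch. Everything here (the backbone, the channel widths, and the name FusionDensityNet) is an assumption made for illustration; it is not the authors' MLSTN implementation, and the temporal LST module is omitted.

# Minimal sketch, NOT the authors' code: multi-level feature fusion for
# density map regression. Backbone and channel sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionDensityNet(nn.Module):
    """Fuses low-, mid-, and high-level CNN features into one density map."""

    def __init__(self):
        super().__init__()
        # Three hypothetical backbone stages (low / middle / high level).
        self.low = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.mid = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.high = nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        # A 1x1 convolution regresses the fused features to a 1-channel density map.
        self.regress = nn.Conv2d(16 + 32 + 64, 1, kernel_size=1)

    def forward(self, x):
        f_low = self.low(x)        # fine spatial detail, useful for small heads
        f_mid = self.mid(f_low)    # medium receptive field
        f_high = self.high(f_mid)  # wide receptive field, useful for large heads
        # Upsample the deeper features to the low-level resolution, then concatenate.
        size = f_low.shape[-2:]
        f_mid = F.interpolate(f_mid, size=size, mode="bilinear", align_corners=False)
        f_high = F.interpolate(f_high, size=size, mode="bilinear", align_corners=False)
        return self.regress(torch.cat([f_low, f_mid, f_high], dim=1))

if __name__ == "__main__":
    net = FusionDensityNet()
    frame = torch.randn(1, 3, 128, 128)         # one dummy video frame
    density = net(frame)
    print(density.shape, float(density.sum()))  # count estimate = sum of the map

The final count estimate is simply the sum over the predicted density map, which is the standard readout in density-map-based crowd counting.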

 

Dublin Core Record

dc.contributor.author: Fang, Yanyan
dc.contributor.author: Gao, Shenghua
dc.contributor.author: Li, Jing
dc.contributor.author: Luo, Weixin
dc.contributor.author: He, Linfang
dc.contributor.author: Hu, Bo
dc.date.accessioned: 2024-08-15T09:25:19Z
dc.date.available: 2024-08-15T09:25:19Z
dc.date.issued: 2020
dc.identifier.citation: Neurocomputing, 2020, v. 392, p. 98-107
dc.identifier.issn: 0925-2312
dc.identifier.uri: http://hdl.handle.net/10722/345110
dc.description.abstract: Video-based crowd counting can leverage the spatio-temporal information between neighboring frames, which improves the robustness of crowd counting and makes it more practical than single-image crowd counting in real applications. It is nevertheless a very challenging task, because severe occlusions and the translation, rotation, and scaling of persons change the head density map between neighboring frames. To alleviate these issues, a Multi-Level Feature Fusion Based Locality-Constrained Spatial Transformer Network (MLSTN) is proposed, which consists of two components: a density map regression module and a Locality-Constrained Spatial Transformer (LST) module. Specifically, we first estimate the density map of each frame by combining the low-level, middle-level, and high-level features of a convolutional neural network, because low-level features are more effective for extracting small heads, while middle- and high-level features are more effective for medium and large heads. Then, to measure the relationship between the density maps of neighboring frames, the LST module is proposed, which estimates the density map of the next frame by concatenating several regressed density maps. To facilitate performance evaluation for video crowd counting, we have collected and labeled a large-scale video crowd counting dataset that includes 100 five-second sequences with 394,081 annotated heads from 13 different scenes; as far as we know, it is the largest video crowd counting dataset. Extensive experiments show the effectiveness of the proposed approach on our dataset and other video-based crowd counting datasets. Our dataset is released online.
dc.language: eng
dc.relation.ispartof: Neurocomputing
dc.subject: Convolutional neural network
dc.subject: Locality-constrained spatial transformer network
dc.subject: Multi-level feature fusion
dc.subject: Video crowd counting
dc.title: Multi-level feature fusion based Locality-Constrained Spatial Transformer network for video crowd counting
dc.type: Article
dc.description.nature: link_to_subscribed_fulltext
dc.identifier.doi: 10.1016/j.neucom.2020.01.087
dc.identifier.scopus: eid_2-s2.0-85079630205
dc.identifier.volume: 392
dc.identifier.spage: 98
dc.identifier.epage: 107
dc.identifier.eissn: 1872-8286
