Postgraduate thesis: Strengthening cross-interaction learning for vision networks

Title: Strengthening cross-interaction learning for vision networks
Authors: Fang, Yanwen (方艷雯)
Advisors: Li, G
Issue Date: 2023
Publisher: The University of Hong Kong (Pokfulam, Hong Kong)
Citation: Fang, Y. [方艷雯]. (2023). Strengthening cross-interaction learning for vision networks. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
Abstract: In recent years, the field of computer vision has advanced remarkably, driven by the success of vision networks such as CNNs and vision Transformers. A vision network is generally designed to learn various interactions between objects for different tasks; for example, learning the temporal interaction between different time steps is key to modeling time-series data for prediction tasks. This thesis studies strengthening cross-interaction learning for vision networks in three aspects: cross-layer interaction in backbone models, intraperiod and intratrend temporal interactions in human motion, and person-person interaction in multi-person poses. To achieve these objectives, the thesis proposes three approaches, all of which enhance the representation power of the networks and deliver notable performance.

First, a new cross-layer attention mechanism, multi-head recurrent layer attention (MRLA), is proposed to strengthen layer-wise interactions by retrieving query-related information from previous layers. To reduce the quadratic computation cost inherited from vanilla attention, a lightweight version of MRLA with linear complexity is further proposed, making cross-layer attention feasible for deeper networks. MRLA is devised as a plug-and-play module compatible with two types of mainstream vision networks: CNNs and vision Transformers. The remarkable improvements brought by MRLA in image classification, object detection and instance segmentation on benchmark datasets demonstrate its effectiveness, showing that MRLA can enrich the representation power of many state-of-the-art vision networks by linking fine-grained features to global ones.

Second, the thesis explores intraperiod and intratrend interactions for human motion prediction. A new periodic-trend pose decomposition (PTPDecomp) block is proposed to decompose hidden pose sequences into period and trend components, so that the temporal dependencies within each can be modeled separately. The PTPDecomp block cooperates with spatial and temporal GCNs, leading to an encoder-decoder framework called Periodic-Trend Enhanced GCN (PTE-GCN). The encoder and decoder progressively eliminate or refine the long-term trend pattern and focus on modeling the period pattern, which facilitates learning the intricate temporal relationships entangled in pose sequences. Experimental results on three benchmark datasets demonstrate that PTE-GCN surpasses state-of-the-art methods in both short-term and long-term prediction, especially for periodic actions such as walking in long-term forecasting.

Lastly, the thesis studies the interactions between the motion trajectories of highly interactive persons in the task of multi-person extreme motion prediction. A novel cross-query attention (XQA) module is proposed to bilaterally learn the cross-dependencies between the two pose sequences. Additionally, a proxy unit is introduced to bridge the involved persons; it cooperates with the XQA module and subtly controls the bidirectional information flows. These designs are integrated into a Transformer-based architecture, yielding an end-to-end framework called the proxy-bridged game Transformer (PGformer) for multi-person motion prediction. Its effectiveness has been evaluated on the challenging ExPI dataset, where PGformer consistently outperforms state-of-the-art methods in both short-term and long-term prediction. Moreover, PGformer also adapts well to the weakly interacted CMU-Mocap and MuPoTS-3D datasets and achieves encouraging results.
Degree: Doctor of Philosophy
Subjects: Computer vision; Neural networks (Computer science)
Dept/Program: Statistics and Actuarial Science
Persistent Identifier: http://hdl.handle.net/10722/335946
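
To make the three modules concrete, the sketches below reconstruct them in PyTorch from the abstract's descriptions alone; they are not the thesis's released code, and every class name, tensor layout and hyperparameter is an illustrative assumption. The first sketch is a single-head stand-in for cross-layer attention in the spirit of MRLA: the current layer's features act as the query over a memory of all earlier layers' features.

    import torch
    import torch.nn as nn

    class CrossLayerAttention(nn.Module):
        """Single-head cross-layer attention: the current layer queries a
        memory holding one feature vector per previous layer."""

        def __init__(self, dim):
            super().__init__()
            self.to_q = nn.Linear(dim, dim)
            self.to_k = nn.Linear(dim, dim)
            self.to_v = nn.Linear(dim, dim)
            self.scale = dim ** -0.5

        def forward(self, x_t, memory):
            # x_t:    (batch, dim) feature of the current layer t
            # memory: list of (batch, dim) features from layers 1..t
            q = self.to_q(x_t)                                        # (B, D)
            ks = torch.stack([self.to_k(m) for m in memory], dim=1)   # (B, t, D)
            vs = torch.stack([self.to_v(m) for m in memory], dim=1)   # (B, t, D)
            attn = torch.softmax((q.unsqueeze(1) * ks).sum(-1) * self.scale, dim=1)  # (B, t)
            out = (attn.unsqueeze(-1) * vs).sum(dim=1)                # weighted sum over layers
            return x_t + out                                          # residual fusion

    # Toy usage: three "layers" sharing one attention module.
    mrla = CrossLayerAttention(dim=64)
    memory, x = [], torch.randn(2, 64)
    for _ in range(3):
        memory.append(x)
        x = mrla(x, memory)

Because the memory grows with depth, this naive version has the quadratic cost the abstract mentions; the lightweight MRLA presumably replaces the explicit memory with a recurrently updated summary to reach linear complexity, though that recurrence is not spelled out in the record.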
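The periodic-trend pose decomposition can be sketched, under the assumption that it behaves like a moving-average series decomposition, as a split of the hidden pose sequence into a smooth trend and a periodic residual. The kernel size and tensor layout below are placeholders.

    import torch
    import torch.nn as nn

    class PeriodTrendDecomp(nn.Module):
        """Splits a pose sequence into a slowly varying trend (moving
        average over time) and a periodic residual around it."""

        def __init__(self, kernel_size=5):
            super().__init__()
            assert kernel_size % 2 == 1, "odd kernel keeps the sequence length"
            self.pool = nn.AvgPool1d(kernel_size, stride=1,
                                     padding=kernel_size // 2,
                                     count_include_pad=False)

        def forward(self, x):
            # x: (batch, time, features), e.g. flattened joint coordinates per frame
            trend = self.pool(x.transpose(1, 2)).transpose(1, 2)  # smooth component
            period = x - trend                                    # oscillation around the trend
            return period, trend

    decomp = PeriodTrendDecomp(kernel_size=5)
    poses = torch.randn(2, 50, 66)            # 2 sequences, 50 frames, 22 joints x 3D
    period, trend = decomp(poses)

One reading of the encoder-decoder description is that the encoder repeatedly applies such a split so the trend is progressively stripped away, leaving the spatial and temporal GCNs to model the remaining periodic pattern.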
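Finally, the bilateral cross-query attention between two interacting persons can be approximated with two multi-head attention calls running in opposite directions; the proxy unit that mediates the two information flows in the thesis is omitted here for brevity, and the module and argument names are assumptions.

    import torch
    import torch.nn as nn

    class CrossQueryAttention(nn.Module):
        """Each person's pose-sequence embedding is refined by attending
        to the other person's sequence, in both directions."""

        def __init__(self, dim, heads=4):
            super().__init__()
            self.attn_ab = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.attn_ba = nn.MultiheadAttention(dim, heads, batch_first=True)

        def forward(self, a, b):
            # a, b: (batch, time, dim) embeddings of persons A and B
            a_out, _ = self.attn_ab(query=a, key=b, value=b)   # A queries B's motion
            b_out, _ = self.attn_ba(query=b, key=a, value=a)   # B queries A's motion
            return a + a_out, b + b_out                        # residual updates, both directions

    xqa = CrossQueryAttention(dim=64, heads=4)
    a = torch.randn(2, 30, 64)                # person A: 2 pairs, 30 frames
    b = torch.randn(2, 30, 64)                # person B
    a_new, b_new = xqa(a, b)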

 

Dublin Core record (DC Field: Value)
dc.contributor.advisor: Li, G
dc.contributor.author: Fang, Yanwen
dc.contributor.author: 方艷雯
dc.date.accessioned: 2023-12-29T04:05:04Z
dc.date.available: 2023-12-29T04:05:04Z
dc.date.issued: 2023
dc.identifier.citation: Fang, Y. [方艷雯]. (2023). Strengthening cross-interaction learning for vision networks. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
dc.identifier.uri: http://hdl.handle.net/10722/335946
dc.description.abstract: (see Abstract above)
dc.language: eng
dc.publisher: The University of Hong Kong (Pokfulam, Hong Kong)
dc.relation.ispartof: HKU Theses Online (HKUTO)
dc.rights: The author retains all proprietary rights (such as patent rights) and the right to use in future works.
dc.rights: This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
dc.subject.lcsh: Computer vision
dc.subject.lcsh: Neural networks (Computer science)
dc.title: Strengthening cross-interaction learning for vision networks
dc.type: PG_Thesis
dc.description.thesisname: Doctor of Philosophy
dc.description.thesislevel: Doctoral
dc.description.thesisdiscipline: Statistics and Actuarial Science
dc.description.nature: published_or_final_version
dc.date.hkucongregation: 2024
dc.identifier.mmsid: 991044751040203414
