Conference Paper: Leveraging Vision-Centric Multi-Modal Expertise for 3D Object Detection
Title | Leveraging Vision-Centric Multi-Modal Expertise for 3D Object Detection |
---|---|
Authors | Huang, Linyan; Li, Zhiqi; Sima, Chonghao; Wang, Wenhai; Wang, Jingdong; Qiao, Yu; Li, Hongyang |
Issue Date | 2023 |
Citation | Advances in Neural Information Processing Systems, 2023, v. 36 |
Abstract | Current research is primarily dedicated to advancing the accuracy of camera-only 3D object detectors (apprentice) through the knowledge transferred from LiDAR- or multi-modal-based counterparts (expert). However, the domain gap between LiDAR and camera features, coupled with the inherent incompatibility in temporal fusion, significantly hinders the effectiveness of distillation-based enhancements for apprentices. Motivated by the success of uni-modal distillation, an apprentice-friendly expert model would predominantly rely on camera features while still achieving performance comparable to multi-modal models. To this end, we introduce VCD, a framework to improve the camera-only apprentice model, including an apprentice-friendly multi-modal expert and temporal-fusion-friendly distillation supervision. The multi-modal expert VCD-E adopts a structure identical to that of the camera-only apprentice to alleviate the feature disparity, and leverages LiDAR input as a depth prior to reconstruct the 3D scene, achieving performance on par with other heterogeneous multi-modal experts. Additionally, a fine-grained trajectory-based distillation module is introduced to individually rectify the motion misalignment for each object in the scene. With these improvements, our camera-only apprentice VCD-A sets a new state of the art on nuScenes with a score of 63.1% NDS. The code will be released at https://github.com/OpenDriveLab/Birds-eye-view-Perception. |
Persistent Identifier | http://hdl.handle.net/10722/351498 |
ISSN | 1049-5258 |
SCImago Journal Rankings (2020) | 1.399 |
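
The abstract's key design choice is that the expert VCD-E shares the camera-only apprentice's structure, so feature-level distillation reduces to matching same-shaped bird's-eye-view (BEV) feature maps rather than bridging heterogeneous LiDAR and camera representations. Below is a minimal PyTorch sketch of such a loss; it is an illustration of that idea, not the authors' released implementation (see the repository linked above), and the names `distill_bev_features`, `fg_mask`, and the tensor shapes are assumptions.

```python
# Minimal sketch of expert-to-apprentice BEV feature distillation.
# All names and shapes are illustrative; the actual VCD code lives at
# https://github.com/OpenDriveLab/Birds-eye-view-Perception.
import torch
import torch.nn.functional as F

def distill_bev_features(apprentice_bev, expert_bev, fg_mask=None):
    """Align apprentice BEV features with the (frozen) expert's.

    apprentice_bev, expert_bev: (B, C, H, W) BEV feature maps, identical
        shapes because the expert shares the apprentice's architecture.
    fg_mask: optional (B, 1, H, W) foreground weighting, e.g. derived
        from ground-truth boxes, to emphasize object regions.
    """
    # Per-element error; the expert is detached so only the apprentice learns.
    diff = F.mse_loss(apprentice_bev, expert_bev.detach(), reduction="none")
    if fg_mask is not None:
        diff = diff * fg_mask
    return diff.mean()

# Toy usage: same shapes on both sides, gradient flows to the apprentice only.
a = torch.randn(2, 64, 128, 128, requires_grad=True)
e = torch.randn(2, 64, 128, 128)
distill_bev_features(a, e).backward()
```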
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Huang, Linyan | - |
dc.contributor.author | Li, Zhiqi | - |
dc.contributor.author | Sima, Chonghao | - |
dc.contributor.author | Wang, Wenhai | - |
dc.contributor.author | Wang, Jingdong | - |
dc.contributor.author | Qiao, Yu | - |
dc.contributor.author | Li, Hongyang | - |
dc.date.accessioned | 2024-11-20T03:56:44Z | - |
dc.date.available | 2024-11-20T03:56:44Z | - |
dc.date.issued | 2023 | - |
dc.identifier.citation | Advances in Neural Information Processing Systems, 2023, v. 36 | - |
dc.identifier.issn | 1049-5258 | - |
dc.identifier.uri | http://hdl.handle.net/10722/351498 | - |
dc.description.abstract | Current research is primarily dedicated to advancing the accuracy of camera-only 3D object detectors (apprentice) through the knowledge transferred from LiDAR- or multi-modal-based counterparts (expert). However, the domain gap between LiDAR and camera features, coupled with the inherent incompatibility in temporal fusion, significantly hinders the effectiveness of distillation-based enhancements for apprentices. Motivated by the success of uni-modal distillation, an apprentice-friendly expert model would predominantly rely on camera features while still achieving performance comparable to multi-modal models. To this end, we introduce VCD, a framework to improve the camera-only apprentice model, including an apprentice-friendly multi-modal expert and temporal-fusion-friendly distillation supervision. The multi-modal expert VCD-E adopts a structure identical to that of the camera-only apprentice to alleviate the feature disparity, and leverages LiDAR input as a depth prior to reconstruct the 3D scene, achieving performance on par with other heterogeneous multi-modal experts. Additionally, a fine-grained trajectory-based distillation module is introduced to individually rectify the motion misalignment for each object in the scene. With these improvements, our camera-only apprentice VCD-A sets a new state of the art on nuScenes with a score of 63.1% NDS. The code will be released at https://github.com/OpenDriveLab/Birds-eye-view-Perception. | -
dc.language | eng | - |
dc.relation.ispartof | Advances in Neural Information Processing Systems | - |
dc.title | Leveraging Vision-Centric Multi-Modal Expertise for 3D Object Detection | - |
dc.type | Conference_Paper | - |
dc.description.nature | link_to_subscribed_fulltext | - |
dc.identifier.scopus | eid_2-s2.0-85190517074 | - |
dc.identifier.volume | 36 | - |
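
The record above also mentions a fine-grained trajectory-based distillation module that rectifies motion misalignment per object, the paper's answer to the incompatibility of naive distillation with temporal fusion. One plausible reading is to sample temporal BEV features along each object's trajectory in both the expert and the apprentice and match them object by object; the sketch below illustrates that reading under assumed names, shapes, and BEV range, and is not taken from the released code.

```python
# Hypothetical sketch of per-object, trajectory-aligned distillation.
# Everything below (function names, shapes, the 51.2 m BEV range) is an
# assumption for illustration; see the VCD repository for the real module.
import torch
import torch.nn.functional as F

def sample_along_trajectory(bev, traj_xy, bev_range=51.2):
    """Bilinearly sample BEV features at each object's trajectory points.

    bev: (B, C, H, W) temporal BEV features.
    traj_xy: (B, N, T, 2) object centers in meters over T past frames,
        expressed in the current ego frame.
    Returns (B, N, T, C) per-object, per-timestep features.
    """
    grid = traj_xy / bev_range                              # normalize to [-1, 1]
    feats = F.grid_sample(bev, grid, align_corners=False)   # (B, C, N, T)
    return feats.permute(0, 2, 3, 1)

def trajectory_distill_loss(app_bev, exp_bev, traj_xy):
    """Match apprentice to expert features along each object's motion path,
    so misalignment is corrected per object rather than over the whole map."""
    f_app = sample_along_trajectory(app_bev, traj_xy)
    f_exp = sample_along_trajectory(exp_bev, traj_xy).detach()
    return F.mse_loss(f_app, f_exp)

# Toy usage: 5 objects tracked over 4 frames within a 51.2 m square BEV.
app = torch.randn(1, 64, 128, 128, requires_grad=True)
exp = torch.randn(1, 64, 128, 128)
traj = (torch.rand(1, 5, 4, 2) - 0.5) * 2 * 51.2
trajectory_distill_loss(app, exp, traj).backward()
```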