Article: PonderV2: Improved 3D Representation with A Universal Pre-training Paradigm

Title: PonderV2: Improved 3D Representation with A Universal Pre-training Paradigm
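Authors: Zhu, Haoyi; Yang, Honghui; Wu, Xiaoyang; Di, Huang; Sha, Zhang; He, Xianglong; Zhao, Hengshuang; Shen, Chunhua; Yu, Qiao; He, Tong; Wanli, Ouyang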
Keywords: 3D pre-training; 3D vision; foundation model; LiDAR; multi-view image; neural rendering; point cloud; RGB-D image
Issue Date: 18-Apr-2025
Publisher: Institute of Electrical and Electronics Engineers
Citation: IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025, v. 47, n. 8, p. 6550-6565
Abstract

In contrast to numerous NLP and 2D vision foundational models, training a 3D foundational model poses considerably greater challenges. This is primarily due to the inherent data variability and diversity of downstream tasks. In this paper, we introduce a novel universal 3D pre-training framework designed to facilitate the acquisition of efficient 3D representations. Considering that informative 3D features should encode rich geometry and appearance cues that can be utilized to render realistic images, we propose to learn 3D representations by differentiable neural rendering. We train a 3D backbone with a volumetric neural renderer by comparing the rendered with the real images. Notably, our pre-trained encoder can be seamlessly applied to various downstream tasks. These tasks include semantic challenges like 3D detection and segmentation, which involve scene understanding, and non-semantic tasks like 3D reconstruction and image synthesis, which focus on geometry and visuals. They span both indoor and outdoor scenarios. We also illustrate the capability of pre-training a 2D backbone using the proposed methodology, surpassing conventional pre-training methods by a large margin. For the first time, PonderV2 achieves state-of-the-art performance on 11 indoor and outdoor benchmarks, implying its effectiveness.


Persistent Identifier: http://hdl.handle.net/10722/362091
ISSN: 0162-8828
2023 Impact Factor: 20.8
2023 SCImago Journal Rankings: 6.158

 

DC Field | Value | Language
dc.contributor.author | Zhu, Haoyi | -
dc.contributor.author | Yang, Honghui | -
dc.contributor.author | Wu, Xiaoyang | -
dc.contributor.author | Di, Huang | -
dc.contributor.author | Sha, Zhang | -
dc.contributor.author | He, Xianglong | -
dc.contributor.author | Zhao, Hengshuang | -
dc.contributor.author | Shen, Chunhua | -
dc.contributor.author | Yu, Qiao | -
dc.contributor.author | He, Tong | -
dc.contributor.author | Wanli, Ouyang. | -
dc.date.accessioned | 2025-09-19T00:31:49Z | -
dc.date.available | 2025-09-19T00:31:49Z | -
dc.date.issued | 2025-04-18 | -
dc.identifier.citation | IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025, v. 47, n. 8, p. 6550-6565 | -
dc.identifier.issn | 0162-8828 | -
dc.identifier.uri | http://hdl.handle.net/10722/362091 | -
dc.description.abstract | In contrast to numerous NLP and 2D vision foundational models, training a 3D foundational model poses considerably greater challenges. This is primarily due to the inherent data variability and diversity of downstream tasks. In this paper, we introduce a novel universal 3D pre-training framework designed to facilitate the acquisition of efficient 3D representations. Considering that informative 3D features should encode rich geometry and appearance cues that can be utilized to render realistic images, we propose to learn 3D representations by differentiable neural rendering. We train a 3D backbone with a volumetric neural renderer by comparing the rendered with the real images. Notably, our pre-trained encoder can be seamlessly applied to various downstream tasks. These tasks include semantic challenges like 3D detection and segmentation, which involve scene understanding, and non-semantic tasks like 3D reconstruction and image synthesis, which focus on geometry and visuals. They span both indoor and outdoor scenarios. We also illustrate the capability of pre-training a 2D backbone using the proposed methodology, surpassing conventional pre-training methods by a large margin. For the first time, PonderV2 achieves state-of-the-art performance on 11 indoor and outdoor benchmarks, implying its effectiveness. | -
dc.language | eng | -
dc.publisher | Institute of Electrical and Electronics Engineers | -
dc.relation.ispartof | IEEE Transactions on Pattern Analysis and Machine Intelligence | -
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | -
dc.subject | 3D pre-training | -
dc.subject | 3D vision | -
dc.subject | foundation model | -
dc.subject | LiDAR | -
dc.subject | multi-view image | -
dc.subject | neural rendering | -
dc.subject | point cloud | -
dc.subject | RGB-D image | -
dc.title | PonderV2: Improved 3D Representation with A Universal Pre-training Paradigm | -
dc.type | Article | -
dc.identifier.doi | 10.1109/TPAMI.2025.3561598 | -
dc.identifier.scopus | eid_2-s2.0-105002839761 | -
dc.identifier.volume | 47 | -
dc.identifier.issue | 8 | -
dc.identifier.spage | 6550 | -
dc.identifier.epage | 6565 | -
dc.identifier.eissn | 1939-3539 | -
dc.identifier.issnl | 0162-8828 | -
