Fusing differentiable rendering and language–image contrastive learning for superior zero-shot point cloud classification
Xie, Jinlong; Cheng, Long; Wang, Gang; Hu, Min; Yu, Zaiyang; Du, Minghua; Ning, Xin Source: Displays, v 84, September 2024; ISSN: 01419382; DOI: 10.1016/j.displa.2024.102773; Article number: 102773; Publisher: Elsevier B.V.
Author affiliation:
School of Integrated Circuits, University of Chinese Academy of Sciences, Beijing; 100029, China
Chinese Academy of Sciences, Institute of Semiconductors, Annlab, Beijing; 100083, China
School of Control and Computer Engineering, North China Electric Power University, Beijing; 102206, China
School of Computing and Data Engineering, NingboTech University, Ningbo; 315100, China
Department of Bioengineering, Imperial College London, London; SW7 2AZ, United Kingdom
Department of Emergency, the First Medical Center, Chinese PLA General Hospital, Beijing; 100853, China
Beijing Ratu Technology Co., Ltd, Beijing; 100096, China
Abstract:
Zero-shot point cloud classification involves recognizing categories not encountered during training. Without 3D pre-training, current models often exhibit reduced accuracy on unseen categories, underscoring the need for improved precision and interoperability. We propose a novel approach that integrates differentiable rendering with contrastive language–image pre-training. First, differentiable rendering autonomously learns representative viewpoints from the data, enabling point clouds to be transformed into multi-view images while preserving key visual information; this allows viewpoint selection to be optimized during training and refines the final feature representation. Features are extracted from the multi-view images and fused into a global multi-view feature using a cross-attention mechanism. On the textual side, a large language model (LLM) is given 3D heuristic prompts to generate 3D-specific text reflecting category-specific traits, from which textual features are derived; the LLM's extensive pre-trained knowledge allows it to capture abstract notions and categorical characteristics of distinct point cloud categories. Visual and textual features are aligned in a unified embedding space, enabling zero-shot classification. During training, the Structural Similarity Index (SSIM) is incorporated into the loss function to encourage the model to discern more distinctive viewpoints, reduce redundancy among the multi-view images, and improve computational efficiency. Experimental results on the ModelNet10, ModelNet40, and ScanObjectNN datasets show classification accuracies of 75.68%, 66.42%, and 52.03%, respectively, surpassing prevailing zero-shot point cloud classification methods.
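Illustrative sketch (not the authors' released code): the snippet below shows, under stated assumptions, how the visual-textual alignment step described in the abstract could look. Per-view image features are fused into one global feature by a cross-attention block with a learnable query, and zero-shot logits are computed as temperature-scaled cosine similarities against per-category text features derived from LLM-generated prompts. The names CrossAttentionFusion and zero_shot_logits, the feature dimension, and the learnable-query design are hypothetical; the differentiable renderer, CLIP encoders, and SSIM loss term are abstracted away as placeholder tensors.

    # Hypothetical sketch of the multi-view fusion and zero-shot matching steps;
    # not the paper's implementation. Feature extraction is replaced by random tensors.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CrossAttentionFusion(nn.Module):
        """Fuse V per-view features into one global feature with cross-attention.

        A learnable query token attends over the view features; this is one common
        way to realize the cross-attention fusion described in the abstract.
        """
        def __init__(self, dim: int = 512, num_heads: int = 8):
            super().__init__()
            self.query = nn.Parameter(torch.randn(1, 1, dim))  # global query token
            self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.norm = nn.LayerNorm(dim)

        def forward(self, view_feats: torch.Tensor) -> torch.Tensor:
            # view_feats: (B, V, dim) -- one feature per rendered view
            q = self.query.expand(view_feats.size(0), -1, -1)  # (B, 1, dim)
            fused, _ = self.attn(q, view_feats, view_feats)    # (B, 1, dim)
            return self.norm(fused.squeeze(1))                 # (B, dim)

    def zero_shot_logits(global_feat: torch.Tensor,
                         text_feats: torch.Tensor,
                         temperature: float = 0.01) -> torch.Tensor:
        """Cosine-similarity logits between the fused visual feature and the
        per-category text features obtained from LLM-generated 3D prompts."""
        img = F.normalize(global_feat, dim=-1)                 # (B, dim)
        txt = F.normalize(text_feats, dim=-1)                  # (C, dim)
        return img @ txt.t() / temperature                     # (B, C)

    if __name__ == "__main__":
        B, V, C, dim = 2, 6, 10, 512           # batch, views, categories, feature dim
        view_feats = torch.randn(B, V, dim)    # placeholder per-view image features
        text_feats = torch.randn(C, dim)       # placeholder per-category text features
        fusion = CrossAttentionFusion(dim)
        logits = zero_shot_logits(fusion(view_feats), text_feats)
        print(logits.argmax(dim=-1))           # predicted category index per sample

At inference, the category whose text feature yields the highest similarity is taken as the zero-shot prediction; in the paper's training setup, an SSIM-based term is additionally added to the loss to discourage redundant viewpoints, which is omitted here.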