Thomas P O'Connell, Tyler Bonnen, Yoni Friedman, Ayush Tewari, Vincent Sitzmann, Joshua B Tenenbaum, Nancy Kanwisher
Approximating Human-Level 3D Visual Inferences With Deep Neural Networks
Open Mind, volume 9, pages 305-324. Published 2025-02-16.
DOI: 10.1162/opmi_a_00189 (https://doi.org/10.1162/opmi_a_00189)
PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11864798/pdf/
Abstract
Humans make rich inferences about the geometry of the visual world. While deep neural networks (DNNs) achieve human-level performance on some psychophysical tasks (e.g., rapid classification of object or scene categories), they often fail in tasks requiring inferences about the underlying shape of objects or scenes. Here, we ask whether and how this gap in 3D shape representation between DNNs and humans can be closed. First, we define the problem space: after generating a stimulus set to evaluate 3D shape inferences using a match-to-sample task, we confirm that standard DNNs are unable to reach human performance. Next, we construct a set of candidate 3D-aware DNNs including a 3D neural field (Light Field Network), autoencoder, and convolutional architectures. We investigate the role of the learning objective and dataset by training single-view (the model sees only one viewpoint of an object per training trial) and multi-view (the model is trained to associate multiple viewpoints of each object per training trial) versions of each architecture. When the same object categories appear in the model training and match-to-sample test sets, multi-view DNNs approach human-level performance for 3D shape matching, highlighting the importance of a learning objective that enforces a common representation across viewpoints of the same object. Furthermore, the 3D Light Field Network was the model most similar to humans across all tests, suggesting that building in 3D inductive biases increases human-model alignment. Finally, we explore the generalization performance of multi-view DNNs to out-of-distribution object categories not seen during training. Overall, our work shows that multi-view learning objectives for DNNs are necessary but not sufficient for making 3D shape inferences similar to those of humans, and it reveals limitations in capturing human-like shape inferences that may be inherent to DNN modeling approaches. We provide a methodology for understanding human 3D shape perception within a deep learning framework and highlight out-of-domain generalization as the next challenge for learning human-like 3D representations with DNNs.
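To make the match-to-sample evaluation concrete, the following is a minimal sketch of how a model could be scored on such a task. It is not the authors' code: the abstract does not specify the readout, so the cosine-similarity decision rule, the function names (`cosine_similarity`, `match_to_sample_choice`, `accuracy`), and the toy embeddings are all assumptions made for illustration. On each trial the model "chooses" the candidate image whose embedding is most similar to the sample's embedding, and accuracy is the fraction of trials on which that choice is the same-shape match.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def match_to_sample_choice(sample, candidates):
    """Index of the candidate embedding most similar to the sample.

    Assumed decision rule: pick the highest cosine similarity.
    """
    sims = [cosine_similarity(sample, c) for c in candidates]
    return max(range(len(candidates)), key=sims.__getitem__)

def accuracy(trials):
    """Fraction of trials where the chosen candidate is the true match.

    Each trial is (sample_embedding, candidate_embeddings, correct_index).
    """
    correct = sum(
        match_to_sample_choice(sample, candidates) == target
        for sample, candidates, target in trials
    )
    return correct / len(trials)

# Hypothetical toy embeddings; in practice these would come from a DNN
# applied to the sample image and to each candidate image.
trials = [
    ([1.0, 0.0, 0.2], [[0.9, 0.1, 0.3], [0.0, 1.0, 0.0]], 0),
    ([0.0, 1.0, 0.5], [[1.0, 0.0, 0.0], [0.1, 0.8, 0.6]], 1),
    # The lure (index 1) is closer in embedding space than the true match.
    ([1.0, 0.0, 0.0], [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0]], 0),
]

model_accuracy = accuracy(trials)
print(f"match-to-sample accuracy: {model_accuracy:.3f}")
```

Under this rule, a model whose embeddings change with viewpoint more than with shape will tend to pick the lure, as in the third toy trial. A multi-view objective that pulls together embeddings of different viewpoints of the same object is aimed at this failure.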