From Pixels to Prepositions: Linking Visual Perception with Spatial Prepositions Far and Near

Cognitive Computation (IF 4.3; Zone 3, Computer Science; JCR Q2, Computer Science, Artificial Intelligence). Pub Date: 2024-08-20. DOI: 10.1007/s12559-024-10329-6

Krishna Raj S R, Srinivasa Chakravarthy V, Anindita Sahoo



Abstract

Human language is influenced by sensory-motor experiences. Sensory experiences gathered in a spatiotemporal world are used as raw material to create more abstract concepts. In language, one way to encode spatial relationships is through spatial prepositions. Spatial prepositions that specify the proximity of objects in space, like far and near or their variants, are found in most languages. A mechanism by which an organism determines how close another entity is to itself is a useful evolutionary trait. From taxis in unicellular organisms such as bacteria to tropism in plants, such behavior is found in almost all organisms. In humans, vision plays a critical role in spatial localization and navigation. This computational study analyzes the relationship between vision and spatial prepositions using an artificial neural network. For this study, a synthetic image dataset was created, with each image featuring a 2D projection of an object placed in 3D space. The objects can be of various shapes, sizes, and colors. A convolutional neural network is trained to classify the object in each image as far or near based on a set threshold. The study mainly explores two visual scenarios: objects confined to a plane (grounded) and objects not confined to a plane (ungrounded), while also analyzing the influence of camera placement. Classification performance is high for the grounded case, demonstrating that far/near classification is a well-defined problem for grounded objects, provided the camera is at a sufficient height. The network's performance shows that, in grounded cases, depth can be determined with high accuracy from monocular cues alone, again provided the camera is at an adequate height. The difference in the network’s performance between grounded and ungrounded cases can be explained using the physical properties of the retinal imaging system.
Determining the distance of an object from an individual image in the dataset is challenging because the images lack any background cues. Still, the network's performance shows that the spatial constraints placed on the image generation process influence how well depth can be determined. The results show that monocular cues contribute significantly to depth perception when all the objects are confined to a single plane. These results were obtained from a set of sensory inputs (images) and a specific task (far/near classification). The visual task, along with reaching and motion, may enable humans to carve space into various spatial prepositional categories like far and near. The network’s performance and how it learns to classify between far and near provided insights into certain visual illusions that involve size constancy.
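
The abstract's point about grounded versus ungrounded objects can be illustrated with a simple pinhole-camera sketch. This is not the authors' dataset generator or network; it is a minimal illustration under assumed geometry. The names `Camera`, `project_base`, `depth_from_offset`, and `label`, along with the focal length, camera height, and threshold values, are all hypothetical. The sketch shows that for a point on the ground plane, the vertical image offset below the horizon fixes the depth, whereas a point allowed to float above the ground can produce the same offset at a different depth.

```python
from dataclasses import dataclass


@dataclass
class Camera:
    focal_length: float  # assumed focal length, arbitrary units
    height: float        # assumed camera height above the ground plane


def project_base(cam: Camera, depth: float, base_height: float = 0.0) -> float:
    """Vertical image offset (below the horizon) of a point at `depth`
    whose height above the ground is `base_height` (0.0 means grounded)."""
    if depth <= 0:
        raise ValueError("depth must be positive")
    return cam.focal_length * (cam.height - base_height) / depth


def depth_from_offset(cam: Camera, offset: float) -> float:
    """Recover depth from the image offset, valid only if the point is grounded."""
    if offset <= 0:
        raise ValueError("offset must be positive (camera must be above the ground)")
    return cam.focal_length * cam.height / offset


def label(depth: float, threshold: float) -> str:
    """Far/near label from a fixed depth threshold."""
    return "far" if depth > threshold else "near"


if __name__ == "__main__":
    cam = Camera(focal_length=1.0, height=2.0)
    threshold = 3.0  # hypothetical far/near boundary

    grounded = project_base(cam, depth=4.0)                      # on the ground
    floating = project_base(cam, depth=2.0, base_height=1.0)     # above the ground

    # Same image offset, different true depths and different labels.
    print(grounded, label(4.0, threshold))
    print(floating, label(2.0, threshold))

    # Assuming groundedness, the offset alone gives back the depth.
    print(depth_from_offset(cam, grounded))
```

In this sketch, setting `height` to zero makes the offset zero at every depth, so the cue disappears, which is consistent with the abstract's condition that the camera be at a sufficient height.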




Source journal

Cognitive Computation (Computer Science, Artificial Intelligence; Neurosciences)

CiteScore: 9.30
Self-citation rate: 3.70%
Articles published: 116
Review time: >12 weeks

Journal description: Cognitive Computation is an international, peer-reviewed, interdisciplinary journal that publishes cutting-edge articles describing original basic and applied work involving biologically-inspired computational accounts of all aspects of natural and artificial cognitive systems. It provides a new platform for the dissemination of research, current practices and future trends in the emerging discipline of cognitive computation that bridges the gap between life sciences, social sciences, engineering, physical and mathematical sciences, and humanities.