Pub Date: 2026-03-11 · DOI: 10.1007/s11263-026-02767-6
Arjun Somayazulu, Sagnik Majumder, Changan Chen, Ziad Al-Halah, Kristen Grauman
An environment acoustic model represents how sound is transformed by the physical characteristics of an indoor environment, for any given source/receiver location. Whereas traditional methods for constructing such models assume dense geometry and/or sound measurements throughout the environment, we explore how to infer room impulse responses (RIRs) from a sparse set of images and echoes observed in the space, as well as how to choose where to collect these audio-visual observations. Towards that goal, we first introduce a transformer-based method that uses self-attention to build a rich acoustic context, then infers the RIRs of arbitrary query source-receiver locations through cross-attention. Then, motivated by real-world physical constraints on collecting these observations, we further introduce active acoustic sampling, a new task in which a mobile agent jointly constructs the environment acoustic model and spatial occupancy map on the fly from sparse audio-visual observations. We train a reinforcement learning (RL) policy that guides agent navigation toward optimal acoustic data sampling positions, rewarding information gain for the full environment model. Evaluating on diverse unseen 3D indoor environments, our method outperforms the state of the art and, in a major departure from traditional methods, generalizes to novel environments in a few-shot manner. Furthermore, when augmented with our active sampling policy, it successfully guides an embodied agent to acoustically informative positions under real-world exploration constraints, outperforming both traditional navigation agents and prior acoustic rendering methods. Project: http://vision.cs.utexas.edu/projects/fewShot-RIR
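The core inference step this abstract describes, querying an attention-built context of audio-visual observations with an unseen source-receiver location, can be sketched as plain scaled dot-product cross-attention. This is a minimal, hypothetical illustration with toy 2-dimensional embeddings, not the authors' actual architecture; all names and values here are assumptions.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cross_attention(query, keys, values):
    """Single-head scaled dot-product attention: a query embedding (here,
    standing in for a source-receiver location) attends over the observation
    context (keys/values) and returns a weighted sum of the value vectors."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]

# Toy context: embeddings of three audio-visual observations (hypothetical values)
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[0.2, 0.8], [0.9, 0.1], [0.5, 0.5]]
query = [1.0, 0.0]  # embedding of an unseen source-receiver pair
rir_feature = cross_attention(query, keys, values)
```

Because the output is a convex combination of the value vectors, each component of `rir_feature` lies within the range spanned by the corresponding value components.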
"Sample-efficient Audio-Visual Learning of Scene Acoustics," International Journal of Computer Vision.
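The active-sampling idea in the first abstract, rewarding information gain for the full environment model, can be caricatured as a greedy policy that picks the reachable position whose observation is predicted to most reduce whole-environment error. This is a hedged toy sketch, not the paper's RL policy; the position names and error estimates are invented for illustration.

```python
def greedy_sampling_policy(candidates, predicted_error, current_error):
    """Pick the candidate position with the largest predicted information
    gain, i.e., the largest expected drop in model error after sampling there."""
    gains = {p: current_error - predicted_error[p] for p in candidates}
    return max(gains, key=gains.get)

# Toy example: three reachable positions (hypothetical error estimates)
candidates = ["doorway", "room_center", "corner"]
predicted_error = {"doorway": 0.30, "room_center": 0.18, "corner": 0.27}
best = greedy_sampling_policy(candidates, predicted_error, 0.35)
```

An RL policy would learn such a choice from reward rather than from oracle error estimates, but the reward it maximizes has this same shape: error before minus error after the new observation.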
Pub Date: 2026-03-07 · DOI: 10.1007/s11263-026-02731-4
Jiongshu Wang, Jing Yang, Jiankang Deng, Hatice Gunes, Siyang Song
Existing Graph Neural Networks (GNNs) can only process graphs in which each vertex is represented by a vector or a single value, which limits their capacity to describe complex objects. In this paper, we propose a novel GNN, the Graph in Graph Neural (GIG) Network, which can process graph-style data (GIG samples) whose vertices are themselves represented by graphs. Given a set of graphs, or a data sample whose components can be represented by a set of graphs (a multi-graph data sample), our GIG network starts with a GIG sample generation (GSG) module that encodes the input as a GIG sample, where each GIG vertex contains a graph. A stack of GIG hidden layers follows, each consisting of: (1) a GIG vertex-level updating (GVU) module that updates the graph in every GIG vertex individually, based on its internal information; and (2) a global-level GIG sample updating (GGU) module that updates the graphs in all GIG vertices based on their relationships, making the updated GIG vertices global-context-aware. In this way, both the internal cues within the graph contained in each GIG vertex and the relationships among GIG vertices can be exploited for downstream tasks. Experimental results demonstrate that our GIG network generalizes well not only to various generic graph analysis tasks but also to real-world multi-graph data analysis (e.g., human skeleton video-based action recognition), achieving new state-of-the-art results on 15 of 16 evaluated datasets. Our code is publicly available at https://github.com/wangjs96/Graph-in-Graph-Neural-Network
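The two-stage layer this abstract describes, a per-vertex inner-graph update (GVU) followed by a global cross-vertex update (GGU), can be sketched with scalar node features and simple averaging. This is a minimal toy sketch under strong simplifying assumptions (mean aggregation, scalar features), not the paper's actual GVU/GGU operators.

```python
def gvu(node_feats, adj):
    """GIG vertex-level update (GVU, simplified): each node of the inner
    graph averages its own scalar feature with those of its neighbors."""
    n = len(node_feats)
    out = []
    for i in range(n):
        nbrs = [j for j in range(n) if adj[i][j]] + [i]
        out.append(sum(node_feats[j] for j in nbrs) / len(nbrs))
    return out

def ggu(gig_vertices, outer_adj):
    """Global-level update (GGU, simplified): every inner node is pulled
    toward the mean pooled feature of the GIG vertices its vertex connects to,
    making the updated vertices global-context-aware."""
    pooled = [sum(feats) / len(feats) for feats, _ in gig_vertices]
    updated = []
    for i, (feats, adj) in enumerate(gig_vertices):
        ctx_ids = [j for j in range(len(gig_vertices)) if outer_adj[i][j]] + [i]
        ctx = sum(pooled[j] for j in ctx_ids) / len(ctx_ids)
        updated.append(([0.5 * (f + ctx) for f in feats], adj))
    return updated

# One GIG hidden layer on a toy sample with two GIG vertices
ring = [[0, 1], [1, 0]]                       # inner/outer adjacency (2 nodes)
sample = [([1.0, 3.0], ring), ([5.0, 5.0], ring)]
sample = [(gvu(f, a), a) for f, a in sample]  # GVU pass: smooth inside each graph
sample = ggu(sample, ring)                    # GGU pass: mix context across vertices
```

After the GGU pass, features of connected GIG vertices move toward each other, which is the "global context-aware" effect the abstract describes.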
"Graph in Graph Neural Network," International Journal of Computer Vision.
Pub Date: 2026-03-07 · DOI: 10.1007/s11263-025-02681-3
Lalit Manam, Venu Madhav Govindu
"Unifying Viewgraph Sparsification and Disambiguation of Repeated Structures in Structure-from-Motion," International Journal of Computer Vision.
Pub Date: 2026-03-07 · DOI: 10.1007/s11263-025-02699-7
Yiwei Ma, Jiayi Ji, Zhipeng Qian, Xiaoshuai Sun, Rongrong Ji
"CoP: Chain of Perception for Referring 3D Instance Segmentation," International Journal of Computer Vision.
Pub Date: 2026-03-07 · DOI: 10.1007/s11263-025-02716-9
Yan Xia, Ran Ding, Ziyuan Qin, Guanqi Zhan, Kaichen Zhou, Long Yang, Hao Dong, Daniel Cremers
"TARGO and TARGO-Net: Benchmarking Target-Driven Object Grasping Under Occlusions," International Journal of Computer Vision.