
Latest publications in IEEE Transactions on Pattern Analysis and Machine Intelligence

A Comprehensive Survey of Forgetting in Deep Learning Beyond Continual Learning
IF 23.6 | CAS Tier 1 (Computer Science) | JCR Q1 (Computer Science, Artificial Intelligence) | Pub Date: 2024-11-14 | DOI: 10.1109/tpami.2024.3498346
Zhenyi Wang, Enneng Yang, Li Shen, Heng Huang
{"title":"A Comprehensive Survey of Forgetting in Deep Learning Beyond Continual Learning","authors":"Zhenyi Wang, Enneng Yang, Li Shen, Heng Huang","doi":"10.1109/tpami.2024.3498346","DOIUrl":"https://doi.org/10.1109/tpami.2024.3498346","url":null,"abstract":"","PeriodicalId":13426,"journal":{"name":"IEEE Transactions on Pattern Analysis and Machine Intelligence","volume":"3 1","pages":""},"PeriodicalIF":23.6,"publicationDate":"2024-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142637282","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
DiffI2I: Efficient Diffusion Model for Image-to-Image Translation
IF 23.6 | CAS Tier 1 (Computer Science) | JCR Q1 (Computer Science, Artificial Intelligence) | Pub Date: 2024-11-14 | DOI: 10.1109/tpami.2024.3498003
Bin Xia, Yulun Zhang, Shiyin Wang, Yitong Wang, Xinglong Wu, Yapeng Tian, Wenming Yang, Radu Timofte, Luc Van Gool
{"title":"DiffI2I: Efficient Diffusion Model for Image-to-Image Translation","authors":"Bin Xia, Yulun Zhang, Shiyin Wang, Yitong Wang, Xinglong Wu, Yapeng Tian, Wenming Yang, Radu Timotfe, Luc Van Gool","doi":"10.1109/tpami.2024.3498003","DOIUrl":"https://doi.org/10.1109/tpami.2024.3498003","url":null,"abstract":"","PeriodicalId":13426,"journal":{"name":"IEEE Transactions on Pattern Analysis and Machine Intelligence","volume":"109 1","pages":""},"PeriodicalIF":23.6,"publicationDate":"2024-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142637281","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
PATNAS: A Path-Based Training-Free Neural Architecture Search
IF 23.6 | CAS Tier 1 (Computer Science) | JCR Q1 (Computer Science, Artificial Intelligence) | Pub Date: 2024-11-14 | DOI: 10.1109/tpami.2024.3498035
Jiechao Yang, Yong Liu, Wei Wang, Haoran Wu, Zhiyuan Chen, Xibo Ma
{"title":"PATNAS: A Path-Based Training-Free Neural Architecture Search","authors":"Jiechao Yang, Yong Liu, Wei Wang, Haoran Wu, Zhiyuan Chen, Xibo Ma","doi":"10.1109/tpami.2024.3498035","DOIUrl":"https://doi.org/10.1109/tpami.2024.3498035","url":null,"abstract":"","PeriodicalId":13426,"journal":{"name":"IEEE Transactions on Pattern Analysis and Machine Intelligence","volume":"246 1","pages":""},"PeriodicalIF":23.6,"publicationDate":"2024-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142637284","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Introspective Deep Metric Learning
IF 23.6 | CAS Tier 1 (Computer Science) | JCR Q1 (Computer Science, Artificial Intelligence) | Pub Date: 2023-09-05 | DOI: 10.48550/arXiv.2205.04449
Cheng-Hao Wang, Wenzhao Zheng, Zheng Hua Zhu, Jie Zhou, Jiwen Lu
This paper proposes an introspective deep metric learning (IDML) framework for uncertainty-aware comparisons of images. Conventional deep metric learning methods focus on learning a discriminative embedding to describe the semantic features of images, which ignore the existence of uncertainty in each image resulting from noise or semantic ambiguity. Training without awareness of these uncertainties causes the model to overfit the annotated labels during training and produce overconfident judgments during inference. Motivated by this, we argue that a good similarity model should consider the semantic discrepancies with awareness of the uncertainty to better deal with ambiguous images for more robust training. To achieve this, we propose to represent an image using not only a semantic embedding but also an accompanying uncertainty embedding, which describes the semantic characteristics and ambiguity of an image, respectively. We further propose an introspective similarity metric to make similarity judgments between images considering both their semantic differences and ambiguities. The gradient analysis of the proposed metric shows that it enables the model to learn at an adaptive and slower pace to deal with the uncertainty during training. Our framework attains state-of-the-art performance on the widely used CUB-200-2011, Cars196, and Stanford Online Products datasets for image retrieval. We further evaluate our framework for image classification on the ImageNet-1K, CIFAR-10, and CIFAR-100 datasets, which shows that equipping existing data mixing methods with the proposed introspective metric consistently achieves better results (e.g., +0.44% for CutMix on ImageNet-1K).
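A minimal sketch of the uncertainty-aware comparison described above, assuming a simple additive combination of the semantic gap and the two per-image uncertainty magnitudes; the function name, the additive form, and the toy data are illustrative and not the authors' released implementation.

```python
# Hypothetical sketch of an uncertainty-aware ("introspective") distance:
# each image is represented by a semantic embedding z and an uncertainty
# embedding u, and the distance grows with both the semantic gap and the
# ambiguity of either image. The additive form is an assumption for exposition.
import torch

def introspective_distance(z_a, u_a, z_b, u_b):
    """Distance between two images given (semantic, uncertainty) embeddings."""
    semantic_gap = (z_a - z_b).pow(2).sum(dim=-1)                  # squared L2 between semantics
    ambiguity = u_a.pow(2).sum(dim=-1) + u_b.pow(2).sum(dim=-1)    # per-image uncertainty magnitude
    return semantic_gap + ambiguity

# toy usage: a batch of 4 pairs with 128-D embeddings
z_a, z_b = torch.randn(4, 128), torch.randn(4, 128)
u_a, u_b = torch.rand(4, 128) * 0.1, torch.rand(4, 128) * 0.1
print(introspective_distance(z_a, u_a, z_b, u_b))
```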
Citations: 2
Learning to Solve Hard Minimal Problems.
IF 23.6 | CAS Tier 1 (Computer Science) | JCR Q1 (Computer Science, Artificial Intelligence) | Pub Date: 2023-08-23 | DOI: 10.1109/TPAMI.2023.3307898
Petr Hruby, Timothy Duff, Anton Leykin, Tomas Pajdla

We present an approach to solving hard geometric optimization problems in the RANSAC framework. The hard minimal problems arise from relaxing the original geometric optimization problem into a minimal problem with many spurious solutions. Our approach avoids computing large numbers of spurious solutions. We design a learning strategy for selecting a starting problem-solution pair that can be numerically continued to the problem and the solution of interest. We demonstrate our approach by developing a RANSAC solver for the problem of computing the relative pose of three calibrated cameras, via a minimal relaxation using four points in each view. On average, we can solve a single problem in under 70 μs. We also benchmark and study our engineering choices on the very familiar problem of computing the relative pose of two calibrated cameras, via the minimal case of five points in two views.
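A toy sketch of the "select a starting problem-solution pair, then numerically continue it to the target problem" idea, shown on the scalar family x^2 = p rather than a real minimal problem; the learned selector is stubbed out as a nearest-anchor lookup, and all names and step counts are illustrative rather than the authors' solver.

```python
# Toy sketch of anchor selection + numerical continuation on F_p(x) = x^2 - p.
# A real instance would use a trained classifier over precomputed minimal-problem
# anchors; here the selector is a nearest-parameter lookup (illustrative only).
import numpy as np

anchors = [(1.0, 1.0), (4.0, -2.0)]  # (parameter p0, known solution x0) pairs

def select_anchor(p_target):
    # stand-in for the learned selector: pick the anchor with the closest parameter
    return min(anchors, key=lambda a: abs(a[0] - p_target))

def continue_solution(p_target, steps=20, newton_iters=3):
    p0, x = select_anchor(p_target)
    for k in range(1, steps + 1):
        p = p0 + (p_target - p0) * k / steps        # move along the parameter path
        for _ in range(newton_iters):               # Newton corrector on x^2 - p = 0
            x = x - (x * x - p) / (2 * x)
    return x

print(continue_solution(9.0))   # continues from anchor (4, -2) toward x = -3
print(continue_solution(2.0))   # continues from anchor (1, 1) toward sqrt(2)
```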

Citations: 0
Coarse-to-Fine Multi-Scene Pose Regression with Transformers
IF 23.6 | CAS Tier 1 (Computer Science) | JCR Q1 (Computer Science, Artificial Intelligence) | Pub Date: 2023-08-22 | DOI: 10.48550/arXiv.2308.11783
Yoli Shavit, Ron Ferens, Y. Keller
Absolute camera pose regressors estimate the position and orientation of a camera given the captured image alone. Typically, a convolutional backbone with a multi-layer perceptron (MLP) head is trained using images and pose labels to embed a single reference scene at a time. Recently, this scheme was extended to learn multiple scenes by replacing the MLP head with a set of fully connected layers. In this work, we propose to learn multi-scene absolute camera pose regression with Transformers, where encoders are used to aggregate activation maps with self-attention and decoders transform latent features and scene encodings into pose predictions. This allows our model to focus on general features that are informative for localization, while embedding multiple scenes in parallel. We extend our previous MS-Transformer approach [1] by introducing a mixed classification-regression architecture that improves the localization accuracy. Our method is evaluated on commonly benchmarked indoor and outdoor datasets and has been shown to exceed both multi-scene and state-of-the-art single-scene absolute pose regressors. We make our code publicly available from here.
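A hedged sketch of a multi-scene pose-regression head in the spirit of the abstract: an encoder aggregates backbone activations with self-attention, a decoder with one learned query per scene produces per-scene latents, and a mixed classification (which scene) plus regression (position, orientation) head reads the pose off the selected latent. Dimensions, module layout, and names are assumptions, not the authors' architecture.

```python
# Illustrative multi-scene pose regressor with a mixed classification-regression head.
import torch
import torch.nn as nn

class MultiScenePoseRegressor(nn.Module):
    def __init__(self, feat_dim=256, num_scenes=4, num_layers=2, num_heads=4):
        super().__init__()
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(feat_dim, num_heads, batch_first=True), num_layers)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(feat_dim, num_heads, batch_first=True), num_layers)
        self.scene_queries = nn.Parameter(torch.randn(num_scenes, feat_dim))
        self.scene_cls = nn.Linear(feat_dim, 1)   # coarse: which scene the image belongs to
        self.pos_head = nn.Linear(feat_dim, 3)    # fine: translation x, y, z
        self.rot_head = nn.Linear(feat_dim, 4)    # fine: orientation quaternion

    def forward(self, feats):                     # feats: (B, N_tokens, feat_dim) from a CNN backbone
        memory = self.encoder(feats)              # self-attention over activation-map tokens
        queries = self.scene_queries.unsqueeze(0).expand(feats.size(0), -1, -1)
        latents = self.decoder(queries, memory)   # (B, num_scenes, feat_dim)
        scene_logits = self.scene_cls(latents).squeeze(-1)       # (B, num_scenes)
        best = scene_logits.argmax(dim=-1)                       # coarse scene classification
        chosen = latents[torch.arange(feats.size(0)), best]      # latent of the selected scene
        return scene_logits, self.pos_head(chosen), self.rot_head(chosen)

model = MultiScenePoseRegressor()
logits, t, q = model(torch.randn(2, 49, 256))     # e.g. a 7x7 backbone grid flattened to 49 tokens
print(logits.shape, t.shape, q.shape)
```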
Citations: 0
CoIR: Compressive Implicit Radar.
IF 23.6 | CAS Tier 1 (Computer Science) | JCR Q1 (Computer Science, Artificial Intelligence) | Pub Date: 2023-08-10 | DOI: 10.1109/TPAMI.2023.3301553
Sean M Farrell, Vivek Boominathan, Nathaniel Raymondi, Ashutosh Sabharwal, Ashok Veeraraghavan

Using millimeter wave (mmWave) signals for imaging has an important advantage in that they can penetrate through poor environmental conditions such as fog, dust, and smoke that severely degrade optical-based imaging systems. However, mmWave radars, contrary to cameras and LiDARs, suffer from low angular resolution because of small physical apertures and conventional signal processing techniques. Sparse radar imaging, on the other hand, can increase the aperture size while minimizing the power consumption and readout bandwidth. This paper presents CoIR, an analysis-by-synthesis method that leverages the implicit neural network bias in convolutional decoders and compressed sensing to perform high-accuracy sparse radar imaging. The proposed system is dataset-agnostic and does not require any auxiliary sensors for training or testing. We introduce a sparse array design that allows for a 5.5× reduction in the number of antenna elements needed compared to conventional MIMO array designs. We demonstrate our system's improved imaging performance over standard mmWave radars and other competitive untrained methods on both simulated and experimental mmWave radar data.
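A hedged sketch of the analysis-by-synthesis idea: an untrained convolutional decoder serves as an implicit image prior, and its weights are optimized so that the synthesized scene, pushed through a known compressive measurement operator, matches the sparse measurements. The operator, sizes, and loss below are stand-ins for exposition, not the CoIR forward model.

```python
# Deep-image-prior-style reconstruction from compressive measurements (illustrative).
import torch
import torch.nn as nn

H = W = 32
decoder = nn.Sequential(                       # small untrained conv decoder as the implicit prior
    nn.Conv2d(8, 32, 3, padding=1), nn.ReLU(),
    nn.Upsample(scale_factor=2, mode="nearest"),
    nn.Conv2d(32, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 1, 3, padding=1),
)
z = torch.randn(1, 8, H // 2, W // 2)          # fixed random latent code

A = torch.randn(256, H * W) / (H * W) ** 0.5   # stand-in compressive measurement matrix
scene_true = torch.zeros(H * W); scene_true[torch.randint(0, H * W, (10,))] = 1.0
y = A @ scene_true                             # simulated sparse measurements

opt = torch.optim.Adam(decoder.parameters(), lr=1e-3)
for step in range(200):                        # fit decoder weights to the measurements only
    scene_hat = decoder(z).reshape(-1)
    loss = torch.nn.functional.mse_loss(A @ scene_hat, y)
    opt.zero_grad(); loss.backward(); opt.step()
print("final measurement loss:", loss.item())
```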

Citations: 0
Transformer-Empowered Invariant Grounding for Video Question Answering.
IF 23.6 | CAS Tier 1 (Computer Science) | JCR Q1 (Computer Science, Artificial Intelligence) | Pub Date: 2023-08-09 | DOI: 10.1109/TPAMI.2023.3303451
Yicong Li, Xiang Wang, Junbin Xiao, Wei Ji, Tat-Seng Chua

Video Question Answering (VideoQA) is the task of answering questions about a video. At its core is the understanding of the alignments between video scenes and question semantics to yield the answer. In leading VideoQA models, the typical learning objective, empirical risk minimization (ERM), tends to over-exploit the spurious correlations between question-irrelevant scenes and answers, instead of inspecting the causal effect of question-critical scenes, which undermines the prediction with unreliable reasoning. In this work, we take a causal look at VideoQA and propose a modal-agnostic learning framework, named Invariant Grounding for VideoQA (IGV), to ground the question-critical scene, whose causal relations with answers are invariant across different interventions on the complement. With IGV, leading VideoQA models are forced to shield the answering from the negative influence of spurious correlations, which significantly improves their reasoning ability. To unleash the potential of this framework, we further provide a Transformer-Empowered Invariant Grounding for VideoQA (TIGV), a substantial instantiation of IGV framework that naturally integrates the idea of invariant grounding into a transformer-style backbone. Experiments on four benchmark datasets validate our design in terms of accuracy, visual explainability, and generalization ability over the leading baselines. Our code is available at https://github.com/yl3800/TIGV.
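An illustrative sketch of the invariant-grounding training signal: a grounding module scores frames against the question, the top-scoring frames form the question-critical scene, the remaining complement is swapped with frames from another video (an intervention), and the answer prediction is encouraged to stay the same under that swap. The module names, the top-k split, and the KL penalty are assumptions for exposition, not the released TIGV code.

```python
# Invariance-under-intervention loss on the scene complement (illustrative).
import torch
import torch.nn.functional as F

def invariance_loss(frame_feats, question_feat, answer_head, grounder, k=4):
    """frame_feats: (B, T, D); question_feat: (B, D)."""
    B, T, D = frame_feats.shape
    scores = grounder(frame_feats, question_feat)              # (B, T) frame relevance
    topk = scores.topk(k, dim=1).indices                       # question-critical frames
    mask = torch.zeros_like(scores).scatter(1, topk, 1.0).unsqueeze(-1)

    causal = frame_feats * mask                                # keep only the critical frames
    shuffled = frame_feats[torch.randperm(B)]                  # another video's frames
    intervened = causal + shuffled * (1 - mask)                # intervene on the complement

    logits_causal = answer_head(causal.mean(1), question_feat)
    logits_interv = answer_head(intervened.mean(1), question_feat)
    # the answer should be invariant to interventions on the complement
    return F.kl_div(logits_interv.log_softmax(-1),
                    logits_causal.softmax(-1), reduction="batchmean")

# toy modules to make the sketch runnable
grounder = lambda f, q: (f * q.unsqueeze(1)).sum(-1)
lin = torch.nn.Linear(32, 10)                                  # 16-D video + 16-D question -> 10 answers
answer_head = lambda v, q: lin(torch.cat([v, q], -1))
print(invariance_loss(torch.randn(3, 8, 16), torch.randn(3, 16), answer_head, grounder))
```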

Citations: 0
Count-Free Single-Photon 3D Imaging with Race Logic.
IF 23.6 | CAS Tier 1 (Computer Science) | JCR Q1 (Computer Science, Artificial Intelligence) | Pub Date: 2023-08-07 | DOI: 10.1109/TPAMI.2023.3302822
Atul Ingle, David Maier

Single-photon cameras (SPCs) have emerged as a promising new technology for high-resolution 3D imaging. A single-photon 3D camera determines the round-trip time of a laser pulse by precisely capturing the arrival of individual photons at each camera pixel. Constructing photon-timestamp histograms is a fundamental operation for a single-photon 3D camera. However, in-pixel histogram processing is computationally expensive and requires large amount of memory per pixel. Digitizing and transferring photon timestamps to an off-sensor histogramming module is bandwidth and power hungry. Can we estimate distances without explicitly storing photon counts? Yes-here we present an online approach for distance estimation suitable for resource-constrained settings with limited bandwidth, memory and compute. The two key ingredients of our approach are (a) processing photon streams using race logic, which maintains photon data in the time-delay domain, and (b) constructing count-free equi-depth histograms as opposed to conventional equi-width histograms. Equi-depth histograms are a more succinct representation for "peaky" distributions, such as those obtained by an SPC pixel from a laser pulse reflected by a surface. Our approach uses a binner element that converges on the median (or, more generally, to another k-quantile) of a distribution. We cascade multiple binners to form an equi-depth histogrammer that produces multi-bin histograms. Our evaluation shows that this method can provide at least an order of magnitude reduction in bandwidth and power consumption while maintaining similar distance reconstruction accuracy as conventional histogram-based processing methods.
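A count-free sketch of a single "binner" element converging to a running quantile of a photon-timestamp stream with O(1) state: the bin boundary steps up when a timestamp lands above it and down when it lands below, with asymmetric step sizes that equilibrate at the target quantile. The step sizes and software update rule are illustrative assumptions; the paper realizes this in the time-delay domain with race logic and cascades binners to form an equi-depth histogrammer.

```python
# Frugal-style streaming quantile estimator as a stand-in for one binner element.
import random

def streaming_quantile(stream, q=0.5, step=0.05):
    """Converge toward the q-quantile of the stream without storing counts."""
    estimate = 0.0
    for x in stream:
        if x > estimate:
            estimate += step * q           # move up, scaled by the target quantile
        elif x < estimate:
            estimate -= step * (1 - q)     # move down, scaled by the complement
    return estimate

random.seed(0)
timestamps = [random.gauss(10.0, 1.0) for _ in range(20000)]    # "peaky" arrival-time distribution
print(streaming_quantile(timestamps, q=0.5))    # approaches the median, about 10.0
print(streaming_quantile(timestamps, q=0.25))   # approaches the lower quartile, about 9.3
```

Cascading several such estimators at different q values yields the bin edges of an equi-depth histogram.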

Citations: 0
Aberration-Aware Depth-From-Focus.
IF 23.6 | CAS Tier 1 (Computer Science) | JCR Q1 (Computer Science, Artificial Intelligence) | Pub Date: 2023-08-04 | DOI: 10.1109/TPAMI.2023.3301931
Xinge Yang, Qiang Fu, Mohamed Elhoseiny, Wolfgang Heidrich

Computer vision methods for depth estimation usually use simple camera models with idealized optics. For modern machine learning approaches, this creates an issue when attempting to train deep networks with simulated data, especially for focus-sensitive tasks like Depth-from-Focus. In this work, we investigate the domain gap caused by off-axis aberrations that will affect the decision of the best-focused frame in a focal stack. We then explore bridging this domain gap through aberration-aware training (AAT). Our approach involves a lightweight network that models lens aberrations at different positions and focus distances, which is then integrated into the conventional network training pipeline. We evaluate the generality of network models on both synthetic and real-world data. The experimental results demonstrate that the proposed AAT scheme can improve depth estimation accuracy without fine-tuning the model for different datasets. The code will be available in github.com/vccimaging/Aberration-Aware-Depth-from-Focus.
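A hedged sketch of the aberration-aware simulation step: a small network maps a normalized field position and a focus distance to a spatially varying blur kernel (a stand-in for the lens PSF), which is applied to clean rendered patches before they feed the depth-from-focus training loop. The kernel size, the MLP layout, and the per-patch application are illustrative assumptions, not the authors' pipeline.

```python
# Position- and focus-dependent PSF model used to aberrate simulated training data (illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F

K = 7  # PSF kernel size

class PSFNet(nn.Module):
    """(x, y, focus_distance) -> normalized K x K blur kernel."""
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, K * K))

    def forward(self, coords):                      # coords: (N, 3)
        k = self.mlp(coords).view(-1, 1, K, K)
        return torch.softmax(k.flatten(1), dim=1).view(-1, 1, K, K)   # kernels sum to 1

psf_net = PSFNet()
clean_patch = torch.rand(1, 1, 64, 64)              # a rendered, aberration-free patch
coords = torch.tensor([[0.8, 0.0, 1.5]])            # off-axis field position, 1.5 m focus distance
kernel = psf_net(coords)
aberrated = F.conv2d(clean_patch, kernel, padding=K // 2)   # simulate the off-axis blur
print(aberrated.shape)   # the aberrated patch then feeds the depth-from-focus training loop
```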

Citations: 0