Search and recovery network for camouflaged object detection
Pub Date : 2024-09-01 DOI: 10.1016/j.imavis.2024.105247
Guangrui Liu, Wei Wu
Camouflaged object detection aims to accurately identify objects blending into the background. However, existing methods often struggle, especially with small or multiple objects, due to their reliance on singular strategies. To address this, we introduce a novel Search and Recovery Network (SRNet) using a bionic approach and auxiliary features. SRNet comprises three key modules: the Region Search Module (RSM), the Boundary Recovery Module (BRM), and the Camouflaged Object Predictor (COP). The RSM mimics predator behavior to locate potential object regions, enhancing object localization. The BRM refines texture features and recovers object boundaries. The COP fuses multilevel features to predict the final segmentation maps. Experimental results on three benchmark datasets show SRNet's superiority over state-of-the-art (SOTA) models, particularly for small and multiple objects. Notably, SRNet achieves these performance improvements without significantly increasing the number of model parameters. Moreover, the method exhibits promising performance in downstream tasks such as defect detection, polyp segmentation and military camouflage detection.
{"title":"Search and recovery network for camouflaged object detection","authors":"Guangrui Liu, Wei Wu","doi":"10.1016/j.imavis.2024.105247","DOIUrl":"10.1016/j.imavis.2024.105247","url":null,"abstract":"<div><p>Camouflaged object detection aims to accurately identify objects blending into the background. However, existing methods often struggle, especially with small object or multiple objects, due to their reliance on singular strategies. To address this, we introduce a novel Search and Recovery Network (SRNet) using a bionic approach and auxiliary features. SRNet comprises three key modules: the Region Search Module (RSM), Boundary Recovery Module (BRM), and Camouflaged Object Predictor (COP). The RSM mimics predator behavior to locate potential object regions, enhancing object location detection. The BRM refines texture features and recovers object boundaries. The COP fuse multilevel features to predict final segmentation maps. Experimental results on three benchmark datasets show SRNet's superiority over SOTA models, particularly with small and multiple objects. Notably, SRNet achieves performance improvements without significantly increasing model parameters. Moreover, the method exhibits promising performance in downstream tasks such as defect detection, polyp segmentation and military camouflage detection.</p></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"151 ","pages":"Article 105247"},"PeriodicalIF":4.2,"publicationDate":"2024-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142158378","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
GSTGM: Graph, spatial–temporal attention and generative based model for pedestrian multi-path prediction
Pub Date : 2024-08-31 DOI: 10.1016/j.imavis.2024.105245
Muhammad Haris Kaka Khel , Paul Greaney , Marion McAfee , Sandra Moffett , Kevin Meehan
Pedestrian trajectory prediction in urban environments has emerged as a critical research area with extensive applications across various domains. Accurate prediction of pedestrian trajectories is essential for the safe navigation of autonomous vehicles and robots in pedestrian-populated environments. Effective prediction models must capture both the spatial interactions among pedestrians and the temporal dependencies governing their movements. Existing research primarily focuses on forecasting a single trajectory per pedestrian, limiting its applicability in real-world scenarios characterised by diverse and unpredictable pedestrian behaviours. To address these challenges, this paper introduces the Graph Convolutional Network, Spatial–Temporal Attention, and Generative Model (GSTGM) for pedestrian trajectory prediction. GSTGM employs a spatiotemporal graph convolutional network to effectively capture complex interactions between pedestrians and their environment. Additionally, it integrates a spatial–temporal attention mechanism to prioritise relevant information during the prediction process. By incorporating a time-dependent prior within the latent space and utilising a computationally efficient generative model, GSTGM facilitates the generation of diverse and realistic future trajectories. The effectiveness of GSTGM is validated through experiments on real-world scenario datasets. Compared to the state-of-the-art models on benchmark datasets such as ETH/UCY, GSTGM demonstrates superior performance in accurately predicting multiple potential trajectories for individual pedestrians. This superiority is measured using metrics such as Final Displacement Error (FDE) and Average Displacement Error (ADE). Moreover, GSTGM achieves these results with significantly faster processing speeds.
{"title":"GSTGM: Graph, spatial–temporal attention and generative based model for pedestrian multi-path prediction","authors":"Muhammad Haris Kaka Khel , Paul Greaney , Marion McAfee , Sandra Moffett , Kevin Meehan","doi":"10.1016/j.imavis.2024.105245","DOIUrl":"10.1016/j.imavis.2024.105245","url":null,"abstract":"<div><p>Pedestrian trajectory prediction in urban environments has emerged as a critical research area with extensive applications across various domains. Accurate prediction of pedestrian trajectories is essential for the safe navigation of autonomous vehicles and robots in pedestrian-populated environments. Effective prediction models must capture both the spatial interactions among pedestrians and the temporal dependencies governing their movements. Existing research primarily focuses on forecasting a single trajectory per pedestrian, limiting its applicability in real-world scenarios characterised by diverse and unpredictable pedestrian behaviours. To address these challenges, this paper introduces the Graph Convolutional Network, Spatial–Temporal Attention, and Generative Model (GSTGM) for pedestrian trajectory prediction. GSTGM employs a spatiotemporal graph convolutional network to effectively capture complex interactions between pedestrians and their environment. Additionally, it integrates a spatial–temporal attention mechanism to prioritise relevant information during the prediction process. By incorporating a time-dependent prior within the latent space and utilising a computationally efficient generative model, GSTGM facilitates the generation of diverse and realistic future trajectories. The effectiveness of GSTGM is validated through experiments on real-world scenario datasets. Compared to the state-of-the-art models on benchmark datasets such as ETH/UCY, GSTGM demonstrates superior performance in accurately predicting multiple potential trajectories for individual pedestrians. This superiority is measured using metrics such as Final Displacement Error (FDE) and Average Displacement Error (ADE). Moreover, GSTGM achieves these results with significantly faster processing speeds.</p></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"151 ","pages":"Article 105245"},"PeriodicalIF":4.2,"publicationDate":"2024-08-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0262885624003500/pdfft?md5=be799dd771bacffe5a12fc1424240e2d&pid=1-s2.0-S0262885624003500-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142158377","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Probability based dynamic soft label assignment for object detection
Pub Date : 2024-08-31 DOI: 10.1016/j.imavis.2024.105240
Yi Li , Sile Ma , Xiangyuan Jiang , Yizhong Luan , Zecui Jiang
By defining effective supervision labels for network training, the performance of object detectors can be improved without incurring additional inference costs. Current label assignment strategies generally require two steps: first, constructing a bag of positive sample candidates, and then designing labels for these samples. However, constructing the candidate bag of positive samples may introduce noisy samples into the label assignment process. We explore a single-step label assignment approach: directly generating a probability map as labels for all samples. We design the label assignment approach from the following perspectives: first, it should reduce the impact of noisy samples; second, each sample should be treated differently, because each one matches the target to a different extent, which helps the network learn more valuable information from high-quality samples. We propose a probability-based dynamic soft label assignment method. Instead of dividing the samples into positives and negatives, a probability map, calculated from prediction quality and prior knowledge, is used to supervise all anchor points of the classification branch. The weight of prior knowledge in the labels decreases as the network improves the quality of its instance predictions, which reduces the noise introduced by prior knowledge. By using continuous probability values as labels to supervise the classification branch, the network is able to focus on high-quality samples. As demonstrated by experiments on the MS COCO benchmark, our label assignment method achieves 40.9% AP with a ResNet-50 backbone under the 1x schedule, improving FCOS by approximately 2.0% AP. The code is available at https://github.com/Liyi4578/PDSLA.
{"title":"Probability based dynamic soft label assignment for object detection","authors":"Yi Li , Sile Ma , Xiangyuan Jiang , Yizhong Luan , Zecui Jiang","doi":"10.1016/j.imavis.2024.105240","DOIUrl":"10.1016/j.imavis.2024.105240","url":null,"abstract":"<div><p>By defining effective supervision labels for network training, the performance of object detectors can be improved without incurring additional inference costs. Current label assignment strategies generally require two steps: first, constructing a positive sample candidate bag, and then designing labels for these samples. However, the construction of candidate bag of positive samples may result in some noisy samples being introduced into the label assignment process. We explore a single-step label assignment approach: directly generating a probability map as labels for all samples. We design the label assignment approach from the following perspectives: Firstly, it should be able to reduce the impact of noise samples. Secondly, each sample should be treated differently because each one matches the target to a different extent, which assists the network to learn more valuable information from high-quality samples. We propose a probability-based dynamic soft label assignment method. Instead of dividing the samples into positive and negative samples, a probability map, which is calculated based on prediction quality and prior knowledge, is used to supervise all anchor points of the classification branch. The weight of prior knowledge in the labels decreases as the network improves the quality of instance predictions, as a way to reduce noise samples introduced by prior knowledge. By using continuous probability values as labels to supervise the classification branch, the network is able to focus on high-quality samples. As demonstrated in the experiments on the MS COCO benchmark, our label assignment method achieves 40.9% AP in the ResNet-50 under 1x schedule, which improves FCOS performance by approximately 2.0% AP. The code has been available at <span><span><span>https://github.com/Liyi4578/PDSLA</span></span><svg><path></path></svg></span>.</p></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"150 ","pages":"Article 105240"},"PeriodicalIF":4.2,"publicationDate":"2024-08-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142097496","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
CRENet: Crowd region enhancement network for multi-person 3D pose estimation
Pub Date : 2024-08-30 DOI: 10.1016/j.imavis.2024.105243
Zhaokun Li, Qiong Liu
Recovering multi-person 3D poses from a single image is a challenging problem due to inherent depth ambiguities, including root-relative depth and absolute root depth. Current bottom-up methods show promising potential to mitigate absolute root depth ambiguity by explicitly aggregating global contextual cues. However, these methods treat the entire image region equally during root depth regression, ignoring the negative impact of irrelevant regions. Moreover, they learn shared features for both depths, even though each depth focuses on different information. This sharing mechanism may result in negative transfer, thus diminishing root depth prediction accuracy. To address these challenges, we present a novel bottom-up method, the Crowd Region Enhancement Network (CRENet), incorporating a Feature Decoupling Module (FDM) and a Global Attention Module (GAM). The FDM explicitly learns a discriminative feature for each depth by adaptively recalibrating its channel-wise responses and fusing multi-level features, which makes the model focus on each depth prediction separately and thus avoids the adverse effects of negative transfer. The GAM highlights crowd regions while suppressing irrelevant regions using the attention mechanism, and further refines the attention regions based on a confidence measure of the attention, which is beneficial for learning depth-related cues from informative crowd regions and facilitates root depth estimation. Comprehensive experiments on the MuPoTS-3D and CMU Panoptic benchmarks demonstrate that our method outperforms state-of-the-art bottom-up methods in absolute 3D pose estimation and is applicable to in-the-wild images, which also indicates that learning depth-specific features and suppressing noise signals can significantly benefit multi-person absolute 3D pose estimation.
{"title":"CRENet: Crowd region enhancement network for multi-person 3D pose estimation","authors":"Zhaokun Li, Qiong Liu","doi":"10.1016/j.imavis.2024.105243","DOIUrl":"10.1016/j.imavis.2024.105243","url":null,"abstract":"<div><p>Recovering multi-person 3D poses from a single image is a challenging problem due to inherent depth ambiguities, including root-relative depth and absolute root depth. Current bottom-up methods show promising potential to mitigate absolute root depth ambiguity through explicitly aggregating global contextual cues. However, these methods treat the entire image region equally during root depth regression, ignoring the negative impact of irrelevant regions. Moreover, they learn shared features for both depths, each of which focuses on different information. This sharing mechanism may result in negative transfer, thus diminishing root depth prediction accuracy. To address these challenges, we present a novel bottom-up method, Crowd Region Enhancement Network (CRENet), incorporating a Feature Decoupling Module (FDM) and a Global Attention Module (GAM). FDM explicitly learns the discriminative feature for each depth through adaptively recalibrating its channel-wise responses and fusing multi-level features, which makes the model focus on each depth prediction separately and thus avoids the adverse effect of negative transfer. GAM highlights crowd regions while suppressing irrelevant regions using the attention mechanism and further refines the attention regions based on the confidence measure about the attention, which is beneficial to learn depth-related cues from informative crowd regions and facilitate root depth estimation. Comprehensive experiments on benchmarks MuPoTS-3D and CMU Panoptic demonstrate that our method outperforms the state-of-the-art bottom-up methods in absolute 3D pose estimation and is applicable to in-the-wild images, which also indicates that learning depth-specific features and suppressing the noise signals can significantly benefit multi-person absolute 3D pose estimation.</p></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"151 ","pages":"Article 105243"},"PeriodicalIF":4.2,"publicationDate":"2024-08-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142158433","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dual subspace clustering for spectral-spatial hyperspectral image clustering
Pub Date : 2024-08-28 DOI: 10.1016/j.imavis.2024.105235
Shujun Liu
Subspace clustering assumes that hyperspectral image (HSI) pixels lie in a union of multiple sample subspaces, without considering their dual space, i.e., the spectral space. In this article, we propose a promising dual subspace clustering (DualSC) method that improves spectral-spatial HSI clustering by relaxing subspace clustering. To this end, DualSC simultaneously optimizes row and column subspace representations of HSI superpixels to capture the intrinsic connection between spectral and spatial information. From this new perspective, the original subspace clustering can be treated as a special case of DualSC, which has a larger solution space and thus tends to find a better sample representation matrix for spectral clustering. Besides, we provide theoretical proofs that the proposed method relaxes subspace clustering with a dual subspace and can recover subspace-sparse representations of HSI samples. To the best of our knowledge, this work could be one of the first dual clustering methods to leverage the sample and spectral subspaces simultaneously. We conduct clustering experiments on four canonical data sets, showing that our proposed method, with strong interpretability, reaches performance and computational efficiency comparable to other state-of-the-art methods.
{"title":"Dual subspace clustering for spectral-spatial hyperspectral image clustering","authors":"Shujun Liu","doi":"10.1016/j.imavis.2024.105235","DOIUrl":"10.1016/j.imavis.2024.105235","url":null,"abstract":"<div><p>Subspace clustering supposes that hyperspectral image (HSI) pixels lie in a union vector spaces of multiple sample subspaces without considering their dual space, i.e., spectral space. In this article, we propose a promising dual subspace clustering (DualSC) for improving spectral-spatial HSIs clustering by relaxing subspace clustering. To this end, DualSC simultaneously optimizes row and column subspace-representations of HSI superpixels to capture the intrinsic connection between spectral and spatial information. From the new perspective, the original subspace clustering can be treated as a special case of DualSC that has larger solution space, so tends to finding better sample representation matrix for applying spectral clustering. Besides, we provide theoretical proofs that show the proposed method relaxes the subspace space clustering with dual subspace, and can recover subspace-sparse representation of HSI samples. To the best of our knowledge, this work could be one of the first dual clustering method leveraging sample and spectral subspaces simultaneously. As a result, we conduct several clustering experiments on four canonical data sets, implying that our proposed method with strong interpretability reaches comparable performance and computing efficiency with other state-of-the-art methods.</p></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"150 ","pages":"Article 105235"},"PeriodicalIF":4.2,"publicationDate":"2024-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142089417","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pro-ReID: Producing reliable pseudo labels for unsupervised person re-identification
Pub Date : 2024-08-28 DOI: 10.1016/j.imavis.2024.105244
Haiming Sun, Shiwei Ma
Mainstream unsupervised person re-identification (ReID) relies on alternating clustering and fine-tuning to improve task performance, but the clustering process inevitably produces noisy pseudo labels, which seriously constrains further gains. To address this concern, the novel Pro-ReID framework is proposed in this work to produce reliable person samples from the pseudo-labeled dataset for learning feature representations. It consists of two modules: Pseudo Labels Correction (PLC) and Pseudo Labels Selection (PLS). Specifically, we further leverage temporal ensemble prior knowledge to promote task performance. The PLC module assigns soft pseudo labels to each sample, with control over soft pseudo label participation, to potentially correct the noisy pseudo labels generated during clustering; the PLS module associates the predictions of the temporal ensemble model with the pseudo label annotations and detects noisy pseudo-label examples as out-of-distribution examples through a Gaussian Mixture Model (GMM) fitted to their loss distribution, thereby supplying reliable pseudo labels for the unsupervised person ReID task. Experimental findings on three person ReID benchmarks (Market-1501, DukeMTMC-reID and MSMT17) and one vehicle ReID benchmark (VeRi-776) establish that the novel Pro-ReID framework achieves competitive performance, in particular an mAP on the challenging MSMT17 that is 4.3% higher than the state-of-the-art methods.
{"title":"Pro-ReID: Producing reliable pseudo labels for unsupervised person re-identification","authors":"Haiming Sun, Shiwei Ma","doi":"10.1016/j.imavis.2024.105244","DOIUrl":"10.1016/j.imavis.2024.105244","url":null,"abstract":"<div><p>Mainstream unsupervised person ReIDentification (ReID) is on the basis of the alternation of clustering and fine-tuning to promote the task performance, but the clustering process inevitably produces noisy pseudo labels, which seriously constrains the further advancement of the task performance. To conquer the above concerns, the novel Pro-ReID framework is proposed to produce reliable person samples from the pseudo-labeled dataset to learn feature representations in this work. It consists of two modules: Pseudo Labels Correction (PLC) and Pseudo Labels Selection (PLS). Specifically, we further leverage the temporal ensemble prior knowledge to promote task performance. The PLC module assigns corresponding soft pseudo labels to each sample with control of soft pseudo label participation to potentially correct for noisy pseudo labels generated during clustering; the PLS module associates the predictions of the temporal ensemble model with pseudo label annotations and it detects noisy pseudo labele examples as out-of-distribution examples through the Gaussian Mixture Model (GMM) to supply reliable pseudo labels for the unsupervised person ReID task in consideration of their loss data distribution. Experimental findings validated on three person (Market-1501, DukeMTMC-reID and MSMT17) and one vehicle (VeRi-776) ReID benchmark establish that the novel Pro-ReID framework achieves competitive performance, in particular the mAP on the ambitious MSMT17 that is 4.3% superior to the state-of-the-art methods.</p></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"150 ","pages":"Article 105244"},"PeriodicalIF":4.2,"publicationDate":"2024-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142129876","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Language conditioned multi-scale visual attention networks for visual grounding
Pub Date : 2024-08-25 DOI: 10.1016/j.imavis.2024.105242
Haibo Yao, Lipeng Wang, Chengtao Cai, Wei Wang, Zhi Zhang, Xiaobing Shang
Visual grounding (VG) is the task of locating a specific region in an image according to a natural language expression. Existing efforts on the VG task are divided into two-stage, one-stage and Transformer-based methods, which have achieved high performance. However, most previous methods extract visual information at a single spatial scale and ignore visual information at other spatial scales, which prevents these models from fully utilizing the visual information. Moreover, the insufficient utilization of linguistic information, especially the failure to capture global linguistic information, may lead to an incomplete understanding of the language expression, thus limiting the performance of these models. To better address the task, we propose a language conditioned multi-scale visual attention network (LMSVA) for visual grounding, which sufficiently utilizes visual and linguistic information to perform multimodal reasoning, thus improving model performance. Specifically, we design a visual feature extractor containing a multi-scale layer to obtain the required multi-scale visual features by expanding the original backbone. Moreover, we pool the output of the pre-trained Bidirectional Encoder Representations from Transformers (BERT) model to extract sentence-level linguistic features, enabling the model to capture global linguistic information. Inspired by the Transformer architecture, we present the Visual Attention Layer guided by Language and Multi-Scale Visual Features (VALMS), which better learns the visual context guided by multi-scale visual and linguistic features and facilitates further multimodal reasoning. Extensive experiments on four large benchmark datasets, including ReferItGame, RefCOCO, RefCOCO+ and RefCOCOg, demonstrate that our proposed model achieves state-of-the-art performance.
{"title":"Language conditioned multi-scale visual attention networks for visual grounding","authors":"Haibo Yao, Lipeng Wang, Chengtao Cai, Wei Wang, Zhi Zhang, Xiaobing Shang","doi":"10.1016/j.imavis.2024.105242","DOIUrl":"10.1016/j.imavis.2024.105242","url":null,"abstract":"<div><p>Visual grounding (VG) is a task that requires to locate a specific region in an image according to a natural language expression. Existing efforts on the VG task are divided into two-stage, one-stage and Transformer-based methods, which have achieved high performance. However, most of the previous methods extract visual information at a single spatial scale and ignore visual information at other spatial scales, which makes these models unable to fully utilize the visual information. Moreover, the insufficient utilization of linguistic information, especially failure to capture global linguistic information, may lead to failure to fully understand language expressions, thus limiting the performance of these models. To better address the task, we propose a language conditioned multi-scale visual attention network (LMSVA) for visual grounding, which can sufficiently utilize visual and linguistic information to perform multimodal reasoning, thus improving performance of model. Specifically, we design a visual feature extractor containing a multi-scale layer to get the required multi-scale visual features by expanding the original backbone. Moreover, we exploit pooling the output of the pre-trained Bidirectional Encoder Representations from Transformers (BERT) model to extract sentence-level linguistic features, which can enable the model to capture global linguistic information. Inspired by the Transformer architecture, we present the Visual Attention Layer guided by Language and Multi-Scale Visual Features (VALMS), which is able to better learn the visual context guided by multi-scale visual and linguistic features, and facilitates further multimodal reasoning. Extensive experiments on four large benchmark datasets, including ReferItGame, RefCOCO, RefCOCO<!--> <!-->+ and RefCOCOg, demonstrate that our proposed model achieves the state-of-the-art performance.</p></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"150 ","pages":"Article 105242"},"PeriodicalIF":4.2,"publicationDate":"2024-08-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142097495","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Learning facial structural dependency in 3D aligned space for face alignment
Pub Date : 2024-08-23 DOI: 10.1016/j.imavis.2024.105241
Biying Li , Zhiwei Liu , Jinqiao Wang
The statistical characteristics of facial structure offer pivotal prior information for facial landmark prediction, forming inter-dependencies among different landmarks. Such inter-dependencies ensure that predictions adhere to the shape distribution typical of natural faces. In challenging scenarios such as occlusions or extreme facial poses, this structure becomes indispensable, as it helps to predict elusive landmarks from more discernible ones. While current deep learning methods do capture these landmark dependencies, they often do so implicitly, relying heavily on vast training datasets. We contend that such implicit modeling approaches fail to handle the more challenging situations. In this paper, we propose a new method that harnesses the facial structure and explicitly explores inter-dependencies among facial landmarks in an end-to-end fashion. We propose a Structural Dependency Learning Module (SDLM). It uses 3D face information to map facial features into a canonical UV space, in which the facial structure is explicitly 3D semantically aligned. Besides, to explore the global relationships between facial landmarks, we take advantage of the self-attention mechanism in both the image and UV spaces. We name the proposed method Facial Structure-based Face Alignment (FSFA). FSFA reinforces the landmark structure, especially under challenging conditions. Extensive experiments demonstrate that FSFA achieves state-of-the-art performance on the WFLW, 300W, AFLW, and COFW68 datasets.
{"title":"Learning facial structural dependency in 3D aligned space for face alignment","authors":"Biying Li , Zhiwei Liu , Jinqiao Wang","doi":"10.1016/j.imavis.2024.105241","DOIUrl":"10.1016/j.imavis.2024.105241","url":null,"abstract":"<div><p>Facial structure's statistical characteristics offer pivotal prior information in facial landmark prediction, forming inter-dependencies among different landmarks. Such inter-dependencies ensure that predictions adhere to the shape distribution typical of natural faces. In challenging scenarios like occlusions or extreme facial poses, this structure becomes indispensable, which can help to predict elusive landmarks based on more discernible ones. While current deep learning methods do capture these landmark dependencies, it's often an implicit process heavily reliant on vast training datasets. We contest that such implicit modeling approaches fail to manage more challenging situations. In this paper, we propose a new method that harnesses the facial structure and explicitly explores inter-dependencies among facial landmarks in an end-to-end fashion. We propose a Structural Dependency Learning Module (SDLM). It uses 3D face information to map facial features into a canonical UV space, in which the facial structure is explicitly 3D semantically aligned. Besides, to explore the global relationships between facial landmarks, we take advantage of the self-attention mechanism in the image and UV spaces. We name the proposed method Facial Structure-based Face Alignment (FSFA). FSFA reinforces the landmark structure, especially under challenging conditions. Extensive experiments demonstrate that FSFA achieves state-of-the-art performance on the WFLW, 300W, AFLW, and COFW68 datasets.</p></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"150 ","pages":"Article 105241"},"PeriodicalIF":4.2,"publicationDate":"2024-08-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142083951","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Learning accurate monocular 3D voxel representation via bilateral voxel transformer
Pub Date : 2024-08-23 DOI: 10.1016/j.imavis.2024.105237
Tianheng Cheng , Haoyi Jiang , Shaoyu Chen , Bencheng Liao , Qian Zhang , Wenyu Liu , Xinggang Wang
Vision-based methods for 3D scene perception have been widely explored for autonomous vehicles. However, inferring complete 3D semantic scenes from monocular 2D images is still challenging owing to the 2D-to-3D transformation. Specifically, existing methods that use Inverse Perspective Mapping (IPM) to project image features to dense 3D voxels severely suffer from the ambiguous projection problem. In this research, we present Bilateral Voxel Transformer (BVT), a novel and effective Transformer-based approach for monocular 3D semantic scene completion. BVT exploits a bilateral architecture composed of two branches for preserving the high-resolution 3D voxel representation while aggregating contexts through the proposed Tri-Axial Transformer simultaneously. To alleviate the ill-posed 2D-to-3D transformation, we adopt position-aware voxel queries and dynamically update the voxels with image features through weighted geometry-aware sampling. BVT achieves 11.8 mIoU on the challenging Semantic KITTI dataset, considerably outperforming previous works for semantic scene completion with monocular images. The code and models of BVT will be available on GitHub.
{"title":"Learning accurate monocular 3D voxel representation via bilateral voxel transformer","authors":"Tianheng Cheng , Haoyi Jiang , Shaoyu Chen , Bencheng Liao , Qian Zhang , Wenyu Liu , Xinggang Wang","doi":"10.1016/j.imavis.2024.105237","DOIUrl":"10.1016/j.imavis.2024.105237","url":null,"abstract":"<div><p>Vision-based methods for 3D scene perception have been widely explored for autonomous vehicles. However, inferring complete 3D semantic scenes from monocular 2D images is still challenging owing to the 2D-to-3D transformation. Specifically, existing methods that use Inverse Perspective Mapping (IPM) to project image features to dense 3D voxels severely suffer from the ambiguous projection problem. In this research, we present <strong>Bilateral Voxel Transformer</strong> (BVT), a novel and effective Transformer-based approach for monocular 3D semantic scene completion. BVT exploits a bilateral architecture composed of two branches for preserving the high-resolution 3D voxel representation while aggregating contexts through the proposed Tri-Axial Transformer simultaneously. To alleviate the ill-posed 2D-to-3D transformation, we adopt position-aware voxel queries and dynamically update the voxels with image features through weighted geometry-aware sampling. BVT achieves 11.8 mIoU on the challenging Semantic KITTI dataset, considerably outperforming previous works for semantic scene completion with monocular images. The code and models of BVT will be available on <span><span>GitHub</span><svg><path></path></svg></span>.</p></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"150 ","pages":"Article 105237"},"PeriodicalIF":4.2,"publicationDate":"2024-08-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142077211","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Simultaneous image patch attention and pruning for patch selective transformer
Pub Date : 2024-08-22 DOI: 10.1016/j.imavis.2024.105239
Sunpil Kim , Gang-Joon Yoon , Jinjoo Song , Sang Min Yoon
Vision transformer models provide superior performance compared to convolutional neural networks for various computer vision tasks, but incur increased computational overhead on large datasets. This paper proposes a patch selective vision transformer that effectively selects patches to reduce computational costs while simultaneously extracting global and local self-representative patch information to maintain performance. The inter-patch attention in the transformer encoder emphasizes meaningful features by capturing the inter-patch relationships of features, and dynamic patch pruning is applied to the attentive patches using a learnable soft threshold on the maximum multi-head importance scores. The proposed patch attention and pruning method provides constraints to exploit dominant feature maps in conjunction with self-attention, thus avoiding the propagation of noisy or irrelevant information. The proposed patch-selective transformer also helps to address computer vision problems such as scale variation, background clutter, and partial occlusion, resulting in a lightweight and general-purpose vision transformer suitable for mobile devices.
{"title":"Simultaneous image patch attention and pruning for patch selective transformer","authors":"Sunpil Kim , Gang-Joon Yoon , Jinjoo Song , Sang Min Yoon","doi":"10.1016/j.imavis.2024.105239","DOIUrl":"10.1016/j.imavis.2024.105239","url":null,"abstract":"<div><p>Vision transformer models provide superior performance compared to convolutional neural networks for various computer vision tasks but require increased computational overhead with large datasets. This paper proposes a patch selective vision transformer that effectively selects patches to reduce computational costs while simultaneously extracting global and local self-representative patch information to maintain performance. The inter-patch attention in the transformer encoder emphasizes meaningful features by capturing the inter-patch relationships of features, and dynamic patch pruning is applied to the attentive patches using a learnable soft threshold that measures the maximum multi-head importance scores. The proposed patch attention and pruning method provides constraints to exploit dominant feature maps in conjunction with self-attention, thus avoiding the propagation of noisy or irrelevant information. The proposed patch-selective transformer also helps to address computer vision problems such as scale, background clutter, and partial occlusion, resulting in a lightweight and general-purpose vision transformer suitable for mobile devices.</p></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"150 ","pages":"Article 105239"},"PeriodicalIF":4.2,"publicationDate":"2024-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142083950","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}