
Computer Vision and Image Understanding: Latest Publications

Monocular per-object distance estimation with Masked Object Modeling
IF 4.3 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-02-03 | DOI: 10.1016/j.cviu.2025.104303
Aniello Panariello, Gianluca Mancusi, Fedy Haj Ali, Angelo Porrello, Simone Calderara, Rita Cucchiara
Per-object distance estimation is critical in surveillance and autonomous driving, where safety is paramount. While existing methods rely on geometric or deep supervised features, only a few attempts have been made to leverage self-supervised learning. In this respect, our paper draws inspiration from Masked Image Modeling (MiM) and extends it to multi-object tasks. While MiM focuses on extracting global image-level representations, it struggles with individual objects within the image. This is detrimental to distance estimation, as distant objects occupy negligible portions of the image. Conversely, our strategy, termed Masked Object Modeling (MoM), enables a novel application of masking techniques. In brief, we devise an auxiliary objective that reconstructs the portions of the image pertaining to the objects detected in the scene. Training is performed in a single unified stage, simultaneously optimizing the masking objective and the downstream loss (i.e., distance estimation).
We evaluate the effectiveness of MoM with a novel reference architecture (DistFormer) on the standard KITTI, NuScenes, and MOTSynth datasets. Our evaluation shows that the framework surpasses the state of the art and highlights its strong regularization properties. The MoM strategy enhances both zero-shot and few-shot capabilities from the synthetic to the real domain. Finally, it improves the robustness of the model in the presence of occluded or poorly detected objects.
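As a rough illustration of the single-stage objective described above, the PyTorch sketch below combines per-object distance regression with a masked-reconstruction loss restricted to detected object boxes. The modules `backbone`, `distance_head`, and `decoder` and their interfaces are assumptions for illustration, not the authors' DistFormer implementation.

```python
# Illustrative sketch (not the authors' DistFormer): one training step combining
# per-object distance regression with an auxiliary masked-object reconstruction loss.
import torch
import torch.nn.functional as F

def training_step(backbone, distance_head, decoder, images, boxes, gt_distances,
                  mask_ratio=0.5):
    """images: (B, 3, H, W); boxes: list of (N_i, 4) pixel boxes; gt_distances: list of (N_i,)."""
    feats = backbone(images)                      # (B, C, h, w) feature map
    pred_dist = distance_head(feats, boxes)       # assumed: per-object distance predictions
    loss_dist = sum(F.smooth_l1_loss(p, g) for p, g in zip(pred_dist, gt_distances))

    # Auxiliary masked object modeling: zero out a subset of object boxes,
    # then reconstruct the image and penalize errors only inside those boxes.
    masked = images.clone()
    obj_mask = torch.zeros_like(images[:, :1])    # (B, 1, H, W)
    for b, bxs in enumerate(boxes):
        for x1, y1, x2, y2 in bxs.long():
            if torch.rand(()) < mask_ratio:
                masked[b, :, y1:y2, x1:x2] = 0.0
                obj_mask[b, :, y1:y2, x1:x2] = 1.0
    recon = decoder(backbone(masked))             # (B, 3, H, W) reconstruction
    loss_mom = (F.mse_loss(recon, images, reduction="none") * obj_mask).sum() \
               / obj_mask.sum().clamp(min=1)

    return loss_dist + loss_mom                   # single unified stage, two objectives
```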
Citations: 0
Fake News Detection Based on BERT Multi-domain and Multi-modal Fusion Network
IF 4.3 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-02-01 | DOI: 10.1016/j.cviu.2025.104301
Kai Yu, Shiming Jiao, Zhilong Ma
The pervasive growth of the Internet has simplified communication, making the detection and annotation of fake news on social media increasingly critical. Building on existing studies, this work introduces a BERT Multi-domain and Multi-modal Fusion Network (BMMFN) for fake news detection. The framework uses the BERT model to transform the text content of fake news into textual vectors, while image features are extracted with the VGG-19 model. A multimodal fusion network is developed, factoring in text-image correlations and interactions through joint matrices that enhance the integration of information across modalities. Additionally, a multi-domain classifier is incorporated to align multimodal features from various events within a unified feature space. The performance of this model is confirmed through experiments on Weibo and Twitter datasets, with results indicating that BMMFN surpasses contemporary state-of-the-art models on several metrics, thereby effectively improving fake news detection.
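A minimal sketch of the kind of text-image fusion the abstract describes, using off-the-shelf BERT and VGG-19 backbones. The outer-product "joint matrix" fusion and the classifier head are illustrative assumptions, and the multi-domain classifier is omitted; this is not the paper's exact BMMFN design.

```python
# Sketch under stated assumptions: BERT text vector + VGG-19 image vector,
# fused through an outer-product joint matrix and classified as real/fake.
import torch
import torch.nn as nn
from transformers import BertModel
from torchvision.models import vgg19, VGG19_Weights

class TextImageFusion(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.vgg = vgg19(weights=VGG19_Weights.DEFAULT).features
        self.text_proj = nn.Linear(768, dim)
        self.img_proj = nn.Linear(512, dim)
        self.classifier = nn.Linear(dim * dim, 2)   # real / fake

    def forward(self, input_ids, attention_mask, images):
        t = self.bert(input_ids, attention_mask=attention_mask).pooler_output  # (B, 768)
        v = self.vgg(images).mean(dim=(2, 3))                                  # (B, 512)
        t, v = self.text_proj(t), self.img_proj(v)
        joint = torch.einsum("bi,bj->bij", t, v).flatten(1)  # text-image joint matrix
        return self.classifier(joint)
```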
Citations: 0
Local optimization cropping and boundary enhancement for end-to-end weakly-supervised segmentation network
IF 4.3 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-02-01 | DOI: 10.1016/j.cviu.2024.104260
Weizheng Wang, Chao Zeng, Haonan Wang, Lei Zhou
In recent years, the performance of weakly-supervised semantic segmentation (WSSS) has improved significantly. It usually employs image-level labels to generate Class Activation Maps (CAMs) for producing pseudo-labels, which greatly reduces the cost of annotation. Since CNNs cannot fully identify object regions, researchers found that Vision Transformers (ViT) can complement the deficiencies of CNNs by better extracting global contextual information. However, ViT also introduces the problem of over-smoothing. Great progress has been made in recent years on the over-smoothing problem, yet two issues remain. The first is that the high-confidence regions in the network-generated CAM still contain areas irrelevant to the class. The second is the inaccuracy of CAM boundaries, which include small portions of background regions. The precision of label boundaries is closely tied to segmentation performance. In this work, to address the first issue, we propose a local optimization cropping module (LOC). By randomly cropping selected regions, we contrast the local class tokens with the global class tokens, which enhances consistency between local and global representations. To address the second issue, we design a boundary enhancement module (BE) that uses an erasing strategy during re-training, increasing the network's extraction of boundary information and greatly improving the accuracy of CAM boundaries, thereby enhancing the quality of pseudo-labels. Experiments on the PASCAL VOC dataset show that our proposed LOC-BE Net outperforms multi-stage methods and is competitive with end-to-end methods. On PASCAL VOC, our method achieves a CAM mIoU of 74.2% and a segmentation mIoU of 73.1%. On COCO2014, it achieves a CAM mIoU of 43.8% and a segmentation mIoU of 43.4%. Our code is open-sourced at https://github.com/whn786/LOC-BE/tree/main.
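A hedged sketch of the local-versus-global class-token consistency idea behind LOC: the full image and a random crop pass through the same ViT-style backbone, and the crop's class token is pulled toward the global one. The backbone interface (returning a class token and patch tokens) and the cosine loss are assumptions, not the released LOC-BE code.

```python
# Sketch assuming backbone(x) -> (class_token, patch_tokens) and H, W >= crop_size.
import torch
import torch.nn.functional as F
import torchvision.transforms.functional as TF

def local_global_consistency(backbone, image, crop_size=224):
    # image: (B, 3, H, W)
    B, _, H, W = image.shape
    top = torch.randint(0, H - crop_size + 1, (1,)).item()
    left = torch.randint(0, W - crop_size + 1, (1,)).item()
    crop = TF.resized_crop(image, top, left, crop_size, crop_size, [H, W])
    cls_global, _ = backbone(image)   # global class token
    cls_local, _ = backbone(crop)     # local class token from the cropped view
    # Pull the local token toward the (detached) global token.
    return 1 - F.cosine_similarity(cls_local, cls_global.detach(), dim=-1).mean()
```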
Citations: 0
Guided image filtering-conventional to deep models: A review and evaluation study
IF 4.3 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-02-01 | DOI: 10.1016/j.cviu.2025.104278
Weimin Yuan, Yinuo Wang, Cai Meng, Xiangzhi Bai
In the past decade, guided image filtering (GIF) has emerged as a successful edge-preserving smoothing technique designed to remove noise while retaining important edges and structures in images. By leveraging a well-aligned guidance image as the prior, GIF has become a valuable tool in various visual applications, offering a balance between edge preservation and computational efficiency. Despite the significant advancements and the development of numerous GIF variants, there has been limited effort to systematically review and evaluate the diverse methods within this research community. To address this gap, this paper offers a comprehensive survey of existing GIF variants, covering both conventional and deep learning-based models. Specifically, we begin by introducing the basic formulation of GIF and its fast implementations. Next, we categorize the GIF follow-up methods into three main categories: local methods, global methods and deep learning-based methods. Within each category, we provide a new sub-taxonomy to better illustrate the motivations behind their design, as well as their contributions and limitations. We then conduct experiments to compare the performance of representative methods, with an analysis of qualitative and quantitative results that reveals several insights into the current state of this research area. Finally, we discuss unresolved issues in the field of GIF and highlight some open problems for further research.
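For reference, the basic gray-scale guided filter that this survey builds on (He et al.) can be written in a few lines of NumPy/OpenCV: each output pixel is an affine function of the guidance image, q = a*I + b, with a and b estimated from local box statistics. The window radius r and regularizer eps below are the usual knobs; this is a compact reference sketch, not any specific variant reviewed in the paper.

```python
import cv2
import numpy as np

def guided_filter(I, p, r=8, eps=1e-3):
    """I: guidance image, p: input to be filtered; both float32 in [0, 1], same shape."""
    ksize = (2 * r + 1, 2 * r + 1)
    mean = lambda x: cv2.boxFilter(x, -1, ksize)   # normalized box (mean) filter
    mean_I, mean_p = mean(I), mean(p)
    cov_Ip = mean(I * p) - mean_I * mean_p
    var_I = mean(I * I) - mean_I * mean_I
    a = cov_Ip / (var_I + eps)      # edge-aware gain: small in flat areas, ~1 near edges
    b = mean_p - a * mean_I         # offset
    return mean(a) * I + mean(b)    # q_i = mean(a)_i * I_i + mean(b)_i
```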
Citations: 0
Incorporating degradation estimation in light field spatial super-resolution
IF 4.3 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-02-01 | DOI: 10.1016/j.cviu.2025.104295
Zeyu Xiao, Zhiwei Xiong
Recent advancements in light field super-resolution (SR) have yielded impressive results. In practice, however, many existing methods are limited by assuming fixed degradation models, such as bicubic downsampling, which hinders their robustness in real-world scenarios with complex degradations. To address this limitation, we present LF-DEST, an effective blind Light Field SR method that incorporates explicit Degradation Estimation to handle various degradation types. LF-DEST consists of two primary components: degradation estimation and light field restoration. The former concurrently estimates blur kernels and noise maps from low-resolution degraded light fields, while the latter generates super-resolved light fields based on the estimated degradations. Notably, we introduce a modulated and selective fusion module that intelligently combines degradation representations with image information, effectively handling diverse degradation types. We conduct extensive experiments on benchmark datasets, demonstrating that LF-DEST achieves superior performance across various degradation scenarios in light field SR. The implementation code is available at https://github.com/zeyuxiao1997/LF-DEST.
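A small sketch of the "modulate then selectively fuse" idea: a degradation embedding (e.g., derived from the estimated blur kernel and noise map) produces per-channel scale and shift for the image features, and a learned gate selects between the original and modulated features. Module names and shapes are assumptions for illustration, not LF-DEST's released code.

```python
import torch
import torch.nn as nn

class ModulatedSelectiveFusion(nn.Module):
    def __init__(self, channels, deg_dim):
        super().__init__()
        self.to_scale_shift = nn.Linear(deg_dim, 2 * channels)
        self.gate = nn.Sequential(nn.Conv2d(2 * channels, channels, 1), nn.Sigmoid())

    def forward(self, feat, deg_emb):
        # feat: (B, C, H, W) image features; deg_emb: (B, deg_dim) degradation representation
        scale, shift = self.to_scale_shift(deg_emb).chunk(2, dim=1)
        mod = feat * (1 + scale[..., None, None]) + shift[..., None, None]  # modulation
        g = self.gate(torch.cat([feat, mod], dim=1))   # per-pixel selection weights
        return g * mod + (1 - g) * feat                # selective fusion
```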
Citations: 0
DA2: Distribution-agnostic adaptive feature adaptation for one-class classification
IF 4.3 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-02-01 | DOI: 10.1016/j.cviu.2024.104256
Zilong Zhang, Zhibin Zhao, Xingwu Zhang, Xuefeng Chen
One-class classification (OCC), i.e., identifying whether an example belongs to the same distribution as the training data, is essential for deploying machine learning models in the real world. Adapting pre-trained features on the target dataset has proven to be a promising paradigm for improving OCC performance. However, existing methods are constrained by assumptions about the training distribution, which contradicts real scenarios where the data distribution is unknown. In this work, we propose a simple distribution-agnostic adaptive feature adaptation method (DA2). The core idea is to adaptively cluster the features of every class more tightly depending on the properties of the data. We rely on the prior that the augmentation distributions of intra-class samples overlap, and then align the features of different augmentations of every sample with a non-contrastive method. We find that training a randomly initialized predictor degrades the pre-trained backbone in the non-contrastive method. To tackle this problem, we design a learnable symmetric predictor and initialize it based on the eigenspace alignment theory. Benchmarks and the proposed challenging near-distribution experiments substantiate the capability of our method across various data distributions. Furthermore, we find that utilizing DA2 can immensely mitigate the long-standing catastrophic forgetting in feature adaptation for OCC. Code will be released upon acceptance.
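A rough sketch of a non-contrastive alignment step of the kind described above: two augmentations of a sample are encoded, each branch passes through a predictor, and cosine distance is minimized against the stop-gradient of the other branch (SimSiam-style). The paper's eigenspace-based symmetric predictor initialization is not reproduced here.

```python
import torch.nn.functional as F

def align_loss(encoder, predictor, view1, view2):
    # view1, view2: two augmentations of the same batch of samples
    z1, z2 = encoder(view1), encoder(view2)
    p1, p2 = predictor(z1), predictor(z2)
    # Symmetric non-contrastive loss with stop-gradient on the target branch.
    return -(F.cosine_similarity(p1, z2.detach(), dim=-1).mean()
             + F.cosine_similarity(p2, z1.detach(), dim=-1).mean()) / 2
```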
Citations: 0
As-Global-As-Possible stereo matching with Sparse Depth Measurement Fusion
IF 4.3 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-02-01 | DOI: 10.1016/j.cviu.2024.104268
Peng Yao, Haiwei Sang
As-Global-As-Possible (AGAP) and Sparse Depth Measurement Fusion (SDMF) have recently emerged as effective approaches to stereo matching. AGAP addresses the congenital shortcomings of Semi-Global Matching (SGM) in terms of streaking effects, while SDMF leverages active depth sensors to boost disparity computation. In this paper, these two methods are combined to attain superior disparity estimation. Random sparse depth measurements are first fused via Diffusion-Based Fusion to update AGAP's matching costs. Then, Neighborhood-Based Fusion refines the cost further, leveraging the previous results. Finally, a segment-based disparity refinement strategy handles outliers and mismatched pixels to produce the final disparity results. Performance evaluations on various stereo datasets demonstrate that the proposed algorithm not only surpasses other challenging stereo matching algorithms but also achieves near real-time efficiency. It is worth pointing out that our proposal outperforms most deep-learning-based stereo matching algorithms on the Middlebury v.3 online evaluation system despite not using any learning-based techniques, further validating its superiority and practicality.
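As a toy illustration of fusing sparse depth measurements into stereo matching costs, the snippet below raises the cost of disparities far from each measured value with a Gaussian penalty; the specific weighting rule is an assumption for illustration, not the paper's Diffusion-Based or Neighborhood-Based Fusion.

```python
import numpy as np

def fuse_sparse_disparity(cost_volume, sparse_disp, sigma=2.0, strength=0.5):
    """cost_volume: (H, W, D); sparse_disp: (H, W), NaN where no measurement exists."""
    H, W, D = cost_volume.shape
    d = np.arange(D, dtype=np.float32)
    fused = cost_volume.copy()
    ys, xs = np.nonzero(~np.isnan(sparse_disp))
    for y, x in zip(ys, xs):
        # Penalty grows as a disparity candidate moves away from the measured value,
        # so subsequent aggregation is pulled toward the measurement.
        penalty = 1.0 - np.exp(-((d - sparse_disp[y, x]) ** 2) / (2 * sigma ** 2))
        fused[y, x] += strength * penalty
    return fused
```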
Citations: 0
Spatio-Temporal Dynamic Interlaced Network for 3D human pose estimation in video
IF 4.3 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-02-01 | DOI: 10.1016/j.cviu.2024.104258
Feiyi Xu, Jifan Wang, Ying Sun, Jin Qi, Zhenjiang Dong, Yanfei Sun
Recent transformer-based methods have achieved excellent performance in 3D human pose estimation. The distinguishing characteristic of the transformer lies in its equitable treatment of each token, encoding them independently. When applied to the human skeleton, the transformer regards each joint as an equally significant token. This can blur the extraction of connection relationships between joints, thus affecting the accuracy of relationship information. In addition, the transformer treats each frame of a temporal sequence equally. This design can introduce substantial redundant information in short frame sequences with frequent action changes, which negatively impacts the learning of temporal correlations. To alleviate these issues, we propose an end-to-end framework, the Spatio-Temporal Dynamic Interlaced Network (S-TDINet), comprising a dynamic spatial GCN encoder (DSGCE) and an interlaced temporal transformer encoder (ITTE). In the DSGCE module, we design three adaptive adjacency matrices to model spatial correlation from static and dynamic perspectives. In the ITTE module, we introduce a global-local interlaced mechanism to mitigate potential interference from redundant information in fast-motion scenarios, thereby achieving more accurate temporal correlation modeling. Finally, we conduct extensive experiments and validate the effectiveness of our approach on two widely recognized benchmark datasets: Human3.6M and MPI-INF-3DHP.
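A sketch of a skeleton graph-convolution layer that mixes a fixed bone-connectivity adjacency with a learnable one and a data-dependent one, in the spirit of the static-plus-dynamic adjacency design described above; shapes and names are assumptions, not the DSGCE module itself.

```python
import torch
import torch.nn as nn

class AdaptiveGCNLayer(nn.Module):
    def __init__(self, in_dim, out_dim, num_joints, skeleton_adj):
        super().__init__()
        self.register_buffer("A_static", skeleton_adj)                  # (J, J) fixed bone graph
        self.A_learned = nn.Parameter(torch.zeros(num_joints, num_joints))  # shared learnable graph
        self.q = nn.Linear(in_dim, out_dim)
        self.k = nn.Linear(in_dim, out_dim)
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, x):
        # x: (B, J, C) per-joint features for one frame
        A_dyn = torch.softmax(self.q(x) @ self.k(x).transpose(1, 2), dim=-1)  # (B, J, J) sample-specific graph
        A = self.A_static + self.A_learned + A_dyn
        return A @ self.proj(x)   # aggregate joint features over the combined graph
```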
Citations: 0
Gaussian Splatting with NeRF-based color and opacity
IF 4.3 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-02-01 | DOI: 10.1016/j.cviu.2024.104273
Dawid Malarz, Weronika Smolak-Dyżewska, Jacek Tabor, Sławomir Tadeja, Przemysław Spurek
Neural Radiance Fields (NeRFs) have demonstrated the remarkable potential of neural networks to capture the intricacies of 3D objects. NeRFs excel at producing strikingly sharp novel views of 3D objects by encoding shape and color information within neural network weights. Recently, numerous generalizations of NeRFs utilizing generative models have emerged, expanding their versatility. In contrast, Gaussian Splatting (GS) offers similar render quality with faster training and inference, as it does not require neural networks. It encodes information about 3D objects in a set of Gaussian distributions that can be rendered in 3D similarly to classical meshes. Unfortunately, GS is difficult to condition since its representation is fully explicit. To mitigate the caveats of both models, we propose a hybrid model, Viewing Direction Gaussian Splatting (VDGS), that uses a GS representation of the 3D object's shape and a NeRF-based encoding of opacity. Our model uses Gaussian distributions with trainable positions (i.e., Gaussian means), shape (i.e., Gaussian covariance), and opacity, together with a neural network that takes the Gaussian parameters and viewing direction to produce changes in the said opacity. As a result, our model better describes shadows, light reflections, and the transparency of 3D objects without adding additional texture and light components.
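A minimal sketch of the hybrid representation: each Gaussian keeps an explicit base opacity, and a small MLP conditioned on the Gaussian's parameters and the viewing direction predicts a correction applied at render time. The MLP layout and how the correction is applied are assumptions for illustration, not the VDGS implementation.

```python
import torch
import torch.nn as nn

class ViewDependentOpacity(nn.Module):
    def __init__(self, gauss_dim=14, hidden=64):
        # gauss_dim: flattened per-Gaussian parameters (assumed: mean 3 + scale 3 + rotation 4 + color 3 + opacity 1)
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(gauss_dim + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Tanh())

    def forward(self, gauss_params, view_dir, base_opacity):
        # gauss_params: (N, gauss_dim); view_dir: (3,) normalized camera direction; base_opacity: (N,)
        v = view_dir.expand(gauss_params.shape[0], 3)
        delta = self.mlp(torch.cat([gauss_params, v], dim=-1)).squeeze(-1)  # view-dependent change
        return (base_opacity + delta).clamp(0.0, 1.0)   # opacity actually used for splatting
```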
Citations: 0
Multi-domain conditional prior network for water-related optical image enhancement
IF 4.3 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-02-01 | DOI: 10.1016/j.cviu.2024.104251
Tianyu Wei, Dehuan Zhang, Zongxin He, Rui Zhou, Xiangfu Meng
Water-related optical image enhancement improves the perception of information for human and machine vision, facilitating the development and utilization of marine resources. Due to the absorption and scattering of light in different water media, water-related optical images typically suffer from color distortion and low contrast. However, existing enhancement methods struggle to accurately simulate the imaging process in real underwater environments. To model and invert the degradation process of water-related optical images, we propose a Multi-domain Conditional Prior Network (MCPN), based on a color prior vector and a spectrum prior vector, for enhancing water-related optical images. MCPN captures color, luminance, and structural priors across different feature spaces, resulting in a lightweight architecture that enhances water-related optical images while preserving critical information fidelity. Specifically, MCPN includes a modulated network and a conditional network comprising two conditional units. The modulated network is a lightweight Convolutional Neural Network responsible for image reconstruction and local feature refinement. To avoid feature loss from repeated extraction, the Gaussian Conditional Unit (GCU) extracts atmospheric light and color-shift information from the input image to form color prior vectors. Simultaneously, incorporating the Fast Fourier Transform, the Spectrum Conditional Unit (SCU) extracts scene brightness and structure to form spectrum prior vectors. These prior vectors are embedded into the modulated network to guide image reconstruction. MCPN utilizes a PAL-based weighted Selective Supervision (PSS) strategy, selectively adjusting learning weights for images with excessive artificial noise. Experimental results demonstrate that MCPN outperforms existing methods, achieving excellent performance on the UIEB dataset. The PSS strategy also shows fine feature matching in downstream applications.
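A hedged sketch of extracting a spectrum prior vector with the FFT, as a global descriptor of scene brightness and structure: the log-amplitude spectrum is pooled and projected to a conditioning vector. The pooling and projection choices are assumptions for illustration, not MCPN's exact Spectrum Conditional Unit.

```python
import torch
import torch.nn as nn

class SpectrumPriorVector(nn.Module):
    def __init__(self, pooled=16, dim=64):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(pooled)
        self.proj = nn.Linear(3 * pooled * pooled, dim)

    def forward(self, img):
        # img: (B, 3, H, W) in [0, 1]
        amp = torch.fft.fft2(img).abs()              # amplitude spectrum per channel
        amp = torch.log1p(torch.fft.fftshift(amp))   # compress dynamic range, center the DC term
        return self.proj(self.pool(amp).flatten(1))  # (B, dim) spectrum prior vector
```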
Citations: 0