Nonlocal low-rank (LR) modeling has proven effective for image compressive sensing (CS) reconstruction: similar patches are first clustered into nonlocal image groups using the nonlocal self-similarity (NSS) prior, and an LR penalty is then imposed on each group. However, most existing methods approximate the LR matrix directly from the degraded nonlocal image group, which may yield a suboptimal LR approximation and thus unsatisfactory reconstruction results. In this paper, we propose a novel nonlocal low-rank residual (NLRR) approach for image CS reconstruction, which progressively approximates the underlying LR matrix by minimizing the LR residual. To do this, we first use the NSS prior to obtain a good estimate of the original nonlocal image group, and then minimize the LR residual between the degraded nonlocal image group and the estimated one to derive a more accurate LR matrix. To keep the optimization feasible and reliable, we employ the alternating direction method of multipliers (ADMM) to solve the NLRR-based image CS reconstruction problem. Our experimental results show that the proposed NLRR algorithm outperforms many popular or state-of-the-art image CS reconstruction methods in both objective metrics and subjective perceptual quality.
Title: Image compressive sensing reconstruction via nonlocal low-rank residual-based ADMM framework. Authors: Junhao Zhang, Kim-Hui Yap, Lap-Pui Chau, Ce Zhu. Journal: Computer Vision and Image Understanding, vol. 249, Article 104204. Pub Date: 2024-10-28, DOI: 10.1016/j.cviu.2024.104204
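To make the NLRR abstract above more concrete, here is a minimal NumPy sketch of the two ideas it names: a low-rank penalty applied to a patch group, and a residual variant that pulls the degraded group toward a reference estimate. The function names, the thresholds, and the specific residual rule (soft-thresholding the gap between singular values) are assumptions for illustration; the abstract does not give the paper's exact formulation or its ADMM updates.

```python
import numpy as np


def svt(group, tau):
    """Singular value soft-thresholding, the usual proximal step for a nuclear-norm penalty."""
    u, s, vt = np.linalg.svd(group, full_matrices=False)
    return (u * np.maximum(s - tau, 0.0)) @ vt


def residual_shrink(degraded, estimate, tau):
    """Hypothetical residual step: soft-threshold the gap between the singular
    values of the degraded group and those of a reference estimate."""
    u, s, vt = np.linalg.svd(degraded, full_matrices=False)
    s_ref = np.linalg.svd(estimate, compute_uv=False)
    gap = s - s_ref
    s_new = s_ref + np.sign(gap) * np.maximum(np.abs(gap) - tau, 0.0)
    return (u * np.maximum(s_new, 0.0)) @ vt


rng = np.random.default_rng(0)
# Toy "patch group": 8 similar patches of 16 pixels each, stacked as columns (rank 1).
clean = np.outer(rng.normal(size=16), rng.normal(size=8))
degraded = clean + 0.1 * rng.normal(size=clean.shape)
estimate = svt(degraded, tau=0.5)      # stand-in for an NSS-based estimate (assumption)
refined = residual_shrink(degraded, estimate, tau=0.05)
```

With `tau = 0` the residual step returns the degraded group unchanged, and with a very large `tau` its singular values collapse onto those of the estimate, so `tau` controls how strongly the reference is trusted in this sketch.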
Pub Date: 2024-10-25, DOI: 10.1016/j.cviu.2024.104214
Zeyu Cai, Ru Hong, Xun Lin, Jiming Yang, YouLiang Ni, Zhen Liu, Chengqian Jin, Feipeng Da
The Coded Aperture Snapshot Spectral Imaging (CASSI) system offers significant advantages over traditional measurement methods in dynamically acquiring hyperspectral images. However, it faces the following challenges: (1) Traditional masks rely on random patterns or analytical design, limiting CASSI’s performance improvement. (2) Existing CASSI reconstruction algorithms do not fully utilize RGB information. (3) High-quality reconstruction algorithms are often slow and limited to offline scene reconstruction. To address these issues, this paper proposes a new MLP architecture, Spectral–Spatial MLP (SSMLP), which replaces the transformer structure with a network using CASSI measurements and RGB as multimodal inputs. This maintains reconstruction quality while significantly improving reconstruction speed. Additionally, we construct a teacher-student network (SSMLP with a teacher, SSMLP-WT) to transfer the knowledge learned by a large model to a smaller network, further enhancing the smaller network’s accuracy. Extensive experiments show that SSMLP matches the performance of transformer-based structures in spectral image reconstruction while improving inference speed by at least 50%. Knowledge transfer further improves the reconstruction quality of SSMLP-WT without changing the network: the teacher boosts performance by 0.92 dB (44.73 dB vs. 43.81 dB).
Title: A MLP architecture fusing RGB and CASSI for computational spectral imaging. Journal: Computer Vision and Image Understanding, vol. 249, Article 104214.
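The teacher-student transfer and the dB figures in the SSMLP abstract above can be illustrated with a generic sketch. The loss below is a common distillation pattern (a supervised term plus a teacher-matching term); the L1 form, the weight `alpha`, and the function names are assumptions, since the abstract does not state how SSMLP-WT transfers knowledge. The `psnr` helper shows the standard definition behind figures reported in dB.

```python
import numpy as np


def distillation_loss(student_out, teacher_out, target, alpha=0.5):
    """Weighted sum of a supervised L1 term and a teacher-matching L1 term.
    alpha is a hypothetical trade-off weight, not a value from the paper."""
    supervised = np.mean(np.abs(student_out - target))
    mimic = np.mean(np.abs(student_out - teacher_out))
    return (1.0 - alpha) * supervised + alpha * mimic


def psnr(reference, estimate, peak=1.0):
    """Peak signal-to-noise ratio in dB for signals with the given peak value."""
    mse = np.mean((reference - estimate) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)


# Reported gain from the abstract: 44.73 dB with the teacher vs. 43.81 dB without.
gain_db = round(44.73 - 43.81, 2)
```

Because PSNR is logarithmic, a gain of 0.92 dB corresponds to the mean squared error shrinking by a factor of about 10^(0.092), roughly 1.24.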
Pub Date: 2024-10-22, DOI: 10.1016/j.cviu.2024.104213
Xuezhi Xiang, Xiaoheng Li, Xuzhao Liu, Yulong Qiao, Abdulmotaleb El Saddik
Graph Convolution Networks (GCNs) have been widely used in skeleton-based action recognition. Although significant progress has been made, an inherent limitation remains: the restricted receptive field of GCNs hinders their ability to extract global dependencies effectively, even though structurally separated joints can also be strongly correlated. Previous works rarely explore both local and global correlations of joints, and therefore model the complex dynamics of skeleton sequences insufficiently. To address this issue, we propose a GCN and Transformer complementary network (GTC-Net) that allows parallel communication between the GCN and Transformer domains. Specifically, we introduce a graph convolution and self-attention combined module (GAM), which leverages the complementarity of GCN and self-attention to perceive local and global dependencies among human body joints. Furthermore, to address the problems of long-term sequence ordering and position detection, we design a position-aware module (PAM), which explicitly captures the ordering information and unique identity information of body joints in a skeleton sequence. Extensive experiments on the NTU RGB+D 60 and NTU RGB+D 120 datasets are conducted to evaluate the proposed method. The results demonstrate that our method achieves competitive results on both datasets.
Title: A GCN and Transformer complementary network for skeleton-based action recognition. Journal: Computer Vision and Image Understanding, vol. 249, Article 104213.
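The local-versus-global argument in the GTC-Net abstract above can be sketched with two parallel branches over joint features: a graph convolution that only mixes bone-connected joints, and a self-attention layer in which every joint can attend to every other. This is a generic illustration, not the paper's GAM; the toy chain skeleton, the random weights, and the sum fusion are all assumptions.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def graph_conv(x, adjacency, weight):
    """Local branch: each joint aggregates features from bone-connected neighbors."""
    a_hat = adjacency + np.eye(adjacency.shape[0])      # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    a_norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return a_norm @ x @ weight


def self_attention(x, w_q, w_k, w_v):
    """Global branch: every joint attends to every other joint."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return scores @ v, scores


rng = np.random.default_rng(0)
num_joints, dim = 5, 4
# Toy chain skeleton 0-1-2-3-4: joints 0 and 4 are structurally far apart.
adjacency = np.zeros((num_joints, num_joints))
for i in range(num_joints - 1):
    adjacency[i, i + 1] = adjacency[i + 1, i] = 1.0

x = rng.normal(size=(num_joints, dim))
w_g, w_q, w_k, w_v = (rng.normal(size=(dim, dim)) for _ in range(4))
local = graph_conv(x, adjacency, w_g)
global_out, attn = self_attention(x, w_q, w_k, w_v)
fused = local + global_out   # simple sum fusion (an assumption, not the paper's GAM)
```

In this toy setup a single graph convolution gives joint 0 no contribution from joint 4, while the attention weight `attn[0, 4]` is strictly positive, which is the kind of complementarity the abstract describes.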
Pub Date: 2024-10-19, DOI: 10.1016/j.cviu.2024.104210
Florinel-Alin Croitoru, Vlad Hondru, Radu Tudor Ionescu, Mubarak Shah
Text-to-image diffusion models have recently attracted the interest of many researchers, and inverting the diffusion process can play an important role in better understanding the generative process and how to engineer prompts in order to obtain the desired images. To this end, we study the task of predicting the prompt embedding given an image generated by a generative diffusion model. We consider a series of white-box and black-box models (with and without access to the weights of the diffusion network) to deal with the proposed task. We propose a novel learning framework comprising a joint prompt regression and multi-label vocabulary classification objective that generates improved prompts. To further improve our method, we employ a curriculum learning procedure that promotes the learning of image-prompt pairs with lower labeling noise (i.e. that are better aligned). We conduct experiments on the DiffusionDB data set, predicting text prompts from images generated by Stable Diffusion. In addition, we make an interesting discovery: training a diffusion model on the prompt generation task can make the model generate images that are much better aligned with the input prompts, when the model is directly reused for text-to-image generation. Our code is publicly available for download at https://github.com/CroitoruAlin/Reverse-Stable-Diffusion.
Title: Reverse Stable Diffusion: What prompt was used to generate this image? Journal: Computer Vision and Image Understanding, vol. 249, Article 104210.
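The joint objective mentioned in the Reverse Stable Diffusion abstract above combines a regression term on the prompt embedding with a multi-label classification term over a vocabulary. A minimal sketch of such a combined loss follows; the cosine-distance regression term, the binary cross-entropy form, and the `weight` parameter are assumptions, and the authors' repository is the reference for their actual losses.

```python
import numpy as np


def joint_loss(pred_embedding, true_embedding, vocab_logits, vocab_labels, weight=1.0):
    """Embedding regression (1 - cosine similarity) plus multi-label BCE on vocabulary logits.
    Both terms and the weight are illustrative assumptions."""
    dot = np.sum(pred_embedding * true_embedding, axis=-1)
    norms = np.linalg.norm(pred_embedding, axis=-1) * np.linalg.norm(true_embedding, axis=-1)
    regression = np.mean(1.0 - dot / norms)
    # Numerically stable binary cross-entropy with logits, averaged over words and samples.
    bce = np.mean(
        np.maximum(vocab_logits, 0.0)
        - vocab_logits * vocab_labels
        + np.log1p(np.exp(-np.abs(vocab_logits)))
    )
    return regression + weight * bce


rng = np.random.default_rng(0)
embedding = rng.normal(size=(2, 8))                      # toy prompt embeddings
labels = (rng.random(size=(2, 10)) > 0.5).astype(float)  # toy "word present" indicators
uninformed = joint_loss(embedding, embedding, np.zeros((2, 10)), labels)
confident = joint_loss(embedding, embedding, 20.0 * (2.0 * labels - 1.0), labels)
```

With a perfect embedding and all-zero logits the loss reduces to the BCE of a coin flip, log 2; confident and correct logits drive it toward zero.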
Recently, with the development and widespread application of deep neural networks (DNNs), backdoor attacks have posed new security threats to the training process of DNNs. Backdoor attacks undermine the security and trustworthiness of DNNs by implanting hidden, unauthorized triggers, so that the model behaves benignly on clean samples but maliciously on samples containing backdoor triggers. Existing backdoor attacks typically employ triggers that are sample-agnostic and identical for each sample, resulting in poisoned images that lack naturalness and are ineffective against existing backdoor defenses. To address these issues, this paper proposes a novel stealthy backdoor attack, where the backdoor trigger is dynamic and specific to each sample. Specifically, we leverage spatial attention on images and pre-trained models to obtain dynamic triggers, which are then injected using an encoder–decoder network. The design of the injection network benefits from recent advances in steganography research. To demonstrate the effectiveness of the proposed steganographic network, we design two backdoor attack modes named ASBA and ATBA, where ASBA utilizes the steganographic network for the attack, while ATBA is a backdoor attack without steganography. We then conduct attacks on DNNs using four standard datasets. Our extensive experiments show that ASBA surpasses ATBA in terms of stealthiness and resilience against current defensive measures. Furthermore, both ASBA and ATBA demonstrate superior attack efficiency.
Title: Invisible backdoor attack with attention and steganography. Authors: Wenmin Chen, Xiaowei Xu, Xiaodong Wang, Huasong Zhou, Zewen Li, Yangming Chen. Journal: Computer Vision and Image Understanding, vol. 249, Article 104208. Pub Date: 2024-10-19, DOI: 10.1016/j.cviu.2024.104208
Pub Date: 2024-10-18, DOI: 10.1016/j.cviu.2024.104206
Hannah Schieber, Fabian Deuser, Bernhard Egger, Norbert Oswald, Daniel Roth
Novel view synthesis using neural radiance fields (NeRF) is the state-of-the-art technique for generating high-quality images from novel viewpoints. Existing methods require a priori knowledge of extrinsic and intrinsic camera parameters. This limits their applicability to synthetic scenes, or to real-world scenarios that include a preprocessing step. Current research on the joint optimization of camera parameters and NeRF focuses on refining noisy extrinsic camera parameters and often relies on preprocessing to obtain the intrinsic camera parameters. Other approaches are limited to a single set of camera intrinsics. To address these limitations, we propose a novel end-to-end trainable approach called NeRFtrinsic Four. We utilize Gaussian Fourier features to estimate extrinsic camera parameters and dynamically predict varying intrinsic camera parameters through the supervision of the projection error. Our approach outperforms existing joint optimization methods on LLFF and BLEFF. In addition to these existing datasets, we introduce a new dataset called iFF with varying intrinsic camera parameters. NeRFtrinsic Four is a step forward in jointly optimized NeRF-based view synthesis and enables more realistic and flexible rendering in real-world scenarios with varying camera parameters.
Title: NeRFtrinsic Four: An end-to-end trainable NeRF jointly optimizing diverse intrinsic and extrinsic camera parameters. Journal: Computer Vision and Image Understanding, vol. 249, Article 104206.
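The Gaussian Fourier features named in the NeRFtrinsic Four abstract above are a standard input encoding: an input v is projected through a random Gaussian matrix B and mapped to [cos(2πBv), sin(2πBv)]. The sketch below shows only that mapping. What the paper actually feeds into it, and the values of `sigma` and `num_features`, are assumptions here; a per-camera scalar index is used purely as a placeholder input.

```python
import numpy as np


def gaussian_fourier_features(v, b):
    """Map inputs v of shape (n, d) to [cos(2*pi*v@b.T), sin(2*pi*v@b.T)] of shape (n, 2m)."""
    proj = 2.0 * np.pi * v @ b.T
    return np.concatenate([np.cos(proj), np.sin(proj)], axis=-1)


rng = np.random.default_rng(0)
sigma, num_features = 10.0, 64                            # hypothetical bandwidth and width
b = rng.normal(scale=sigma, size=(num_features, 1))       # fixed random projection matrix
camera_index = np.linspace(0.0, 1.0, 8)[:, None]          # placeholder input, one scalar per camera
features = gaussian_fourier_features(camera_index, b)     # shape (8, 128)
```

A larger `sigma` spreads the random frequencies more widely, which lets a downstream MLP fit higher-frequency variation at the cost of smoothness; the resulting features would typically be passed to a small network that regresses the pose parameters.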
Pub Date: 2024-10-17, DOI: 10.1016/j.cviu.2024.104194
Nianchang Huang, Yang Yang, Qiang Zhang, Jungong Han, Jin Huang
Recently, Transformer-based RGB-D salient object detection (SOD) models have pushed performance to a new level. However, they consume abundant resources, including memory and power, which hinders their real-life applications. To remedy this situation, this paper presents a novel lightweight cross-modal Transformer (LCT) for RGB-D SOD. Specifically, LCT first reduces its parameters and computational costs by employing a middle-level feature fusion structure and taking a lightweight Transformer as the backbone. Then, with the aid of Transformers, it compensates for the resulting performance degradation by effectively capturing the cross-modal and cross-level complementary information from the multi-modal input images. To this end, a cross-modal enhancement and fusion module (CEFM) with a lightweight channel-wise cross attention block (LCCAB) is designed to capture the cross-modal complementary information effectively but at lower cost. A bi-directional multi-level feature interaction module (Bi-MFIM) with a lightweight spatial-wise cross attention block (LSCAB) is designed to capture the cross-level complementary context information. By virtue of CEFM and Bi-MFIM, the performance degradation caused by parameter reduction is well compensated. As a result, our proposed model has only 2.8M parameters with 7.6G FLOPs and runs at 66 FPS. Furthermore, experimental results on several benchmark datasets show that our proposed model achieves competitive or even better results than other models. Our code will be released at https://github.com/nexiakele/lightweight-cross-modal-Transformer-LCT-for-RGB-D-SOD.
Title: Lightweight cross-modal transformer for RGB-D salient object detection. Journal: Computer Vision and Image Understanding, vol. 249, Article 104194.
Pub Date: 2024-10-16, DOI: 10.1016/j.cviu.2024.104197
Yu Liu, Jianghao Li, Yanyi Zhang, Qi Jia, Weimin Wang, Nan Pu, Nicu Sebe
Compositional zero-shot learning (CZSL) aims to model compositions of two primitives (i.e., attributes and objects) to classify unseen attribute-object pairs. Most studies are devoted to integrating disentanglement and entanglement strategies to circumvent the trade-off between contextuality and generalizability. Indeed, the two strategies can mutually benefit when used together. Nevertheless, existing studies neglect the significance of developing mutual guidance between the two strategies. In this work, we take full advantage of guidance from disentanglement to entanglement and vice versa. Additionally, we propose exploring multi-scale feature learning to achieve fine-grained mutual guidance in a progressive framework. Our approach, termed Progressive Mutual Guidance Network (PMGNet), unifies disentanglement–entanglement representation learning, allowing the two strategies to learn from and teach each other progressively in one unified model. Furthermore, to alleviate overfitting to seen pairs, we adopt a relaxed cross-entropy loss to train PMGNet, without increasing time or memory cost. Extensive experiments on three benchmarks demonstrate that our method achieves distinct improvements, reaching state-of-the-art performance. Moreover, PMGNet exhibits promising performance under the most challenging open-world CZSL setting, especially for unseen pairs.
Title: PMGNet: Disentanglement and entanglement benefit mutually for compositional zero-shot learning. Journal: Computer Vision and Image Understanding, vol. 249, Article 104197.
Pub Date: 2024-10-16 DOI: 10.1016/j.cviu.2024.104188
Maria De Marsico, Giordano Dionisi, Donato Francesco Pio Stanco
This work deals with the delicate task of lie detection from facial dynamics. The proposed Face Truth Machine (FTM) is an intelligent system able to support a human operator without any special equipment. It can be embedded in the present infrastructures for forensic investigation or whenever it is required to assess the trustworthiness of responses during an interview. Due to its flexibility and its non-invasiveness, it can overcome some limitations of present solutions. Of course, privacy issues may arise from the use of such systems, as often underlined nowadays. However, it is up to the utilizer to take these into account and make fair use of tools of this kind. The paper will discuss particular aspects of the dynamic analysis of face landmarks to detect lies. In particular, it will delve into the behavior of the features used for detection and how these influence the system’s final decision. The novel detection system underlying the Face Truth Machine is able to analyze the subject’s expressions in a wide range of poses. The results of the experiments presented testify to the potential of the proposed approach and also highlight the very good results obtained in cross-dataset testing, which usually represents a challenge for other approaches.
{"title":"FTM: The Face Truth Machine—Hand-crafted features from micro-expressions to support lie detection","authors":"Maria De Marsico, Giordano Dionisi, Donato Francesco Pio Stanco","doi":"10.1016/j.cviu.2024.104188","DOIUrl":"10.1016/j.cviu.2024.104188","url":null,"abstract":"<div><div>This work deals with the delicate task of lie detection from facial dynamics. The proposed Face Truth Machine (FTM) is an intelligent system able to support a human operator without any special equipment. It can be embedded in the present infrastructures for forensic investigation or whenever it is required to assess the trustworthiness of responses during an interview. Due to its flexibility and its non-invasiveness, it can overcome some limitations of present solutions. Of course, privacy issues may arise from the use of such systems, as often underlined nowadays. However, it is up to the utilizer to take these into account and make fair use of tools of this kind. The paper will discuss particular aspects of the dynamic analysis of face landmarks to detect lies. In particular, it will delve into the behavior of the features used for detection and how these influence the system’s final decision. The novel detection system underlying the Face Truth Machine is able to analyze the subject’s expressions in a wide range of poses. 
The results of the experiments presented testify to the potential of the proposed approach and also highlight the very good results obtained in cross-dataset testing, which usually represents a challenge for other approaches.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"249 ","pages":"Article 104188"},"PeriodicalIF":4.3,"publicationDate":"2024-10-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142528463","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-10-15 DOI: 10.1016/j.cviu.2024.104205
Qingzheng Xu , Huiqiang Chen , Heming Du , Hu Zhang , Szymon Łukasik , Tianqing Zhu , Xin Yu
With the development of various generative models, misinformation in news media becomes more deceptive and easier to create, posing a significant problem. However, existing datasets for misinformation study often have limited modalities, constrained sources, and a narrow range of topics. These limitations make it difficult to train models that can effectively combat real-world misinformation. To address this, we propose a comprehensive, large-scale Multimodal Misinformation dataset for Media Authenticity Analysis (M³A), featuring broad sources and fine-grained annotations for topics and sentiments. To curate M³A, we collect genuine news content from 60 renowned news outlets worldwide and generate fake samples using multiple techniques. These include altering named entities in texts, swapping modalities between samples, creating new modalities, and misrepresenting movie content as news. M³A contains 708K genuine news samples and over 6M fake news samples, spanning text, images, audio, and video. M³A provides detailed multi-class labels, crucial for various misinformation detection tasks, including out-of-context detection and deepfake detection. For each task, we offer extensive benchmarks using state-of-the-art models, aiming to enhance the development of robust misinformation detection systems.
{"title":"M3A: A multimodal misinformation dataset for media authenticity analysis","authors":"Qingzheng Xu , Huiqiang Chen , Heming Du , Hu Zhang , Szymon Łukasik , Tianqing Zhu , Xin Yu","doi":"10.1016/j.cviu.2024.104205","DOIUrl":"10.1016/j.cviu.2024.104205","url":null,"abstract":"<div><div>With the development of various generative models, misinformation in news media becomes more deceptive and easier to create, posing a significant problem. However, existing datasets for misinformation study often have limited modalities, constrained sources, and a narrow range of topics. These limitations make it difficult to train models that can effectively combat real-world misinformation. To address this, we propose a comprehensive, large-scale Multimodal Misinformation dataset for Media Authenticity Analysis (<span><math><mrow><msup><mrow><mi>M</mi></mrow><mrow><mn>3</mn></mrow></msup><mi>A</mi></mrow></math></span>), featuring broad sources and fine-grained annotations for topics and sentiments. To curate <span><math><mrow><msup><mrow><mi>M</mi></mrow><mrow><mn>3</mn></mrow></msup><mi>A</mi></mrow></math></span>, we collect genuine news content from 60 renowned news outlets worldwide and generate fake samples using multiple techniques. These include altering named entities in texts, swapping modalities between samples, creating new modalities, and misrepresenting movie content as news. <span><math><mrow><msup><mrow><mi>M</mi></mrow><mrow><mn>3</mn></mrow></msup><mi>A</mi></mrow></math></span> contains 708K genuine news samples and over 6M fake news samples, spanning text, images, audio, and video. <span><math><mrow><msup><mrow><mi>M</mi></mrow><mrow><mn>3</mn></mrow></msup><mi>A</mi></mrow></math></span> provides detailed multi-class labels, crucial for various misinformation detection tasks, including out-of-context detection and deepfake detection. 
For each task, we offer extensive benchmarks using state-of-the-art models, aiming to enhance the development of robust misinformation detection systems.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"249 ","pages":"Article 104205"},"PeriodicalIF":4.3,"publicationDate":"2024-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142445584","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}