Pub Date: 2024-09-06 | DOI: 10.1016/j.cviu.2024.104140
Kainat Riaz, Muhammad Latif Anjum, Wajahat Hussain, Rohan Manzoor
Deep networks are susceptible to adversarial attacks. The end-to-end differentiability of deep networks provides an analytical formulation that has aided the proliferation of diverse adversarial attacks. In contrast, handcrafted pipelines (local feature matching, bag-of-words based place recognition, and visual tracking) rely on intuitive approaches and typically lack an end-to-end formal description. In this work, we show that classic handcrafted pipelines are also susceptible to adversarial attacks.
We propose a novel targeted adversarial attack for multiple well-known handcrafted pipelines and datasets. Our attack can make an image match any given target image, which may be completely different from the original image. Our approach attacks both a simple pipeline (image registration) and sophisticated multi-stage pipelines (FAB-MAP place recognition, ORB-SLAM3 visual tracking). We outperform multiple baselines on several public datasets (Places, KITTI, and HPatches).
Our analysis shows that, although these pipelines are vulnerable, achieving true imperceptibility is harder for targeted attacks on handcrafted pipelines. To this end, we propose a stealthy attack in which the noise is perceptible but appears benign. To assist the community in further examining the weaknesses of popular handcrafted pipelines, we release our code.
{"title":"Targeted adversarial attack on classic vision pipelines","authors":"Kainat Riaz, Muhammad Latif Anjum, Wajahat Hussain, Rohan Manzoor","doi":"10.1016/j.cviu.2024.104140","DOIUrl":"10.1016/j.cviu.2024.104140","url":null,"abstract":"<div><p>Deep networks are susceptible to adversarial attacks. End-to-end differentiability of deep networks provides the analytical formulation which has aided in proliferation of diverse adversarial attacks. On the contrary, handcrafted pipelines (local feature matching, bag-of-words based place recognition, and visual tracking) consist of intuitive approaches and perhaps lack end-to-end formal description. In this work, we show that classic handcrafted pipelines are also susceptible to adversarial attacks.</p><p>We propose a novel targeted adversarial attack for multiple well-known handcrafted pipelines and datasets. Our attack is able to match an image with any given target image which can be completely different from the original image. Our approach manages to attack simple (image registration) as well as sophisticated multi-stage (place recognition (FAB-MAP), visual tracking (ORB-SLAM3)) pipelines. We outperform multiple baselines over different public datasets (Places, KITTI and HPatches).</p><p>Our analysis shows that although vulnerable, achieving true imperceptibility is harder in case of targeted attack on handcrafted pipelines. To this end, we propose a stealthy attack where the noise is perceptible but appears benign. In order to assist the community in further examining the weakness of popular handcrafted pipelines we release our code.</p></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"249 ","pages":"Article 104140"},"PeriodicalIF":4.3,"publicationDate":"2024-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142168434","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-09-06 | DOI: 10.1016/j.cviu.2024.104147
Xuezhi Xiang, Xiaoheng Li, Weijie Bao, Yulong Qiao, Abdulmotaleb El Saddik
The estimation of 3D human poses from monocular videos presents a significant challenge. Existing methods face the problems of depth ambiguity and self-occlusion. To overcome these problems, we propose a Double-Branch Multi-Hypothesis Transformer (DBMHT). In detail, we utilize a double-branch architecture to capture temporal and spatial information and generate multiple hypotheses. To merge these hypotheses, we adopt a lightweight module to integrate spatial and temporal representations. The DBMHT can not only capture spatial information from each joint in the human body and temporal information from each frame in the video but also merge multiple hypotheses that carry different spatio-temporal information. Comprehensive evaluation on two challenging datasets (i.e., Human3.6M and MPI-INF-3DHP) demonstrates the superior performance of DBMHT, marking it as a robust and efficient approach for accurate 3D human pose estimation in dynamic scenarios. The results show that our model surpasses the state-of-the-art approach by 1.9% in MPJPE with ground-truth 2D keypoints as input.
{"title":"DBMHT: A double-branch multi-hypothesis transformer for 3D human pose estimation in video","authors":"Xuezhi Xiang , Xiaoheng Li , Weijie Bao , Yulong Qiao , Abdulmotaleb El Saddik","doi":"10.1016/j.cviu.2024.104147","DOIUrl":"10.1016/j.cviu.2024.104147","url":null,"abstract":"<div><p>The estimation of 3D human poses from monocular videos presents a significant challenge. The existing methods face the problems of deep ambiguity and self-occlusion. To overcome these problems, we propose a Double-Branch Multi-Hypothesis Transformer (DBMHT). In detail, we utilize a Double-Branch architecture to capture temporal and spatial information and generate multiple hypotheses. To merge these hypotheses, we adopt a lightweight module to integrate spatial and temporal representations. The DBMHT can not only capture spatial information from each joint in the human body and temporal information from each frame in the video but also merge multiple hypotheses that have different spatio-temporal information. Comprehensive evaluation on two challenging datasets (i.e. Human3.6M and MPI-INF-3DHP) demonstrates the superior performance of DBMHT, marking it as a robust and efficient approach for accurate 3D HPE in dynamic scenarios. The results show that our model surpasses the state-of-the-art approach by 1.9% MPJPE with ground truth 2D keypoints as input.</p></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"249 ","pages":"Article 104147"},"PeriodicalIF":4.3,"publicationDate":"2024-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142173646","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-09-06 | DOI: 10.1016/j.cviu.2024.104143
Francesco Tassone, Luca Maiano, Irene Amerini
Generative techniques continue to evolve at an impressively high rate, driven by the hype around these technologies. This rapid advancement severely limits the application of deepfake detectors, which, despite numerous efforts by the scientific community, struggle to achieve sufficiently robust performance against the ever-changing content. To address these limitations, in this paper we analyse two continual learning techniques on a Short and a Long sequence of fake media. Both sequences include a complex and heterogeneous range of deepfakes (generated images and videos) from GANs, computer graphics techniques, and unknown sources. Our experiments show that continual learning could be important in mitigating the need for generalizability. In fact, we show that, although with some limitations, continual learning methods help to maintain good performance across the entire training sequence. For these techniques to work in a sufficiently robust way, however, the tasks in the sequence must share similarities. In fact, according to our experiments, the order and similarity of the tasks can affect the performance of the models over time. To address this problem, we show that it is possible to group tasks based on their similarity. This small measure allows for a significant improvement even in longer sequences. This result suggests that continual techniques can be combined with the most promising detection methods, allowing them to catch up with the latest generative techniques. In addition, we outline how this learning approach can be integrated into a deepfake detection pipeline for continuous integration and continuous deployment (CI/CD). This makes it possible to keep track of different sources, such as social networks, new generative tools, or third-party datasets, and, through the integration of continual learning, enables constant maintenance of the detectors.
{"title":"Continuous fake media detection: Adapting deepfake detectors to new generative techniques","authors":"Francesco Tassone , Luca Maiano , Irene Amerini","doi":"10.1016/j.cviu.2024.104143","DOIUrl":"10.1016/j.cviu.2024.104143","url":null,"abstract":"<div><p>Generative techniques continue to evolve at an impressively high rate, driven by the hype about these technologies. This rapid advancement severely limits the application of deepfake detectors, which, despite numerous efforts by the scientific community, struggle to achieve sufficiently robust performance against the ever-changing content. To address these limitations, in this paper, we propose an analysis of two continuous learning techniques on a <em>Short</em> and a <em>Long</em> sequence of fake media. Both sequences include a complex and heterogeneous range of deepfakes (generated images and videos) from GANs, computer graphics techniques, and unknown sources. Our experiments show that continual learning could be important in mitigating the need for generalizability. In fact, we show that, although with some limitations, continual learning methods help to maintain good performance across the entire training sequence. For these techniques to work in a sufficiently robust way, however, it is necessary that the tasks in the sequence share similarities. In fact, according to our experiments, the order and similarity of the tasks can affect the performance of the models over time. To address this problem, we show that it is possible to group tasks based on their similarity. This small measure allows for a significant improvement even in longer sequences. This result suggests that continual techniques can be combined with the most promising detection methods, allowing them to catch up with the latest generative techniques. In addition to this, we propose an overview of how this learning approach can be integrated into a deepfake detection pipeline for continuous integration and continuous deployment (CI/CD). This allows you to keep track of different funds, such as social networks, new generative tools, or third-party datasets, and through the integration of continuous learning, allows constant maintenance of the detectors.</p></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"249 ","pages":"Article 104143"},"PeriodicalIF":4.3,"publicationDate":"2024-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S1077314224002248/pdfft?md5=055418833f110c748b5c22d95d3c42b9&pid=1-s2.0-S1077314224002248-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142240200","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Deep neural networks achieve outstanding results in a large variety of tasks, often outperforming human experts. However, a known limitation of current neural architectures is the difficulty of understanding and interpreting the network’s response to a given input. This is directly related to the huge number of variables and the associated non-linearities of neural models, which are often used as black boxes. This lack of transparency, particularly in crucial areas like autonomous driving, security, and healthcare, can trigger skepticism and limit trust, despite the networks’ high performance. In this work, we aim to advance interpretability in neural networks. We present Agglomerator++, a framework capable of providing a representation of part-whole hierarchies from visual cues and organizing the input distribution to match the conceptual-semantic hierarchical structure between classes. We evaluate our method on common datasets, such as SmallNORB, MNIST, FashionMNIST, CIFAR-10, and CIFAR-100, showing that our solution delivers a more interpretable model compared to other state-of-the-art approaches. Our code is available at https://mmlab-cv.github.io/Agglomeratorplusplus/.
{"title":"Agglomerator++: Interpretable part-whole hierarchies and latent space representations in neural networks","authors":"Zeno Sambugaro, Nicola Garau, Niccoló Bisagno, Nicola Conci","doi":"10.1016/j.cviu.2024.104159","DOIUrl":"10.1016/j.cviu.2024.104159","url":null,"abstract":"<div><p>Deep neural networks achieve outstanding results in a large variety of tasks, often outperforming human experts. However, a known limitation of current neural architectures is the poor accessibility in understanding and interpreting the network’s response to a given input. This is directly related to the huge number of variables and the associated non-linearities of neural models, which are often used as black boxes. This lack of transparency, particularly in crucial areas like autonomous driving, security, and healthcare, can trigger skepticism and limit trust, despite the networks’ high performance. In this work, we want to advance the interpretability in neural networks. We present Agglomerator++, a framework capable of providing a representation of part-whole hierarchies from visual cues and organizing the input distribution to match the conceptual-semantic hierarchical structure between classes. We evaluate our method on common datasets, such as SmallNORB, MNIST, FashionMNIST, CIFAR-10, and CIFAR-100, showing that our solution delivers a more interpretable model compared to other state-of-the-art approaches. Our code is available at <span><span>https://mmlab-cv.github.io/Agglomeratorplusplus/</span><svg><path></path></svg></span>.</p></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"249 ","pages":"Article 104159"},"PeriodicalIF":4.3,"publicationDate":"2024-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S1077314224002406/pdfft?md5=ad401203069cc93800237abddffe0b0d&pid=1-s2.0-S1077314224002406-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142168365","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-09-06 | DOI: 10.1016/j.cviu.2024.104167
Zhenyu Li, Pengjie Xu
Deep hashing is widely used for approximate nearest neighbor search in large-scale image recognition problems; however, such applications have been dominated by CNN architectures. In this study, we present a Pyramid Transformer-based Triplet Hashing architecture to handle large-scale place recognition challenges, leveraging the capabilities of the Vision Transformer (ViT). For feature representation, we create a Siamese Pyramid Transformer backbone. We present a multi-scale feature aggregation technique to learn discriminative, scale-invariant features. In addition, we observe that the binary codes produced by existing methods are sub-optimal for place recognition. To overcome this issue, we use a self-restraint triplet-loss deep learning network to create compact hash codes, further increasing recognition accuracy. To the best of our knowledge, this is the first study to use a triplet-loss deep learning network to handle the deep hashing learning problem. We conduct extensive experiments on four challenging place recognition datasets: KITTI, Nordland, VPRICE, and EuRoC. The experimental findings show that the proposed technique achieves state-of-the-art performance on large-scale visual place recognition challenges.
{"title":"Pyramid transformer-based triplet hashing for robust visual place recognition","authors":"Zhenyu Li , Pengjie Xu","doi":"10.1016/j.cviu.2024.104167","DOIUrl":"10.1016/j.cviu.2024.104167","url":null,"abstract":"<div><p>Deep hashing is being used to approximate nearest neighbor search for large-scale image recognition problems. However, CNN architectures have dominated similar applications. We present a Pyramid Transformer-based Triplet Hashing architecture to handle large-scale place recognition challenges in this study, leveraging the capabilities of Vision Transformer (ViT). For feature representation, we create a Siamese Pyramid Transformer backbone. We present a multi-scale feature aggregation technique to learn discriminative features for scale-invariant features. In addition, we observe that binary codes suitable for place recognition are sub-optimal. To overcome this issue, we use a self-restraint triplet loss deep learning network to create compact hash codes, further increasing recognition accuracy. To the best of our knowledge, this is the first study to use a triplet loss deep learning network to handle the deep hashing learning problem. We do extensive experiments on four difficult place datasets: KITTI, Nordland, VPRICE, and EuRoC. The experimental findings reveal that the suggested technique performs at the cutting edge of large-scale visual place recognition challenges.</p></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"249 ","pages":"Article 104167"},"PeriodicalIF":4.3,"publicationDate":"2024-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142168433","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-09-06 | DOI: 10.1016/j.cviu.2024.104154
Qi Zheng, Chaoyue Wang, Dadong Wang
Image paragraph captioning aims to describe a given image with a sequence of coherent sentences. Most existing methods model the coherence through a topic transition that dynamically infers a topic vector from preceding sentences. However, these methods still suffer from immediate or delayed repetitions in generated paragraphs because (i) the entanglement of syntax and semantics distracts the topic vector from attending to pertinent visual regions; and (ii) there are few constraints or rewards for learning long-range transitions. In this paper, we propose a bypass network that separately models the semantics and the linguistic syntax of preceding sentences. Specifically, the proposed model consists of two main modules, i.e., a topic transition module and a sentence generation module. The former takes previous semantic vectors as queries and applies an attention mechanism over regional features to acquire the next topic vector, which reduces immediate repetition by excluding linguistic syntax. The latter decodes the topic vector and the preceding syntax state to produce the following sentence. To further reduce delayed repetition in generated paragraphs, we devise a replacement-based reward for REINFORCE training. Comprehensive experiments on the widely used benchmark demonstrate the superiority of the proposed model over the state of the art in coherence while maintaining high accuracy.
{"title":"Bypass network for semantics driven image paragraph captioning","authors":"Qi Zheng , Chaoyue Wang , Dadong Wang","doi":"10.1016/j.cviu.2024.104154","DOIUrl":"10.1016/j.cviu.2024.104154","url":null,"abstract":"<div><p>Image paragraph captioning aims to describe a given image with a sequence of coherent sentences. Most existing methods model the coherence through the topic transition that dynamically infers a topic vector from preceding sentences. However, these methods still suffer from immediate or delayed repetitions in generated paragraphs because (i) the entanglement of syntax and semantics distracts the topic vector from attending pertinent visual regions; (ii) there are few constraints or rewards for learning long-range transitions. In this paper, we propose a bypass network that separately models semantics and linguistic syntax of preceding sentences. Specifically, the proposed model consists of two main modules, i.e. a topic transition module and a sentence generation module. The former takes previous semantic vectors as queries and applies attention mechanism on regional features to acquire the next topic vector, which reduces immediate repetition by eliminating linguistics. The latter decodes the topic vector and the preceding syntax state to produce the following sentence. To further reduce delayed repetition in generated paragraphs, we devise a replacement-based reward for the REINFORCE training. Comprehensive experiments on the widely used benchmark demonstrate the superiority of the proposed model over the state of the art for coherence while maintaining high accuracy.</p></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"249 ","pages":"Article 104154"},"PeriodicalIF":4.3,"publicationDate":"2024-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142168431","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-09-05 | DOI: 10.1016/j.cviu.2024.104146
Jianjian Yin, Shuai Yan, Tao Chen, Yi Chen, Yazhou Yao
Semantic segmentation achieves fine-grained scene parsing in any scenario, making it one of the key research directions facilitating the development of human visual attention mechanisms. Recent advancements in semi-supervised semantic segmentation have attracted considerable attention due to their potential in leveraging unlabeled data. However, existing methods focus only on exploiting the knowledge of unlabeled pixels with high-certainty predictions. Their insufficient mining of low-certainty regions of unlabeled data results in a significant loss of supervisory information. Therefore, this paper proposes the Class Probability Space Regularization (CPSR) approach to further exploit the potential of each unlabeled pixel. Specifically, we first design a class knowledge reshaping module to regularize the probability space of low-certainty pixels, thereby transforming them into high-certainty ones for supervised training. Furthermore, we propose a tail probability suppression module to suppress the probabilities of tail classes, which facilitates the network in learning more discriminative information from the class probability space. Extensive experiments conducted on the PASCAL VOC2012 and Cityscapes datasets prove that our method achieves state-of-the-art performance without introducing much computational overhead. Code is available at https://github.com/MKSAQW/CPSR.
{"title":"Class Probability Space Regularization for semi-supervised semantic segmentation","authors":"Jianjian Yin , Shuai Yan , Tao Chen , Yi Chen , Yazhou Yao","doi":"10.1016/j.cviu.2024.104146","DOIUrl":"10.1016/j.cviu.2024.104146","url":null,"abstract":"<div><p>Semantic segmentation achieves fine-grained scene parsing in any scenario, making it one of the key research directions to facilitate the development of human visual attention mechanisms. Recent advancements in semi-supervised semantic segmentation have attracted considerable attention due to their potential in leveraging unlabeled data. However, existing methods only focus on exploring the knowledge of unlabeled pixels with high certainty prediction. Their insufficient mining of low certainty regions of unlabeled data results in a significant loss of supervisory information. Therefore, this paper proposes the <strong>C</strong>lass <strong>P</strong>robability <strong>S</strong>pace <strong>R</strong>egularization (<strong>CPSR</strong>) approach to further exploit the potential of each unlabeled pixel. Specifically, we first design a class knowledge reshaping module to regularize the probability space of low certainty pixels, thereby transforming them into high certainty ones for supervised training. Furthermore, we propose a tail probability suppression module to suppress the probabilities of tailed classes, which facilitates the network to learn more discriminative information from the class probability space. Extensive experiments conducted on the PASCAL VOC2012 and Cityscapes datasets prove that our method achieves state-of-the-art performance without introducing much computational overhead. Code is available at <span><span>https://github.com/MKSAQW/CPSR</span><svg><path></path></svg></span>.</p></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"249 ","pages":"Article 104146"},"PeriodicalIF":4.3,"publicationDate":"2024-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142240203","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Artificial Intelligence (AI) techniques are becoming increasingly sophisticated, showing the potential to deeply understand and predict consumer behaviour in ways that can boost the retail sector; however, the retail-sensitive considerations underpinning their deployment have been poorly explored to date. This paper explores the application of AI technologies in the retail sector, focusing on their potential to enhance decision-making processes by preventing the major ethical risks inherent to them, such as the propagation of bias and systems’ lack of explainability. Drawing on recent literature on AI ethics, this study proposes a methodological path for the design and development of trustworthy, unbiased, and more explainable AI systems in the retail sector. This framework is grounded in European Union (EU) AI ethics principles and addresses the specific nuances of retail applications. To do this, we first examine the VRAI framework, a deep learning model used to analyse shopper interactions, people counting, and re-identification, to highlight the critical need for transparency and fairness in AI operations. Second, the paper proposes actionable strategies for integrating high-level ethical guidelines into practical settings and, in particular, for mitigating biases that lead to unfair outcomes in AI systems and improving their explainability. By doing so, the paper aims to show the key added value of embedding AI ethics requirements into AI practices and computer vision technology to truly promote technically and ethically robust AI in the retail domain.
{"title":"Embedding AI ethics into the design and use of computer vision technology for consumer’s behaviour understanding","authors":"Simona Tiribelli , Benedetta Giovanola , Rocco Pietrini , Emanuele Frontoni , Marina Paolanti","doi":"10.1016/j.cviu.2024.104142","DOIUrl":"10.1016/j.cviu.2024.104142","url":null,"abstract":"<div><p>Artificial Intelligence (AI) techniques are becoming more and more sophisticated showing the potential to deeply understand and predict consumer behaviour in a way to boost the retail sector; however, retail-sensitive considerations underpinning their deployment have been poorly explored to date. This paper explores the application of AI technologies in the retail sector, focusing on their potential to enhance decision-making processes by preventing major ethical risks inherent to them, such as the propagation of bias and systems’ lack of explainability. Drawing on recent literature on AI ethics, this study proposes a methodological path for the design and the development of trustworthy, unbiased, and more explainable AI systems in the retail sector. Such framework grounds on European (EU) AI ethics principles and addresses the specific nuances of retail applications. To do this, we first examine the VRAI framework, a deep learning model used to analyse shopper interactions, people counting and re-identification, to highlight the critical need for transparency and fairness in AI operations. Second, the paper proposes actionable strategies for integrating high-level ethical guidelines into practical settings, and particularly, to mitigate biases leading to unfair outcomes in AI systems and improve their explainability. By doing so, the paper aims to show the key added value of embedding AI ethics requirements into AI practices and computer vision technology to truly promote technically and ethically robust AI in the retail domain.</p></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"248 ","pages":"Article 104142"},"PeriodicalIF":4.3,"publicationDate":"2024-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S1077314224002236/pdfft?md5=e8ae2d0422401ca2e5087d68686b6387&pid=1-s2.0-S1077314224002236-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142148050","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-09-04 | DOI: 10.1016/j.cviu.2024.104145
Carmen Bisogni, Vincenzo Loia, Michele Nappi, Chiara Pero
The rapid evolution of synthetic voice generation and audio manipulation technologies poses significant challenges, raising societal and security concerns due to the risks of impersonation and the proliferation of audio deepfakes. This study introduces a lightweight machine learning (ML)-based framework designed to effectively distinguish between genuine and spoofed audio recordings. Departing from conventional deep learning (DL) approaches, which mainly rely on image-based spectrogram features or learning-based audio features, the proposed method utilizes a diverse set of hand-crafted audio features – such as spectral, temporal, chroma, and frequency-domain features – to enhance the accuracy of deepfake audio content detection. Through extensive evaluation and experiments on three well-known datasets, ASVSpoof2019, FakeAVCelebV2, and an In-The-Wild database, the proposed solution demonstrates robust performance and a high degree of generalization compared to state-of-the-art methods. In particular, our method achieved 89% accuracy on ASVSpoof2019, 94.5% on FakeAVCelebV2, and 94.67% on the In-The-Wild database. Additionally, experiments with explainability techniques clarify the decision-making processes within ML models, enhancing transparency and identifying the features most crucial for audio deepfake detection.
{"title":"Acoustic features analysis for explainable machine learning-based audio spoofing detection","authors":"Carmen Bisogni , Vincenzo Loia , Michele Nappi , Chiara Pero","doi":"10.1016/j.cviu.2024.104145","DOIUrl":"10.1016/j.cviu.2024.104145","url":null,"abstract":"<div><p>The rapid evolution of synthetic voice generation and audio manipulation technologies poses significant challenges, raising societal and security concerns due to the risks of impersonation and the proliferation of audio deepfakes. This study introduces a lightweight machine learning (ML)-based framework designed to effectively distinguish between genuine and spoofed audio recordings. Departing from conventional deep learning (DL) approaches, which mainly rely on image-based spectrogram features or learning-based audio features, the proposed method utilizes a diverse set of hand-crafted audio features – such as spectral, temporal, chroma, and frequency-domain features – to enhance the accuracy of deepfake audio content detection. Through extensive evaluation and experiments on three well-known datasets, ASVSpoof2019, FakeAVCelebV2, and an In-The-Wild database, the proposed solution demonstrates robust performance and a high degree of generalization compared to state-of-the-art methods. In particular, our method achieved 89% accuracy on ASVSpoof2019, 94.5% on FakeAVCelebV2, and 94.67% on the In-The-Wild database. Additionally, the experiments performed on explainability techniques clarify the decision-making processes within ML models, enhancing transparency and identifying crucial features essential for audio deepfake detection.</p></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"249 ","pages":"Article 104145"},"PeriodicalIF":4.3,"publicationDate":"2024-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S1077314224002261/pdfft?md5=fececaff6052bf1288f308ae6213933a&pid=1-s2.0-S1077314224002261-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142173648","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Data from a single modality may suffer from noise, low contrast, or other imaging limitations that affect the model’s accuracy. Furthermore, due to the limited amount of data, most models trained on single-modality data tend to overfit the training set and perform poorly on out-of-domain data. Therefore, in this paper, we propose a network named Cross-Modal Reasoning and Multi-Task Learning Network (CRML-Net), which combines cross-modal reasoning and multi-task learning, aiming to leverage the complementary information between different modalities and tasks to enhance the model’s generalization ability and accuracy. Specifically, CRML-Net consists of two stages. In the first stage, our network extracts a new morphological information modality from the original image and then performs cross-modal fusion with the original modality image, aiming to leverage the morphological information to enhance the model’s robustness to out-of-domain datasets. In the second stage, based on the output of the previous stage, we introduce a multi-task learning mechanism, aiming to improve the model’s performance on unseen data by sharing surface detail information from auxiliary tasks. We validated our method on a publicly available tooth cone beam computed tomography dataset. Our evaluation demonstrates that our method outperforms state-of-the-art approaches.
{"title":"CRML-Net: Cross-Modal Reasoning and Multi-Task Learning Network for tooth image segmentation","authors":"Yingda Lyu , Zhehao Liu , Yingxin Zhang , Haipeng Chen , Zhimin Xu","doi":"10.1016/j.cviu.2024.104138","DOIUrl":"10.1016/j.cviu.2024.104138","url":null,"abstract":"<div><p>Data from a single modality may suffer from noise, low contrast, or other imaging limitations that affect the model’s accuracy. Furthermore, due to the limited amount of data, most models trained on single-modality data tend to overfit the training set and perform poorly on out-of-domain data. Therefore, in this paper, we propose a network named Cross-Modal Reasoning and Multi-Task Learning Network (CRML-Net), which combines cross-modal reasoning and multi-task learning, aiming to leverage the complementary information between different modalities and tasks to enhance the model’s generalization ability and accuracy. Specifically, CRML-Net consists of two stages. In the first stage, our network extracts a new morphological information modality from the original image and then performs cross-modal fusion with the original modality image, aiming to leverage the morphological information to enhance the model’s robustness to out-of-domain datasets. In the second stage, based on the output of the previous stage, we introduce a multi-task learning mechanism, aiming to improve the model’s performance on unseen data by sharing surface detail information from auxiliary tasks. We validated our method on a publicly available tooth cone beam computed tomography dataset. Our evaluation demonstrates that our method outperforms state-of-the-art approaches.</p></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"248 ","pages":"Article 104138"},"PeriodicalIF":4.3,"publicationDate":"2024-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142148047","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}