Automated anthropometric measurements from 3D point clouds of scanned bodies
Pub Date: 2024-10-24 | DOI: 10.1016/j.imavis.2024.105306
Nahuel E. Garcia-D’Urso, Antonio Macia-Lillo, Higinio Mora-Mora, Jorge Azorin-Lopez, Andres Fuster-Guillo
Anthropometry plays a critical role across numerous sectors, particularly healthcare and fashion, by facilitating the analysis of human body structure. Anthropometric data are crucial for assessing nutritional status in both children and adults, enabling early detection of conditions such as malnutrition, overweight, and obesity, and they are instrumental in designing tailored dietary interventions. This study introduces a novel automated technique for extracting anthropometric measurements from any body part. The proposed method leverages a parametric model to accurately determine the measurement parameters from either an unstructured point cloud or a mesh. We conducted a comprehensive evaluation of our approach by comparing perimeter measurements from over 400 body scans with expert assessments and existing state-of-the-art methods. The results demonstrate that our approach significantly surpasses current methods for measuring the waist, hip, thigh, chest, and wrist perimeters with exceptional accuracy. These findings indicate the potential of our method to automate anthropometric analysis and to provide efficient, accurate measurements for applications in the healthcare and fashion industries.
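The abstract does not detail the parametric model, but the core measurement task can be illustrated with a minimal sketch: slice the point cloud with a horizontal plane at the landmark height, project the slab to 2D, and approximate the perimeter from the resulting polygon. The function name, tolerance, and convex-hull approximation below are illustrative assumptions, not the authors' pipeline.

```python
import numpy as np
from scipy.spatial import ConvexHull

def slice_perimeter(points, height, tol=0.005):
    """Approximate a body perimeter by slicing a point cloud at a given height.

    points: (N, 3) array of x, y, z coordinates in metres.
    height: z value of the cutting plane.
    tol: half-thickness of the slab of points kept around the plane.
    """
    slab = points[np.abs(points[:, 2] - height) < tol]  # points near the plane
    if len(slab) < 3:
        raise ValueError("not enough points at this height")
    xy = slab[:, :2]                      # project onto the xy plane
    hull = ConvexHull(xy)                 # 2D convex hull of the slice
    ring = xy[hull.vertices]              # hull vertices in counter-clockwise order
    closed = np.vstack([ring, ring[:1]])  # close the polygon
    return np.linalg.norm(np.diff(closed, axis=0), axis=1).sum()  # polygon length

# Example: waist perimeter of a synthetic cylindrical "torso" of radius 0.15 m
theta = np.random.uniform(0, 2 * np.pi, 20000)
z = np.random.uniform(0.9, 1.1, 20000)
cloud = np.stack([0.15 * np.cos(theta), 0.15 * np.sin(theta), z], axis=1)
print(slice_perimeter(cloud, height=1.0))  # close to 2 * pi * 0.15, about 0.94 m
```

The convex hull behaves like a taut tape measure around the slice; handling noisy or incomplete scans and non-planar measurement paths is where fitted parametric models such as the one proposed here come in.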
{"title":"Automated anthropometric measurements from 3D point clouds of scanned bodies","authors":"Nahuel E. Garcia-D’Urso, Antonio Macia-Lillo, Higinio Mora-Mora, Jorge Azorin-Lopez, Andres Fuster-Guillo","doi":"10.1016/j.imavis.2024.105306","DOIUrl":"10.1016/j.imavis.2024.105306","url":null,"abstract":"<div><div>Anthropometry plays a critical role across numerous sectors, particularly within healthcare and fashion, by facilitating the analysis of the human body structure. The significance of anthropometric data cannot be overstated; it is crucial for assessing nutritional status among children and adults alike, enabling early detection of conditions such as malnutrition, obesity, and being overweight. Furthermore, it is instrumental in creating tailored dietary interventions. This study introduces a novel automated technique for extracting anthropometric measurements from any body part. The proposed method leverages a parametric model to accurately determine the measurement parameters from either an unstructured point cloud or a mesh. We conducted a comprehensive evaluation of our approach by comparing perimetral measurements from over 400 body scans with expert assessments and existing state-of-the-art methods. The results demonstrate that our approach significantly surpasses the current methods for measuring the waist, hip, thigh, chest, and wrist perimeters with exceptional accuracy. These findings indicate the potential of our method to automate anthropometric analysis and offer efficient and accurate measurements for various applications in healthcare and fashion industries.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"152 ","pages":"Article 105306"},"PeriodicalIF":4.2,"publicationDate":"2024-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142571748","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Gaussian error loss function for image smoothing
Pub Date: 2024-10-22 | DOI: 10.1016/j.imavis.2024.105300
Wenzheng Dong, Lanling Zeng, Shunli Ji, Yang Yang
Edge-preserving image smoothing plays an important role in image processing and computational photography and is widely used in a variety of applications. Edge-preserving filters based on global optimization models have attracted widespread attention due to their high smoothing quality. According to existing research, edge-preserving capability is strongly correlated with the penalty function used for gradient regularization. By analyzing the edge-stopping functions of existing penalties, we demonstrate that existing image smoothing models are not adequately edge-preserving. In this paper, based on the Gaussian error function (ERF), we propose a Gaussian error loss function (ERLF), which shows stronger edge-preserving capability. We embed the proposed loss function into a global optimization model for edge-preserving image smoothing. In addition, we propose an efficient solver based on additive half-quadratic minimization and Fourier-domain optimization that processes 720p color images in real time (over 20 fps) on an NVIDIA RTX 3070 GPU. We have tested the proposed filter on a number of low-level vision tasks. Both quantitative and qualitative experimental results show that it outperforms existing filters, making it practical for real applications.
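For context, global edge-preserving smoothing is usually posed as a data term plus a penalized gradient term; the sketch below shows a generic form of such an objective together with the Gauss error function on which the proposed loss builds (the exact ERLF expression is not given in the abstract).

```latex
\min_{u}\; \sum_{p} \left( u_p - f_p \right)^2
  \;+\; \lambda \sum_{p} \Big[ \psi\!\left(\partial_x u_p\right) + \psi\!\left(\partial_y u_p\right) \Big],
\qquad
\operatorname{erf}(x) = \frac{2}{\sqrt{\pi}} \int_{0}^{x} e^{-t^{2}}\, dt,
```

where f is the input image, u the smoothed result, lambda the smoothing strength, and psi the gradient penalty; choosing psi so that its associated edge-stopping weight decays rapidly for large gradients is what keeps strong edges from being smoothed away.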
{"title":"Gaussian error loss function for image smoothing","authors":"Wenzheng Dong, Lanling Zeng, Shunli Ji, Yang Yang","doi":"10.1016/j.imavis.2024.105300","DOIUrl":"10.1016/j.imavis.2024.105300","url":null,"abstract":"<div><div>Edge-preserving image smoothing plays an important role in the fields of image processing and computational photography, and is widely used for a variety of applications. The edge-preserving filters based on global optimization models have attracted widespread attention due to their nice smoothing quality. According to existing research, the edge-preserving capability is strongly correlated to the penalty function used for gradient regularization. By analyzing the edge-stopping function of existing penalties, we demonstrate that existing image smoothing models are not adequately edge-preserving. In this paper, based on a Gaussian error function (ERF), we propose a Gaussian error loss function (ERLF), which shows stronger edge-preserving capability. We embed the proposed loss function into a global optimization model for edge-preserving image smoothing. In addition, we propose an efficient solution based on additive half-quadratic minimization and Fourier-domain optimization that is capable of processing 720P color images (over 20 fps) in real-time on an NVIDIA RTX 3070 GPU. We have experimented with the proposed filter on a number of low-level vision tasks. Both quantitative and qualitative experimental results show that the proposed filter outperforms existing filters. Therefore, it can be practical for real applications.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"152 ","pages":"Article 105300"},"PeriodicalIF":4.2,"publicationDate":"2024-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142552769","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multi-object tracking using score-driven hierarchical association strategy between predicted tracklets and objects
Pub Date: 2024-10-19 | DOI: 10.1016/j.imavis.2024.105303
Tianyi Zhao, Guanci Yang, Yang Li, Minglang Lu, Haoran Sun
Machine vision is one of the major technologies underpinning intelligent robots' human-centered embodied intelligence. Especially in complex dynamic scenes involving multiple people, Multi-Object Tracking (MOT), which accurately identifies and tracks specific targets, significantly influences intelligent robots' performance in behavior perception and monitoring, autonomous decision-making, and personalized humanoid services. To address the lost targets and identity switches caused by object scale variations and frequent overlaps during tracking, this paper presents a multi-object tracking method using a score-driven hierarchical association strategy between predicted tracklets and objects (ScoreMOT). First, motion prediction of occluded objects based on bounding box variation (MPOBV) is proposed to estimate the position of occluded objects; MPOBV models the motion state of an object using its bounding box and confidence score. Then, a score-driven hierarchical association strategy between predicted tracklets and objects (SHAS) is proposed to associate them correctly in frequently overlapping scenarios; SHAS associates predicted tracklets with detected objects of different confidence levels in different stages. Comparisons with 16 state-of-the-art methods on the Multiple Object Tracking Benchmark 20 (MOT20) and DanceTrack datasets show that ScoreMOT outperforms the compared methods.
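A minimal sketch of confidence-tiered association in the spirit of SHAS (the actual ScoreMOT cost terms, thresholds, and stages are not specified in the abstract): high-confidence detections are matched to predicted tracklets first, and the remaining tracklets are then matched against low-confidence detections.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / (union + 1e-9)

def associate(tracklets, detections, scores, high=0.6, min_iou=0.3):
    """Two-stage association: high-score detections first, low-score ones second."""
    unmatched = list(range(len(tracklets)))          # tracklet indices still free
    matches = []
    for stage_high in (True, False):                 # stage 1, then stage 2
        det_idx = [i for i, s in enumerate(scores) if (s >= high) == stage_high]
        if not det_idx or not unmatched:
            continue
        cost = np.array([[1.0 - iou(tracklets[t], detections[d]) for d in det_idx]
                         for t in unmatched])
        rows, cols = linear_sum_assignment(cost)     # Hungarian matching
        still_free = set(range(len(unmatched)))
        for r, c in zip(rows, cols):
            if cost[r, c] <= 1.0 - min_iou:          # accept only well-overlapping pairs
                matches.append((unmatched[r], det_idx[c]))
                still_free.discard(r)
        unmatched = [unmatched[r] for r in sorted(still_free)]
    return matches, unmatched                        # (tracklet, detection) pairs and lost tracklets
```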
{"title":"Multi-object tracking using score-driven hierarchical association strategy between predicted tracklets and objects","authors":"Tianyi Zhao, Guanci Yang, Yang Li, Minglang Lu, Haoran Sun","doi":"10.1016/j.imavis.2024.105303","DOIUrl":"10.1016/j.imavis.2024.105303","url":null,"abstract":"<div><div>Machine vision is one of the major technologies to guarantee intelligent robots’ human-centered embodied intelligence. Especially in the complex dynamic scene involving multi-person, Multi-Object Tracking (MOT), which can accurately identify and track specific targets, significantly influences intelligent robots’ performance regarding behavior perception and monitoring, autonomous decision-making, and providing personalized humanoid services. In order to solve the problem of targets lost and identity switches caused by the scale variations of objects and frequent overlaps during the tracking process, this paper presents a multi-object tracking method using score-driven hierarchical association strategy between predicted tracklets and objects (ScoreMOT). Firstly, a motion prediction of occluded objects based on bounding box variation (MPOBV) is proposed to estimate the position of occluded objects. MPOBV models the motion state of the object using the bounding box and confidence score. Then, a score-driven hierarchical association strategy between predicted tracklets and objects (SHAS) is proposed to correctly associate them in frequently overlapping scenarios. SHAS associates the predicted tracklets and detected objects with different confidence in different stages. The comparison results with 16 state-of-the-art methods on Multiple Object Tracking Benchmark 20 (MOT20) and DanceTrack datasets are conducted, and ScoreMOT outperforms the compared methods.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"152 ","pages":"Article 105303"},"PeriodicalIF":4.2,"publicationDate":"2024-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142552768","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A review of recent advances in 3D Gaussian Splatting for optimization and reconstruction
Pub Date: 2024-10-19 | DOI: 10.1016/j.imavis.2024.105304
Jie Luo, Tianlun Huang, Weijun Wang, Wei Feng
3D Gaussian Splatting (3DGS) represents a significant breakthrough in computer graphics and vision, offering an explicit scene representation and novel view synthesis without the reliance on neural networks, unlike Neural Radiance Fields (NeRF). This paper provides a comprehensive survey of recent research on 3DGS optimization and reconstruction, with a particular focus on studies featuring published or forthcoming open-source code. In terms of optimization, the paper examines techniques such as compression, densification, splitting, anti-aliasing, and reflection enhancement. For reconstruction, it explores methods including surface mesh extraction, sparse-view object and scene reconstruction, large-scale scene reconstruction, and dynamic object and scene reconstruction. Through comparative analysis and case studies, the paper highlights the practical advantages of 3DGS and outlines future research directions, offering valuable insights for advancing the field.
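For readers new to the representation, the core rendering equations of the original 3DGS formulation are stated here for context (the surveyed methods modify or extend them):

```latex
G(\mathbf{x}) = \exp\!\Big(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\top}\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})\Big),
\qquad
C = \sum_{i} c_i \, \alpha_i \prod_{j<i} \left(1 - \alpha_j\right),
```

where each splat is an anisotropic Gaussian with mean mu and covariance Sigma, alpha_i is the splat's opacity modulated by its projected 2D Gaussian at the pixel, and c_i is its view-dependent color blended front to back. Most of the optimization techniques surveyed (compression, densification, splitting) operate on this set of per-Gaussian parameters.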
{"title":"A review of recent advances in 3D Gaussian Splatting for optimization and reconstruction","authors":"Jie Luo, Tianlun Huang, Weijun Wang, Wei Feng","doi":"10.1016/j.imavis.2024.105304","DOIUrl":"10.1016/j.imavis.2024.105304","url":null,"abstract":"<div><div>3D Gaussian Splatting (3DGS) represents a significant breakthrough in computer graphics and vision, offering an explicit scene representation and novel view synthesis without the reliance on neural networks, unlike Neural Radiance Fields (NeRF). This paper provides a comprehensive survey of recent research on 3DGS optimization and reconstruction, with a particular focus on studies featuring published or forthcoming open-source code. In terms of optimization, the paper examines techniques such as compression, densification, splitting, anti-aliasing, and reflection enhancement. For reconstruction, it explores methods including surface mesh extraction, sparse-view object and scene reconstruction, large-scale scene reconstruction, and dynamic object and scene reconstruction. Through comparative analysis and case studies, the paper highlights the practical advantages of 3DGS and outlines future research directions, offering valuable insights for advancing the field.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"151 ","pages":"Article 105304"},"PeriodicalIF":4.2,"publicationDate":"2024-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142525914","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
FHLight: A novel method of indoor scene illumination estimation using improved loss function
Pub Date: 2024-10-19 | DOI: 10.1016/j.imavis.2024.105299
Yang Wang, Ao Wang, Shijia Song, Fan Xie, Chang Ma, Jiawei Xu, Lijun Zhao
In augmented reality tasks, especially in indoor scenes, achieving illumination consistency between virtual objects and real environments is a critical challenge. The current mainstream methods are illumination parameter regression and illumination map generation, yet few works in either category can effectively recover both the high-frequency and the low-frequency illumination information of indoor scenes. In this work, we argue that effective restoration of low-frequency illumination information forms the foundation for capturing high-frequency illumination details. Accordingly, we propose a novel illumination estimation method called FHLight. Technically, we use a low-frequency spherical harmonic irradiance map (LFSHIM), restored by a low-frequency illumination regression network (LFIRN), as prior information to guide a high-frequency illumination generator (HFIG) that restores the final illumination map. Furthermore, we introduce an improved loss function to optimize network training, ensuring that the model accurately restores both low-frequency and high-frequency illumination information within the scene. We compare FHLight with several competitive methods, and the results demonstrate significant improvements in metrics such as RMSE, si-RMSE, and angular error. In addition, visual experiments further confirm that FHLight generates scene illumination maps with realistic frequency content, effectively resolving the illumination consistency issue between virtual objects and real scenes. The code is available at https://github.com/WA-tyro/FHLight.git.
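The abstract does not specify how the LFSHIM is constructed, but low-frequency illumination priors are conventionally represented by projecting an environment map onto low-order spherical harmonics; the sketch below shows such a projection for SH bands 0 and 1 (the function name and the equirectangular-map assumption are illustrative).

```python
import numpy as np

def sh_project_env(env):
    """Project an equirectangular HDR environment map (H, W, 3) onto
    low-order real spherical harmonics (bands 0 and 1, i.e. 4 RGB coefficients)."""
    h, w, _ = env.shape
    theta = (np.arange(h) + 0.5) / h * np.pi            # polar angle per row
    phi = (np.arange(w) + 0.5) / w * 2 * np.pi          # azimuth per column
    theta, phi = np.meshgrid(theta, phi, indexing="ij")
    x = np.sin(theta) * np.cos(phi)
    y = np.sin(theta) * np.sin(phi)
    z = np.cos(theta)
    basis = np.stack([
        0.282095 * np.ones_like(x),   # Y_0^0
        0.488603 * y,                 # Y_1^-1
        0.488603 * z,                 # Y_1^0
        0.488603 * x,                 # Y_1^1
    ])
    d_omega = (2 * np.pi / w) * (np.pi / h) * np.sin(theta)    # solid angle per pixel
    return np.einsum("khw,hwc,hw->kc", basis, env, d_omega)    # (4, 3) RGB coefficients

# A constant white environment puts almost all energy in the band-0 coefficient
coeffs = sh_project_env(np.ones((64, 128, 3)))
```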
{"title":"FHLight: A novel method of indoor scene illumination estimation using improved loss function","authors":"Yang Wang , Ao Wang , Shijia Song , Fan Xie , Chang Ma , Jiawei Xu , Lijun Zhao","doi":"10.1016/j.imavis.2024.105299","DOIUrl":"10.1016/j.imavis.2024.105299","url":null,"abstract":"<div><div>In augmented reality tasks, especially in indoor scenes, achieving illumination consistency between virtual objects and real environments is a critical challenge. Currently, mainstream methods are illumination parameters regression and illumination map generation. Among these two categories of methods, few works can effectively recover both high-frequency and low-frequency illumination information within indoor scenes. In this work, we argue that effective restoration of low-frequency illumination information forms the foundation for capturing high-frequency illumination details. In this way, we propose a novel illumination estimation method called FHLight. Technically, we use a low-frequency spherical harmonic irradiance map (LFSHIM) restored by the low-frequency illumination regression network (LFIRN) as prior information to guide the high-frequency illumination generator (HFIG) to restore the illumination map. Furthermore, we suggest an improved loss function to optimize the network training procedure, ensuring that the model accurately restores both low-frequency and high-frequency illumination information within the scene. We compare FHLight with several competitive methods, and the results demonstrate significant improvements in metrics such as RMSE, si-RMSE, and Angular error. In addition, visual experiments further confirm that FHLight is capable of generating scene illumination maps with genuine frequencies, effectively resolving the illumination consistency issue between virtual objects and real scenes. The code is available at <span><span>https://github.com/WA-tyro/FHLight.git</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"152 ","pages":"Article 105299"},"PeriodicalIF":4.2,"publicationDate":"2024-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142552851","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Feature differences reduction and specific features preserving network for RGB-T salient object detection
Pub Date: 2024-10-18 | DOI: 10.1016/j.imavis.2024.105302
Qiqi Xu, Zhenguang Di, Haoyu Dong, Gang Yang
In RGB-T salient object detection (SOD), effective utilization of the different characteristics of the RGB and thermal modalities is essential for accurate detection. Most previous methods focus only on reducing the differences between modalities, which may ignore modality-specific features that are crucial for salient object detection, leading to suboptimal results. To address this issue, we propose an RGB-T SOD network that simultaneously reduces modality differences and preserves specific features. Specifically, we construct a modality differences reduction and specific features preserving module (MDRSFPM), which aims to bridge the gap between modalities and enhance the specific features of each. In MDRSFPM, a dynamic vector generated by the interaction of RGB and thermal features is used to reduce modality differences, and a dual branch then handles the RGB and thermal modalities separately, employing a combination of channel-level and spatial-level operations to preserve their respective specific features. In addition, a multi-scale global feature enhancement module (MGFEM) is proposed to enhance global contextual information and provide guidance for the subsequent decoding stage, so that the model can more easily localize salient objects. Furthermore, our approach includes a fully fusion and gate module (FFGM) that utilizes dynamically generated importance maps to selectively filter and fuse features during decoding. Extensive experiments demonstrate that our proposed model remarkably surpasses other state-of-the-art models on three publicly available RGB-T datasets. Our code will be released at https://github.com/JOOOOKII/FRPNet.
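A generic sketch of gated, importance-map-driven fusion of two modalities (not the exact MDRSFPM/FFGM design, whose internals the abstract only outlines): a learned per-pixel gate blends the RGB and thermal feature maps.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Generic gated fusion of RGB and thermal feature maps.

    A 1x1 convolution predicts a per-pixel importance map from the concatenated
    features; the output is a spatially weighted blend of the two modalities.
    """
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),                      # importance map in [0, 1]
        )

    def forward(self, feat_rgb, feat_t):
        g = self.gate(torch.cat([feat_rgb, feat_t], dim=1))
        return g * feat_rgb + (1.0 - g) * feat_t

# Illustrative shapes: batch of 2, 64-channel feature maps at 44x44 resolution
fused = GatedFusion(64)(torch.randn(2, 64, 44, 44), torch.randn(2, 64, 44, 44))
```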
{"title":"Feature differences reduction and specific features preserving network for RGB-T salient object detection","authors":"Qiqi Xu, Zhenguang Di, Haoyu Dong, Gang Yang","doi":"10.1016/j.imavis.2024.105302","DOIUrl":"10.1016/j.imavis.2024.105302","url":null,"abstract":"<div><div>In RGB-T salient object detection, effective utilization of the different characteristics of RGB and thermal modalities is essential to achieve accurate detection. Most of the previous methods usually only focus on reducing the differences between modalities, which may ignore the specific features that are crucial for salient object detection, leading to suboptimal results. To address the above issue, an RGB-T SOD network that simultaneously considers the reduction of modality differences and the preservation of specific features is proposed. Specifically, we construct a modality differences reduction and specific features preserving module (MDRSFPM) which aims to bridge the gap between modalities and enhance the specific features of each modality. In MDRSFPM, the dynamic vector generated by the interaction of RGB and thermal features is used to reduce modality differences, and then a dual branch is constructed to deal with the RGB and thermal modalities separately, employing a combination of channel-level and spatial-level operations to preserve their respective specific features. In addition, a multi-scale global feature enhancement module (MGFEM) is proposed to enhance global contextual information to provide guidance information for the subsequent decoding stage, so that the model can more easily localize the salient objects. Furthermore, our approach includes a fully fusion and gate module (FFGM) that utilizes dynamically generated importance maps to selectively filter and fuse features during the decoding process. Extensive experiments demonstrate that our proposed model surpasses other state-of-the-art models on three publicly available RGB-T datasets remarkably. Our code will be released at <span><span>https://github.com/JOOOOKII/FRPNet</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"152 ","pages":"Article 105302"},"PeriodicalIF":4.2,"publicationDate":"2024-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142561279","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pyramid quaternion discrete cosine transform based ConvNet for cancelable face recognition
Pub Date: 2024-10-13 | DOI: 10.1016/j.imavis.2024.105301
Zhuhong Shao, Zuowei Zhang, Leding Li, Hailiang Li, Xuanyi Li, Bicao Li, Yuanyuan Shang, Bin Chen
In the current face-scanning era, identity authentication can be performed quickly and conveniently, but face images simultaneously carry sensitive information. In this context, we introduce a novel cancelable face recognition methodology using a quaternion transform based convolutional network. First, face images in different modalities (e.g., RGB and depth or near-infrared) are encoded into a full quaternion matrix for synchronous processing. Based on the designed multiresolution quaternion singular value decomposition, we obtain a pyramid representation. The representations are then transformed through random projection to make the process noninvertible; even if a feature template is compromised, a new one can be generated. Subsequently, a three-stream convolutional network is developed to learn features, whose predefined filters stem from the quaternion two-dimensional discrete cosine transform basis. Extensive experiments on the TIII-D, NVIE and CASIA datasets demonstrate that the proposed method achieves competitive performance while satisfying the redistributability and irreversibility requirements.
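The cancelable property rests on the random projection step. Below is a minimal sketch under common template-protection assumptions (a key-seeded Gaussian projection; the paper's exact projection and dimensions are not given in the abstract).

```python
import numpy as np

def cancelable_template(features, user_key, out_dim=128):
    """Generic random-projection template protection sketch.

    A user/application-specific key seeds a Gaussian projection matrix; if the
    template leaks, issuing a new key yields a new, unlinkable template, and the
    many-to-one projection makes recovering the original features ill-posed.
    """
    rng = np.random.default_rng(user_key)
    proj = rng.standard_normal((out_dim, features.shape[-1])) / np.sqrt(out_dim)
    return features @ proj.T

feat = np.random.rand(512)                       # stand-in for a learned face feature
t1 = cancelable_template(feat, user_key=42)      # enrolled template
t2 = cancelable_template(feat, user_key=7)       # re-issued template after compromise
```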
{"title":"Pyramid quaternion discrete cosine transform based ConvNet for cancelable face recognition","authors":"Zhuhong Shao , Zuowei Zhang , Leding Li , Hailiang Li , Xuanyi Li , Bicao Li , Yuanyuan Shang , Bin Chen","doi":"10.1016/j.imavis.2024.105301","DOIUrl":"10.1016/j.imavis.2024.105301","url":null,"abstract":"<div><div>The current <em>face scanning era</em> can quickly and conveniently attain identity authentication, but face images imply sensitive information simultaneously. Under such context, we introduce a novel cancelable face recognition methodology by using quaternion transform based convolutional network. Firstly, face images in different modalities (e.g., RGB and depth or near-infrared) are encoded into full quaternion matrix for synchronous processing. Based on the designed multiresolution quaternion singular value decomposition, we can obtain pyramid representation. Then they are transformed through random projection for making the process noninvertible. Even if the feature template is compromised, a new one can be generated. Subsequently, a three-stream convolutional network is developed to learn features, where predefined filters are stemmed from quaternion two-dimensional discrete cosine transform basis. Extensive experiments on the TIII-D, NVIE and CASIA datasets have demonstrated that the proposed method obtains competitive performance, also satisfies redistributable and irreversible.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"151 ","pages":"Article 105301"},"PeriodicalIF":4.2,"publicationDate":"2024-10-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142445305","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A decision support system for acute lymphoblastic leukemia detection based on explainable artificial intelligence
Pub Date: 2024-10-11 | DOI: 10.1016/j.imavis.2024.105298
Angelo Genovese, Vincenzo Piuri, Fabio Scotti
The detection of acute lymphoblastic leukemia (ALL) via deep learning (DL) has received great interest because of its high accuracy in detecting lymphoblasts without the need for handcrafted feature extraction. However, current DL models, such as convolutional neural networks and vision Transformers, are extremely complex, making them black boxes that perform classification in an obscure way. To compensate for this and increase the explainability of the decisions made by such methods, in this paper, we propose an innovative decision support system for ALL detection that is based on DL and explainable artificial intelligence (XAI). Our approach first introduces causality into the decision with a metric learning approach, enabling a decision to be made by analyzing the most similar images in the database. Second, our method integrates XAI techniques to allow even non-trained personnel to obtain an informed decision by analyzing which regions of the images are most similar and how the samples are organized in the latent space. The results on publicly available ALL databases confirm the validity of our approach in opening the black box while achieving similar or superior accuracy to that of existing approaches.
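A minimal sketch of the retrieval-style decision the abstract describes, assuming a metric-learned embedding is already available (the embedding extraction, the value of k, and the voting rule are illustrative choices, not the paper's exact procedure).

```python
import numpy as np

def knn_decision(query_emb, gallery_embs, gallery_labels, k=5):
    """Classify a sample by retrieving its most similar gallery images.

    Cosine similarity in the learned embedding space; the retrieved neighbours
    double as the evidence shown to the clinician ("these are the k most
    similar cells and their known labels").
    """
    q = query_emb / np.linalg.norm(query_emb)
    g = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    sims = g @ q                                   # cosine similarity to every gallery item
    top = np.argsort(-sims)[:k]                    # indices of the k nearest neighbours
    votes = np.asarray(gallery_labels)[top]        # integer class ids of the neighbours
    label = np.bincount(votes).argmax()            # majority vote among neighbours
    return label, top, sims[top]
```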
{"title":"A decision support system for acute lymphoblastic leukemia detection based on explainable artificial intelligence","authors":"Angelo Genovese, Vincenzo Piuri, Fabio Scotti","doi":"10.1016/j.imavis.2024.105298","DOIUrl":"10.1016/j.imavis.2024.105298","url":null,"abstract":"<div><div>The detection of acute lymphoblastic leukemia (ALL) via deep learning (DL) has received great interest because of its high accuracy in detecting lymphoblasts without the need for handcrafted feature extraction. However, current DL models, such as convolutional neural networks and vision Transformers, are extremely complex, making them black boxes that perform classification in an obscure way. To compensate for this and increase the explainability of the decisions made by such methods, in this paper, we propose an innovative decision support system for ALL detection that is based on DL and explainable artificial intelligence (XAI). Our approach first introduces causality into the decision with a metric learning approach, enabling a decision to be made by analyzing the most similar images in the database. Second, our method integrates XAI techniques to allow even non-trained personnel to obtain an informed decision by analyzing which regions of the images are most similar and how the samples are organized in the latent space. The results on publicly available ALL databases confirm the validity of our approach in opening the black box while achieving similar or superior accuracy to that of existing approaches.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"151 ","pages":"Article 105298"},"PeriodicalIF":4.2,"publicationDate":"2024-10-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142445306","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Parameter efficient finetuning of text-to-image models with trainable self-attention layer
Pub Date: 2024-10-10 | DOI: 10.1016/j.imavis.2024.105296
Zhuoyuan Li, Yi Sun
We propose a novel model to efficiently finetune pretrained text-to-image (T2I) models by introducing additional image prompts. The model integrates information from the image prompts into the T2I diffusion process by locking the parameters of the large T2I model and reusing a trainable copy of it, rather than relying on additional adapters. The trainable copy guides the model by injecting its trainable self-attention features into the original diffusion model, enabling the synthesis of a new specific concept. We also apply Low-Rank Adaptation (LoRA) to restrict the trainable parameters in the self-attention layers. Furthermore, the network is optimized alongside a text embedding that serves as an object identifier to generate contextually relevant visual content. Our model is simple and effective, with a small memory footprint, yet achieves performance comparable to a fully fine-tuned T2I model in both qualitative and quantitative evaluations.
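A minimal sketch of the LoRA component the abstract names explicitly: a frozen linear projection (e.g., a self-attention query/key/value projection in the trainable copy) augmented with a trainable low-rank update. Rank, initialization, and dimensions below are illustrative.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen linear layer with a trainable low-rank update (LoRA).

    Only the rank-r matrices A and B are trained; the base weights stay frozen,
    which keeps the number of trainable parameters and the memory footprint small.
    """
    def __init__(self, base: nn.Linear, rank=4, alpha=1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                       # freeze pretrained weights
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: starts as identity update
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)

layer = LoRALinear(nn.Linear(768, 768), rank=4)           # e.g. an attention projection
out = layer(torch.randn(2, 16, 768))                      # (batch, tokens, features)
```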
{"title":"Parameter efficient finetuning of text-to-image models with trainable self-attention layer","authors":"Zhuoyuan Li, Yi Sun","doi":"10.1016/j.imavis.2024.105296","DOIUrl":"10.1016/j.imavis.2024.105296","url":null,"abstract":"<div><div>We propose a novel model to efficiently finetune pretrained Text-to-Image models by introducing additional image prompts. The model integrates information from image prompts into the text-to-image (T2I) diffusion process by locking the parameters of the large T2I model and reusing its trainable copy, rather than relying on additional adapters. The trainable copy guides the model by injecting its trainable self-attention features into the original diffusion model, enabling the synthesis of a new specific concept. We also apply Low-Rank Adaptation (LoRA) to restrict the trainable parameters in the self-attention layers. Furthermore, the network is optimized alongside a text embedding that serves as an object identifier to generate contextually relevant visual content. Our model is simple and effective, with a small memory footprint, yet can achieve comparable performance to a fully fine-tuned T2I model in both qualitative and quantitative evaluations.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"151 ","pages":"Article 105296"},"PeriodicalIF":4.2,"publicationDate":"2024-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142525913","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Global information regulation network for multimodal sentiment analysis
Pub Date: 2024-10-10 | DOI: 10.1016/j.imavis.2024.105297
Shufan Xie, Qiaohong Chen, Xian Fang, Qi Sun
Human language is multimodal, comprising natural language, visual elements, and acoustic signals. Multimodal Sentiment Analysis (MSA) concentrates on integrating these modalities to capture the sentiment polarity or intensity expressed in human language. Nevertheless, the absence of a comprehensive strategy for processing and integrating multimodal representations allows inaccurate or noisy data from individual modalities to enter the final decision-making process, potentially causing crucial information within or across modalities to be neglected. To address this issue, we propose the Global Information Regulation Network (GIRN), a novel framework designed to regulate information flow and decision-making across all stages, from unimodal feature extraction to multimodal outcome prediction. Specifically, before the modal fusion stage, we maximize the mutual information between modalities and refine the input signals through random feature erasing, yielding more robust unimodal representations. During modal fusion, we enhance the traditional Transformer encoder with a gate mechanism and stacked attention to dynamically fuse the target and auxiliary modalities. After modal fusion, cross-hierarchical contrastive learning and a decision gate are employed to integrate the valuable information represented at different categories and hierarchies. Extensive experiments on the CMU-MOSI and CMU-MOSEI datasets show that our methodology outperforms existing approaches across nearly all criteria.
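The abstract does not state which mutual-information estimator is used before fusion; a common proxy is an InfoNCE-style contrastive objective between paired modality embeddings, sketched below with illustrative names and dimensions.

```python
import torch
import torch.nn.functional as F

def info_nce(text_emb, audio_emb, temperature=0.07):
    """Contrastive (InfoNCE-style) loss between two modalities.

    Matching text/audio pairs in a batch are pulled together and mismatched
    pairs pushed apart, a common proxy for maximizing cross-modal mutual
    information (the paper's exact estimator is not specified in the abstract).
    """
    t = F.normalize(text_emb, dim=-1)
    a = F.normalize(audio_emb, dim=-1)
    logits = t @ a.T / temperature                 # pairwise cosine similarities
    targets = torch.arange(len(t), device=t.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

loss = info_nce(torch.randn(8, 256), torch.randn(8, 256))   # batch of 8 paired embeddings
```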
{"title":"Global information regulation network for multimodal sentiment analysis","authors":"Shufan Xie, Qiaohong Chen, Xian Fang, Qi Sun","doi":"10.1016/j.imavis.2024.105297","DOIUrl":"10.1016/j.imavis.2024.105297","url":null,"abstract":"<div><div>Human language is considered multimodal, containing natural language, visual elements, and acoustic signals. Multimodal Sentiment Analysis (MSA) concentrates on the integration of various modalities to capture the sentiment polarity or intensity expressed in human language. Nevertheless, the absence of a comprehensive strategy for processing and integrating multimodal representations results in the inclusion of inaccurate or noisy data from diverse modalities in the ultimate decision-making process, potentially leading to the neglect of crucial information within or across modalities. To address this issue, we propose the Global Information Regulation Network (GIRN), a novel framework designed to regulate information flow and decision-making processes across various stages, ranging from unimodal feature extraction to multimodal outcome prediction. Specifically, before modal fusion stage, we maximize the mutual information between modalities and refine the input signals through random feature erasing, yielding a more robust unimodal representation. In the process of modal fusion, we enhance the traditional Transformer encoder through the gate mechanism and stacked attention to dynamically fuse the target and auxiliary modalities. After modal fusion, cross-hierarchical contrastive learning and decision gate are employed to integrate the valuable information represented in different categories and hierarchies. Extensive experiments conducted on the CMU-MOSI and CMU-MOSEI datasets suggest that our methodology outperforms existing approaches across nearly all criteria.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"151 ","pages":"Article 105297"},"PeriodicalIF":4.2,"publicationDate":"2024-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142438436","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}