Pub Date: 2024-10-06, DOI: 10.1016/j.imavis.2024.105295
Le Quan Nguyen, Jinwoo Choi, L. Minh Dang, Hyeonjoon Moon
In this work, we tackle class incremental learning (CIL) for video action recognition, a relatively under-explored problem despite its practical importance. Directly applying image-based CIL methods does not work well in the video action recognition setting. We hypothesize that the major reason is the spurious correlation between action and background in video action recognition datasets and models. Recent literature shows that this spurious correlation hampers the generalization of models in the conventional action recognition setting. The problem is even more severe in the CIL setting due to the limited exemplars available in the rehearsal memory. We empirically show that mitigating the spurious correlation between action and background is crucial for CIL in video action recognition. We propose to learn background-invariant action representations in the CIL setting by providing training videos with diverse backgrounds generated by background augmentation techniques. We validate the proposed method on public benchmarks: HMDB-51, UCF-101, and Something-Something-v2.
Title: "Background debiased class incremental learning for video action recognition" (Image and Vision Computing, vol. 151, Article 105295)
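The abstract does not specify which background augmentation technique is used; a minimal sketch of one common option, blending each training clip with a background frame drawn from another video, is given below. The function name, mixing weight, and sampling strategy are illustrative assumptions, not the authors' recipe.

```python
import numpy as np

def mix_background(video, background_frames, alpha=0.5, rng=None):
    """Blend every frame of `video` with a randomly chosen background frame.

    video:             (T, H, W, 3) float array in [0, 1], the action clip.
    background_frames: (N, H, W, 3) float array of candidate backgrounds
                       (e.g. frames taken from other videos in the batch).
    alpha:             mixing weight for the original frame (an assumption).
    """
    rng = rng or np.random.default_rng()
    bg = background_frames[rng.integers(len(background_frames))]
    # Convex combination keeps pixel values in [0, 1]; the action content is
    # preserved while the background statistics are perturbed.
    return alpha * video + (1.0 - alpha) * bg[None, ...]

# Usage: diversify the background of a rehearsal-memory clip.
clip = np.random.rand(16, 112, 112, 3)
backgrounds = np.random.rand(8, 112, 112, 3)
augmented = mix_background(clip, backgrounds, alpha=0.5)
```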
Pub Date: 2024-10-05, DOI: 10.1016/j.imavis.2024.105294
Zhongyu Zhang, Shujun Liu, Yingxiang Qin, Huajun Wang
Remote sensing image change detection is crucial for natural disaster monitoring and land use change analysis. As the resolution increases, the scenes covered by remote sensing images become more complex, and traditional methods have difficulty extracting detailed information. With the development of deep learning, the field of change detection has new opportunities. However, existing algorithms mainly focus on difference analysis between bi-temporal images while ignoring the semantic information between images, so global and local information cannot interact effectively. In this paper, we introduce a new transformer-based multilevel attention network (MATNet), which is capable of extracting multilevel features of global and local information, enabling information interaction and fusion, and thus modeling the global context more effectively. Specifically, we extract multilevel semantic features through the Transformer encoder and utilize the Feature Enhancement Module (FEM) to perform feature summing and differencing on the multilevel features in order to better extract local detail information and thus better detect changes in small regions. In addition, we employ a multilevel attention decoder (MAD) to obtain information in the spatial and spectral dimensions, which effectively fuses global and local information. In experiments, our method performs excellently on the CDD, DSIFN-CD, LEVIR-CD, and SYSU-CD datasets, with F1 scores of 95.67%/87.75%/90.94%/86.82% and overall accuracy (OA) of 98.95%/95.93%/99.11%/90.53%, respectively.
Title: "MATNet: Multilevel attention-based transformers for change detection in remote sensing images" (Image and Vision Computing, vol. 151, Article 105294)
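The abstract states that the FEM performs feature summing and differencing on the bi-temporal features but does not give its exact architecture; the following is a minimal sketch under that description only, with the 1x1 fusion convolution and all layer choices being assumptions.

```python
import torch
import torch.nn as nn

class FeatureEnhancement(nn.Module):
    """Sketch: fuse bi-temporal features via their sum and absolute difference."""
    def __init__(self, channels):
        super().__init__()
        # 1x1 convolution to fuse the concatenated sum/difference maps.
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, feat_t1, feat_t2):
        added = feat_t1 + feat_t2            # content shared by the two dates
        diff = torch.abs(feat_t1 - feat_t2)  # content that changed between dates
        return self.fuse(torch.cat([added, diff], dim=1))

# Usage on one level of the multilevel features (B, C, H, W).
fem = FeatureEnhancement(channels=64)
out = fem(torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32))
```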
Pub Date: 2024-09-30, DOI: 10.1016/j.imavis.2024.105293
Fariba Lotfi, Mansour Jamzad, Hamid Beigy, Helia Farhood, Quan Z. Sheng, Amin Beheshti
Automatic image annotation (AIA) is a fundamental and challenging task in computer vision. Considering the correlations between tags can lead to more accurate image understanding, benefiting various applications, including image retrieval and visual search. While many attempts have been made to incorporate tag correlations in annotation models, the method of constructing a knowledge graph based on external knowledge sources and hyperbolic space has not been explored. In this paper, we create an attributed knowledge graph based on vocabulary, integrate external knowledge sources such as WordNet, and utilize hyperbolic word embeddings for the tag representations. These embeddings provide a sophisticated tag representation that captures hierarchical and complex correlations more effectively, enhancing the image annotation results. In addition, leveraging external knowledge sources enhances contextuality and significantly enriches existing AIA datasets. We exploit two deep learning-based models, the Relational Graph Convolutional Network (R-GCN) and the Vision Transformer (ViT), to extract the input features. We apply two R-GCN operations to obtain word descriptors and fuse them with the extracted visual features. We evaluate the proposed approach using three public benchmark datasets. Our experimental results demonstrate that the proposed architecture achieves state-of-the-art performance across most metrics on Corel5k, ESP Game, and IAPRTC-12.
Title: "Knowledge graph construction in hyperbolic space for automatic image annotation" (Image and Vision Computing, vol. 151, Article 105293)
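For readers unfamiliar with hyperbolic word embeddings, the standard distance on the Poincaré ball, which is what makes hierarchical tag relations cheap to encode, can be computed as below; the example tag vectors are purely illustrative and not from the paper.

```python
import numpy as np

def poincare_distance(u, v, eps=1e-9):
    """Distance between two points inside the unit Poincare ball (||x|| < 1)."""
    sq_diff = np.sum((u - v) ** 2)
    sq_u = np.sum(u ** 2)
    sq_v = np.sum(v ** 2)
    x = 1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v) + eps)
    return np.arccosh(x)

# Tags near the origin behave like general concepts; tags pushed towards the
# boundary behave like specific ones, so hierarchy is encoded by the norm.
animal = np.array([0.05, 0.02])
dog = np.array([0.60, 0.30])
poodle = np.array([0.82, 0.41])
print(poincare_distance(animal, dog), poincare_distance(dog, poodle))
```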
Pub Date: 2024-09-29, DOI: 10.1016/j.imavis.2024.105291
Jiangxiao Han, Shikang Wang, Xianbo Deng, Wenyu Liu
Mitosis detection poses a significant challenge in medical image analysis, primarily due to the substantial variability in the appearance and shape of mitotic targets. This paper introduces an efficient and accurate mitosis detection framework, which stands apart from previous mitosis detection techniques with its two key features: Single-Level Feature (SLF) for bounding box prediction and Dense-Sparse Hybrid Label Assignment (HLA) for bounding box matching. The SLF component of our method employs a multi-scale Transformer backbone to capture the global context and morphological characteristics of both mitotic and non-mitotic cells. This information is then consolidated into a single-scale feature map, thereby enhancing the model's receptive field and reducing redundant detection across various feature maps. In the HLA component, we propose a hybrid label assignment strategy to facilitate the model's adaptation to mitotic cells of different shapes and positions during training, thereby improving the model's adaptability to diverse cell morphologies. Our method has been tested on the largest mitosis detection datasets and achieves state-of-the-art (SOTA) performance, with an F1 score of 0.782 on the TUPAC 16 benchmark, and 0.792 with test time augmentation (TTA). Our method also exhibits superior accuracy and faster processing speed compared to previous methods. The source code and pretrained models will be released to facilitate related research.
Title: "High-performance mitosis detection using single-level feature and hybrid label assignment" (Image and Vision Computing, vol. 151, Article 105291)
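The abstract says the multi-scale Transformer features are consolidated into a single-scale feature map but does not describe the fusion operator; a hedged sketch using per-scale 1x1 projections followed by resize-and-sum is shown below, with all channel sizes and the fusion rule being assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SingleLevelFusion(nn.Module):
    """Sketch: collapse multi-scale backbone features onto one target scale."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        # One 1x1 projection per input scale so all maps share `out_channels`.
        self.proj = nn.ModuleList(
            [nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels]
        )

    def forward(self, features, target_size):
        fused = 0
        for proj, feat in zip(self.proj, features):
            x = proj(feat)
            # Resize every scale to the single target resolution and sum.
            fused = fused + F.interpolate(
                x, size=target_size, mode="bilinear", align_corners=False
            )
        return fused

# Usage: three backbone stages at strides 8/16/32 fused onto the stride-8 grid.
feats = [torch.randn(1, 96, 64, 64),
         torch.randn(1, 192, 32, 32),
         torch.randn(1, 384, 16, 16)]
slf = SingleLevelFusion(in_channels=[96, 192, 384], out_channels=256)
single_map = slf(feats, target_size=(64, 64))
```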
Pub Date: 2024-09-28, DOI: 10.1016/j.imavis.2024.105284
Jainy Sachdeva PhD, Puneet Mishra, Deeksha Katoch
Problem
Eye fundus imaging is used for early diagnosis of severely damaging conditions such as diabetic retinopathy, retinal detachments, and vascular occlusions. However, the presence of noise, low contrast between background and vasculature during imaging, and vessel morphology lead to uncertain vessel segmentation.
Aim
This paper proposes a novel retinal blood vessel segmentation method for fundus imaging using a Difference of Gaussian (DoG) filter and an ensemble of three fully convolutional neural network (FCNN) models.
Methods
A Gaussian filter with standard deviation σ1 is applied to the preprocessed grayscale fundus image and is subtracted from a Gaussian filter with standard deviation σ2 applied to the same image. The resultant image is then fed into each of the three fully convolutional neural networks as the input. The FCNN models' outputs are then passed through a voting classifier, and a final segmented vessel structure is obtained. The Difference of Gaussian filter plays an essential part in removing high-frequency details (noise) and thus finely extracts the blood vessels from the retinal fundus despite underlying artifacts.
Results
The total dataset consists of 3832 augmented images transformed from 479 fundus images. The results show that the proposed method performs extremely well, achieving an accuracy of 96.50%, 97.69%, and 95.78% on the DRIVE, CHASE, and real-time clinical datasets, respectively.
Conclusion
The FCNN ensemble model has demonstrated efficacy in precisely detecting retinal vessels, even in the presence of various pathologies and vasculatures.
Title: "Diabetic retinopathy data augmentation and vessel segmentation through deep learning based three fully convolution neural networks" (Image and Vision Computing, vol. 151, Article 105284)
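A minimal sketch of the preprocessing and ensembling steps described in the Methods section is given below; the specific sigma values, the binarization threshold, and the majority-vote rule are assumptions, since the text only states that two Gaussian standard deviations and a voting classifier are used.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def difference_of_gaussian(gray, sigma1=1.0, sigma2=2.0):
    """DoG preprocessing: subtract one Gaussian-blurred image from another.
    The sigma values here are illustrative; the paper only says two different
    standard deviations are used."""
    return gaussian_filter(gray, sigma1) - gaussian_filter(gray, sigma2)

def majority_vote(masks):
    """Combine binary vessel masks from the three FCNNs by majority voting."""
    stacked = np.stack(masks, axis=0)            # (3, H, W) of {0, 1}
    return (stacked.sum(axis=0) >= 2).astype(np.uint8)

# Usage with a dummy grayscale fundus image; the three random masks stand in
# for the thresholded outputs of the three trained FCNNs.
gray = np.random.rand(256, 256)
dog_input = difference_of_gaussian(gray)         # image fed to each FCNN
fcnn_masks = [(np.random.rand(256, 256) > 0.5).astype(np.uint8) for _ in range(3)]
vessel_mask = majority_vote(fcnn_masks)
```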
Pub Date: 2024-09-28, DOI: 10.1016/j.imavis.2024.105286
Xiao Zhou, Xiaogang Peng, Hao Wen, Yikai Luo, Keyang Yu, Ping Yang, Zizhao Wu
In recent years, the task of weakly supervised audio-visual violence detection has gained considerable attention. The goal of this task is to identify violent segments within multimodal data based on video-level labels. Despite advances in this field, traditional Euclidean neural networks, which have been used in prior research, encounter difficulties in capturing highly discriminative representations due to limitations of the feature space. To overcome this, we propose HyperVD, a novel framework that learns snippet embeddings in hyperbolic space to improve model discrimination. We contribute two branches of fully hyperbolic graph convolutional networks that excavate feature similarities and temporal relationships among snippets in hyperbolic space. By learning snippet representations in this space, the framework effectively learns semantic discrepancies between violent snippets and normal ones. Extensive experiments on the XD-Violence benchmark demonstrate that our method achieves 85.67% AP, outperforming the state-of-the-art methods by a sizable margin.
Title: "Learning weakly supervised audio-visual violence detection in hyperbolic space" (Image and Vision Computing, vol. 151, Article 105286)
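The abstract does not detail how Euclidean snippet features enter hyperbolic space; one common choice, and only an assumption here, is the exponential map at the origin of the Poincaré ball, sketched below before the hyperbolic graph-convolution branches.

```python
import torch

def expmap0(v, c=1.0, eps=1e-7):
    """Exponential map at the origin of the Poincare ball with curvature -c.

    Lifts Euclidean snippet features v of shape (..., D) onto the ball so that
    hyperbolic graph-convolution layers can operate on them.
    """
    sqrt_c = c ** 0.5
    norm = v.norm(dim=-1, keepdim=True).clamp_min(eps)
    return torch.tanh(sqrt_c * norm) * v / (sqrt_c * norm)

# Usage: project per-snippet audio-visual features before the hyperbolic GCNs.
snippet_feats = torch.randn(32, 128)           # 32 snippets, 128-D features
ball_feats = expmap0(snippet_feats, c=1.0)
assert ball_feats.norm(dim=-1).max() < 1.0     # all points lie inside the unit ball
```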
Pub Date: 2024-09-27, DOI: 10.1016/j.imavis.2024.105292
Vivek Noel Soren, H.S. Prajwal, Vaanathi Sundaresan
Diabetic retinopathy (DR), a progressive condition due to diabetes that can lead to blindness, is typically characterized by a number of stages, including non-proliferative (mild, moderate, and severe) and proliferative DR. These stages are marked by various vascular abnormalities, such as intraretinal microvascular abnormalities (IRMA), neovascularization (NV), and non-perfusion areas (NPA). Automated detection of these abnormalities and grading of DR severity are crucial for computer-aided diagnosis. Ultra-wide optical coherence tomography angiography (UW-OCTA) images, a type of retinal imaging, are particularly well suited for analyzing vascular abnormalities because the abnormalities are prominent on these images. However, accurate detection of abnormalities and subsequent grading of DR is quite challenging due to noisy data, the presence of artifacts, poor contrast, and the subtle nature of the abnormalities. In this work, we aim to develop an automated method for accurate grading of DR severity on UW-OCTA images. Our method consists of several components: UW-OCTA scan quality assessment, segmentation of vascular abnormalities, and grading of the scans for DR severity. Applied to publicly available data from the Diabetic Retinopathy Analysis Challenge (DRAC 2022), our method shows promising results, with Dice overlap and recall values of 0.88 for abnormality segmentation and a coefficient-of-agreement (κ) value of 0.873 for DR grading. We also performed a radiomics analysis and observed that the radiomics features differ significantly across increasing levels of DR severity. This suggests that radiomics could be used for multimodal grading and further analysis of DR, indicating its potential scope in this area.
Title: "Automated grading of diabetic retinopathy and Radiomics analysis on ultra-wide optical coherence tomography angiography scans" (Image and Vision Computing, vol. 151, Article 105292)
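The reported coefficient-of-agreement (κ) for ordinal DR grades is typically a quadratic-weighted Cohen's kappa; assuming that convention holds here, it can be computed as follows, with the grade values below being dummy data rather than results from the paper.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical reference vs. predicted DR grades on a three-level ordinal
# severity scale; the values are illustrative only.
y_true = [0, 0, 1, 2, 1, 2, 0, 1, 2, 2]
y_pred = [0, 1, 1, 2, 1, 2, 0, 1, 1, 2]

# Quadratic weighting penalises a two-grade error more than a one-grade error,
# which suits ordinal severity scales.
kappa = cohen_kappa_score(y_true, y_pred, weights="quadratic")
print(f"quadratic-weighted kappa = {kappa:.3f}")
```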
Pub Date: 2024-09-27, DOI: 10.1016/j.imavis.2024.105288
Neela Rahimi, Ming Shao
Learning from continuously evolving data is critical in real-world applications. This type of learning, known as Continual Learning (CL), aims to assimilate new information without compromising performance on prior knowledge. However, learning new information leads to a bias in the network towards recent observations, resulting in a phenomenon known as catastrophic forgetting. The complexity increases in Online Continual Learning (OCL) scenarios, where models are allowed only a single pass over the data. Existing OCL approaches that rely on replaying exemplar sets are memory-intensive for large-scale datasets and also raise security concerns. While recent dynamic network models address memory concerns, they often present computationally demanding, over-parameterized solutions with limited generalizability. To address this longstanding problem, we propose a novel OCL approach termed "Bias Robust online Continual Learning (BRCL)." BRCL retains all intermediate models generated during training. These models inherently exhibit a preference for recently learned classes. To leverage this property for enhanced performance, we devise a strategy we describe as 'utilizing bias to counteract bias.' This method involves the development of an inference function that capitalizes on the inherent bias of each model towards its recent tasks. Furthermore, we integrate a model consolidation technique that aligns the first layers of these models, particularly focusing on similar feature representations. This process effectively reduces the memory requirement, ensuring a low memory footprint. Although the methodology is simple, which makes it readily extensible to various frameworks, extensive experiments reveal a notable performance edge over leading methods on key benchmarks, bringing continual learning closer to matching offline training. (Source code will be made publicly available upon the publication of this paper.)
Title: "Utilizing Inherent Bias for Memory Efficient Continual Learning: A Simple and Robust Baseline" (Image and Vision Computing, vol. 151, Article 105288)
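The abstract describes an inference function that exploits each retained checkpoint's bias towards its most recently learned classes but does not define it; the sketch below encodes one plausible reading, scoring each task's classes with the checkpoint saved right after that task, and should be read as an assumption rather than the paper's actual rule.

```python
import torch

def biased_ensemble_inference(models, task_classes, x):
    """Sketch of a bias-aware inference rule (an assumption, not BRCL's exact
    function): score the classes of task t with the checkpoint saved right
    after task t, i.e. the model most biased towards those classes.

    models:       list of retained checkpoints, one per task; checkpoint t maps
                  x -> logits over all classes seen up to task t.
    task_classes: list of class-index lists, task_classes[t] giving the
                  classes introduced in task t.
    """
    num_classes = sum(len(c) for c in task_classes)
    scores = torch.full((x.shape[0], num_classes), float("-inf"))
    for model, classes in zip(models, task_classes):
        with torch.no_grad():
            logits = model(x)
        # Keep only the scores for the classes this checkpoint just learned.
        scores[:, classes] = logits[:, classes]
    return scores.argmax(dim=1)
```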
Pub Date: 2024-09-26, DOI: 10.1016/j.imavis.2024.105290
Jiahong Jiang, Nan Xia
Human pose estimation is an important technique in computer vision. Existing methods perform well in ideal environments, but there is room for improvement in occluded environments. The specific reasons are that the ambiguity of features in the occluded area leads the network to pay insufficient attention to it, and that the limited expressive ability of features in the occluded part cannot describe the true keypoint features. To address the occlusion issue, we propose a dual-channel network based on occlusion feature compensation. The dual channels are an occlusion area enhancement channel based on convolution and an occlusion feature compensation channel based on graph convolution. In the convolution channel, we propose an occlusion handling enhanced attention mechanism (OHE-attention) to improve attention to the occluded area. In the graph convolution channel, we propose a node feature compensation module that eliminates obstacle features and integrates the shared and private attributes of the keypoints to improve the expressive ability of the node features. We conduct experiments on the COCO2017, COCO-WholeBody, and CrowdPose datasets, achieving accuracies of 78.7%, 66.4%, and 77.9%, respectively. In addition, a series of ablation experiments and visualization demonstrations verify the performance of the dual-channel network in occluded environments.
Title: "A dual-channel network based on occlusion feature compensation for human pose estimation" (Image and Vision Computing, vol. 151, Article 105290)
Pub Date: 2024-09-26, DOI: 10.1016/j.imavis.2024.105289
Xiaoqiang Li, Kaiyuan Wu, Shaohua Zhang
Despite great efforts in recent years to develop robust facial landmark localization methods, occlusion remains a challenge. To tackle this challenge, we propose a model called the Landmark-in-Facial-Component Network (LFCNet). Unlike mainstream models that focus on boundary information, LFCNet utilizes the strong structural constraints inherent in facial anatomy to address occlusion. Specifically, two key modules are designed: a component localization module and an offset localization module. After grouping landmarks based on facial components, the component localization module accomplishes coarse localization of the facial components. The offset localization module then performs fine localization of landmarks based on the coarse localization results, which can also be seen as delineating the shape of the facial components. These two modules form a coarse-to-fine localization pipeline and enable LFCNet to better learn the shape constraints of human faces, thereby enhancing its robustness to occlusion. LFCNet achieves 4.82% normalized mean error on the occlusion subset of the WFLW dataset and 6.33% normalized mean error on the Masked 300W dataset. The results demonstrate that LFCNet achieves excellent performance in comparison with state-of-the-art methods, especially on occlusion datasets.
Title: "Landmark-in-facial-component: Towards occlusion-robust facial landmark localization" (Image and Vision Computing, vol. 151, Article 105289)
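The coarse-to-fine pipeline suggests that each landmark can be composed as its facial component's coarse location plus a landmark-specific offset; the sketch below assumes exactly that composition, with the grouping, array shapes, and values being illustrative rather than taken from the paper.

```python
import numpy as np

def assemble_landmarks(component_centers, offsets, landmark_to_component):
    """Sketch of the assumed coarse-to-fine composition: each landmark is the
    coarse centre of its facial component plus an offset predicted by the
    offset localization module.

    component_centers:     (C, 2) coarse (x, y) per facial component.
    offsets:               (K, 2) fine offsets, one per landmark.
    landmark_to_component: length-K array mapping each landmark to its component.
    """
    return component_centers[landmark_to_component] + offsets

# Usage: 5 facial components and 10 landmarks grouped onto them.
centers = np.random.rand(5, 2) * 256
offsets = np.random.randn(10, 2) * 4.0
mapping = np.array([0, 0, 1, 1, 2, 3, 3, 4, 4, 4])
landmarks = assemble_landmarks(centers, offsets, mapping)   # (10, 2)
```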