Improving neural ordinary differential equations via knowledge distillation
Haoyu Chu, Shikui Wei, Qiming Lu, Yao Zhao
IET Computer Vision, 18(2), 304-314. DOI: 10.1049/cvi2.12248. Published online: 6 November 2023.

Neural ordinary differential equations (Neural ODEs) construct the continuous dynamics of hidden units using ODEs specified by a neural network, demonstrating promising results on many tasks. However, Neural ODEs still do not perform well on image recognition tasks, possibly because the one-hot encoding vectors commonly used to train them cannot provide enough supervised information. A new training method based on knowledge distillation is proposed to construct more powerful and robust Neural ODEs for image recognition. Specifically, the training of Neural ODEs is modelled as a teacher-student learning process in which ResNets serve as the teacher model and provide richer supervised information. The experimental results show that the new training manner improves the classification accuracy of Neural ODEs by 5.17%, 24.75%, 7.20%, and 8.99% on Street View House Numbers, CIFAR10, CIFAR100, and Food-101, respectively. In addition, the effect of knowledge distillation on the robustness of Neural ODEs against adversarial examples is evaluated. The authors find that incorporating knowledge distillation, coupled with an increase of the time horizon, can significantly enhance the robustness of Neural ODEs. The performance improvement is analysed from the perspective of the underlying dynamical system.
Improved triplet loss for domain adaptation
Xiaoshun Wang, Yunhan Li, Xiangliang Zhang
IET Computer Vision, 18(1), 84-96. DOI: 10.1049/cvi2.12226. Published online: 3 November 2023.

A technique known as domain adaptation is utilised to address classification challenges in an unlabelled target domain by leveraging labelled source domains. Previous domain adaptation approaches have predominantly focussed on global domain adaptation, neglecting class-level information and resulting in suboptimal transfer performance. In recent years, a considerable number of researchers have explored class-level domain adaptation, aiming to precisely align the distributions of diverse domains. Nevertheless, existing research on class-level alignment tends to align domain features on or near classification boundaries, which introduces ambiguous samples that can impact classification accuracy. In this study, the authors propose a novel strategy called class guided constraints (CGC) to tackle this issue. Specifically, CGC is employed to preserve the compactness within classes and the separability between classes of domain features prior to class-level alignment. Furthermore, the authors incorporate CGC in conjunction with a similarity guided constraint. Comprehensive evaluations conducted on four public datasets demonstrate that the authors' approach significantly outperforms numerous state-of-the-art domain adaptation methods and achieves greater improvements over the baseline approach.
A survey on weakly supervised 3D point cloud semantic segmentation
Jingyi Wang, Yu Liu, Hanlin Tan, Maojun Zhang
IET Computer Vision, 18(3), 329-342. DOI: 10.1049/cvi2.12250. Published online: 2 November 2023.

With the popularity and advancement of 3D point cloud data acquisition technologies and sensors, research into 3D point clouds has made considerable strides based on deep learning. The semantic segmentation of point clouds, a crucial step in comprehending 3D scenes, has drawn much attention. The accuracy and effectiveness of fully supervised semantic segmentation have greatly improved with the increase in the number of accessible datasets. However, these achievements rely on time-consuming and expensive full labelling. To address these issues, research on weakly supervised learning has recently surged. These methods train neural networks to tackle 3D semantic segmentation tasks with fewer point labels. In addition to a thorough overview of the history and current state of the art in weakly supervised semantic segmentation of 3D point clouds, a detailed description of the most widely used data acquisition sensors, a list of publicly accessible benchmark datasets, and a look ahead to potential future development directions are provided.
StableNet: Distinguishing the hard samples to overcome language priors in visual question answering
Zhengtao Yu, Jia Zhao, Chenliang Guo, Ying Yang
IET Computer Vision, 18(2), 315-327. DOI: 10.1049/cvi2.12249. Published online: 28 October 2023.

With the booming fields of computer vision and natural language processing, cross-modal intersections such as visual question answering (VQA) have become very popular. However, several studies have shown that many VQA models suffer from severe language prior problems. After a series of experiments, the authors found that previous VQA models are in an unstable state: when training is repeated several times on the same dataset, there are significant differences between the distributions of the predicted answers given by the models each time, and these models also perform unsatisfactorily in terms of accuracy. The reason for this instability is that some difficult samples seriously interfere with model training, so the authors design a method to measure model stability quantitatively and further propose a method that alleviates both the imbalance and the instability. Specifically, question types are classified into simple and difficult ones, and different weighting measures are applied to each. By imposing constraints on the training process for both types of questions, the stability and accuracy of the model are improved. Experimental results demonstrate the effectiveness of the method, which achieves 63.11% on VQA-CP v2 and 75.49% with the addition of a pre-trained model.
RGB depth salient object detection via cross-modal attention and boundary feature guidance
Lingbing Meng, Mengya Yuan, Xuehan Shi, Le Zhang, Qingqing Liu, Dai Ping, Jinhua Wu, Fei Cheng
IET Computer Vision, 18(2), 273-288. DOI: 10.1049/cvi2.12244. Published online: 19 October 2023.

RGB depth (RGB-D) salient object detection (SOD) is a meaningful and challenging task. Convolutional neural networks achieve good detection performance on simple scenes, but they cannot effectively handle scenes in which salient objects have complex contours or are coloured similarly to the background. A novel end-to-end framework is proposed for RGB-D SOD, which comprises four main components: the cross-modal attention feature enhancement (CMAFE) module, the multi-level contextual feature interaction (MLCFI) module, the boundary feature extraction (BFE) module, and the multi-level boundary attention guidance (MLBAG) module. The CMAFE module retains the more effective salient features by employing a dual-attention mechanism to filter noise from the two modalities. In the MLCFI module, a shuffle operation is used for high-level and low-level channels to promote cross-channel information communication, and rich semantic information is extracted. The BFE module converts salient features into boundary features to generate boundary maps. The MLBAG module produces saliency maps by aggregating multi-level boundary saliency maps to guide cross-modal features in the decoding stage. Extensive experiments are conducted on six public benchmark datasets, with the results demonstrating that the proposed model significantly outperforms 23 state-of-the-art RGB-D SOD models with regard to multiple evaluation metrics.
A dense multi-scale context and asymmetric pooling embedding network for smoke segmentation
Gang Wen, Fangrong Zhou, Yutang Ma, Hao Pan, Hao Geng, Jun Cao, Kang Li, Feiniu Yuan
IET Computer Vision, 18(2), 236-246. DOI: 10.1049/cvi2.12246. Published online: 17 October 2023.

It is very challenging to accurately segment smoke images because smoke has adverse visual characteristics, such as anomalous shapes, blurred edges, and translucency. Existing methods cannot fully focus on the texture details of anomalous shapes and blurred edges simultaneously. To solve these problems, a Dense Multi-scale context and Asymmetric pooling Embedding Network (DMAENet) is proposed to model smoke edge details and anomalous shapes for smoke segmentation. To capture feature information at different scales, a Dense Multi-scale Context Module (DMCM) is proposed to further enhance the feature representation capability of the network with the help of asymmetric convolutions. To efficiently extract features for long-shaped objects, the authors use asymmetric pooling to propose an Asymmetric Pooling Enhancement Module (APEM). The vertical and horizontal pooling methods are responsible for enhancing features of irregular objects. Finally, a Feature Fusion Module (FFM) is designed, which accepts three inputs to improve performance. Low-level and high-level features are fused by pixel-wise summation, and the summed feature maps are further enhanced in an attention manner. Experimental results on synthetic and real smoke datasets validate that all these modules improve performance, and the proposed DMAENet clearly outperforms existing state-of-the-art methods.
IoUNet++: Spatial cross-layer interaction-based bounding box regression for visual tracking
Shilei Wang, Yamin Han, Baozhen Sun, Jifeng Ning
IET Computer Vision, 18(1), 177-189. DOI: 10.1049/cvi2.12235. Published online: 16 October 2023.

Accurate target prediction, especially bounding box estimation, is a key problem in visual tracking. Many recently proposed trackers adopt a refinement module, the IoU predictor, which designs a high-level modulation vector to achieve bounding box estimation. However, because it lacks the spatial information that is important for precise box estimation, this simple one-dimensional modulation vector has limited refinement representation capability. In this study, a novel IoU predictor (IoUNet++) is designed to achieve more accurate bounding box estimation by investigating spatial matching with a spatial cross-layer interaction model. Rather than using a one-dimensional modulation vector to generate representations of the candidate bounding box for overlap prediction, this paper first extracts and fuses multi-level features of the target to generate a template kernel with spatial description capability. Then, when aggregating the features of the template and the search region, depthwise separable convolution correlation is adopted to preserve the spatial matching between the target feature and the candidate feature, which gives the IoUNet++ network better template representation and feature fusion than the original network. The proposed IoUNet++ method, which is plug-and-play, is applied to a series of strengthened trackers, including DiMP++, SuperDiMP++ and SuperDIMP_AR++, achieving consistent performance gains. Finally, experiments conducted on six popular tracking benchmarks show that the resulting trackers outperform state-of-the-art trackers with significantly fewer training epochs.
Representation constraint-based dual-channel network for face antispoofing
Zuhe Li, Yuhao Cui, Fengqin Wang, Weihua Liu, Yongshuang Yang, Zeqi Yu, Bin Jiang, Hui Chen
IET Computer Vision, 18(2), 289-303. DOI: 10.1049/cvi2.12245. Published online: 10 October 2023.

Although multimodal face data have obvious advantages in describing live and spoofed features, single-modality face antispoofing technologies are still widely used when it is difficult to obtain multimodal face images or inconvenient to integrate and deploy multimodal sensors. Since the live/spoofed representations in visible light facial images contain considerable interference from face identity information, existing deep learning-based face antispoofing models achieve poor performance when only the visible light modality is used. To address these problems, the authors design a dual-channel network structure and a constrained representation learning method for face antispoofing. First, they design a dual-channel attention mechanism-based grouped convolutional neural network (CNN) to learn important deceptive cues in live and spoofed faces. Second, they design inner contrastive estimation-based representation constraints for both live and spoofed samples to minimise the sample similarity loss, preventing the CNN from learning more facial appearance information. This increases the distance between live and spoofed faces and enhances the network's ability to identify deceptive cues. The evaluation results indicate that the designed framework achieves an average classification error rate (ACER) of 2.37% on the visible light modality subset of the CASIA-SURF dataset and an ACER of 2.4% on the CASIA-SURF CeFA dataset, outperforming existing methods. The proposed method achieves low ACER scores in cross-dataset testing, demonstrating its advantage in domain generalisation.
GFRNet: Rethinking the global contexts extraction in medical images segmentation through matrix factorization and self-attention
Lifang Chen, Shanglai Wang, Li Wan, Jianghu Su, Shunfeng Wang
IET Computer Vision, 18(2), 260-272. DOI: 10.1049/cvi2.12243. Published online: 8 October 2023.

Due to the large fluctuations of lesion boundaries and the internal variations of lesion regions in medical image segmentation, current methods may have difficulty capturing sufficient global context to deal with these inherent challenges, which can lead to discrete segmented masks that undermine segmentation performance. Although self-attention can be implemented to capture long-distance dependencies between pixels, it has the disadvantage of high computational complexity, and the global contexts it extracts are still insufficient. To this end, the authors propose GFRNet, which resorts to the idea of low-rank matrix factorization, forming global contexts locally to obtain global contexts that are entirely different from those extracted by self-attention. The authors effectively integrate the different global contexts extracted by self-attention and low-rank matrix factorization to obtain versatile global contexts. Also, to recover the spatial contexts lost during the matrix factorization process and enhance boundary contexts, the authors propose the Modified Matrix Decomposition module, which employs depth-wise separable convolution and spatial augmentation in the low-rank matrix factorization process. Comprehensive experiments performed on four benchmark datasets show that GFRNet performs better than the relevant CNN- and transformer-based approaches.
Guest Editorial: Spectral imaging powered computer vision
Jun Zhou, Fengchao Xiong, Lei Tong, Naoto Yokoya, Pedram Ghamisi
IET Computer Vision, 17(7), 723-725. DOI: 10.1049/cvi2.12242. Published online: 3 October 2023.

The increasing accessibility and affordability of spectral imaging technology have revolutionised computer vision, allowing for data capture across various wavelengths beyond the visual spectrum. This advancement has greatly enhanced the capabilities of computers and AI systems in observing, understanding, and interacting with the world. Consequently, new datasets in various modalities, such as infrared, ultraviolet, fluorescent, multispectral, and hyperspectral, have been constructed, presenting fresh opportunities for computer vision research and applications.

Although significant progress has been made in processing, learning, and utilising data obtained through spectral imaging technology, several challenges persist in the field of computer vision. These challenges include the presence of low-quality images, sparse input, high-dimensional data, expensive data labelling processes, and a lack of methods to effectively analyse and utilise data considering their unique properties. Many mid-level and high-level computer vision tasks, such as object segmentation, detection and recognition, image retrieval and classification, and video tracking and understanding, still have not leveraged the advantages offered by spectral information. Additionally, the problem of effectively and efficiently fusing data in different modalities to create robust vision systems remains unresolved. Therefore, there is a pressing need for novel computer vision methods and applications to advance this research area. This special issue aims to provide a venue for researchers to present innovative computer vision methods driven by spectral imaging technology.

This special issue received 11 submissions. Among them, five papers have been accepted for publication, indicating their high quality and contribution to spectral imaging powered computer vision. Four papers have been rejected and sent to a transfer service for consideration in other journals or invited for re-submission after revision based on reviewers' feedback.

The accepted papers can be categorised into three main groups based on the type of adopted data, that is, hyperspectral, multispectral, and X-ray images. Hyperspectral images provide material information about the scene and enable fine-grained object class classification. Multispectral images provide high spatial context and information beyond the visible spectrum, such as infrared, providing enriched clues for visual computation. X-ray images can penetrate the surface of objects and provide internal structural information of targets, empowering medical applications such as rib detection, as exemplified by Tsai et al. Below is a brief summary of each paper in this special issue.

Zhong et al. proposed a lightweight criss-cross large kernel (CCLK) convolutional neural network for hyperspectral classification. The key component of this network is a CCLK module, which incorporates large kernels within the 1D convolutional layers and …