Zero-shot temporal event localisation: Label-free, training-free, domain-free
Li Sun, Ping Wang, Liuan Wang, Jun Sun, Takayuki Okatani
IET Computer Vision, 17(5), 599–613. Published 3 August 2023. DOI: 10.1049/cvi2.12224

Temporal event localisation (TEL) has recently attracted increasing attention due to the rapid development of video platforms. Existing methods are based on fully/weakly supervised or unsupervised learning, and thus rely on expensive data annotation and time-consuming training. Moreover, because these models are trained on data from a specific domain, they generalise poorly under data distribution shifts. To cope with these difficulties, the authors propose a zero-shot TEL method that operates without training data or annotations. Leveraging large-scale vision-and-language pre-trained models such as CLIP, the authors address two key problems: (1) how to find the relevant region where the event is likely to occur; and (2) how to determine the event duration once that region is found. A query-guided optimisation for local frame relevance, based on the query-to-frame relationship, is proposed to find the frame region where the event is most likely to occur, and a proposal generation method based on the frame-to-frame relationship is proposed to determine the event duration. A greedy event sampling strategy is also proposed to predict multiple high-reliability durations for a given event. The methodology is label-free, training-free and domain-free, enabling TEL to be applied purely at the testing stage. Experimental results show competitive performance on the standard Charades-STA and ActivityCaptions datasets.
Improving object detection by enhancing the effect of localisation quality evaluation on detection confidence
Zuyi Wang, Wei Zhao, Li Xu
IET Computer Vision, 18(1), 97–109. Published 31 July 2023. DOI: 10.1049/cvi2.12227

The one-stage object detector has been widely applied in many computer vision applications due to its high detection efficiency and simple framework. However, one-stage detectors rely heavily on non-maximum suppression to remove duplicate predictions for the same object, and they produce a detection confidence to measure the quality of those predictions. Localisation quality is an important factor in evaluating predicted bounding boxes, but its role has not been fully exploited in previous works. To alleviate this problem, the authors design the Quality Prediction Block (QPB), a lightweight sub-network that strengthens the effect of localisation quality evaluation on detection confidence by leveraging the features of the predicted bounding boxes. The QPB is simple in structure and applies to different forms of detection confidence. Extensive experiments are conducted on the public benchmarks MS COCO, PASCAL VOC and Berkeley DeepDrive. The results demonstrate the effectiveness of the method in detectors with various forms of detection confidence, and the proposed approach also achieves better performance in stronger one-stage detectors.
Lite-weight semantic segmentation with AG self-attention
Bing Liu, Yansheng Gao, Hai Li, Zhaohao Zhong, Hongwei Zhao
IET Computer Vision, 18(1), 72–83. Published 28 July 2023. DOI: 10.1049/cvi2.12225

Due to the large computational and GPU memory cost of semantic segmentation, some works focus on designing lightweight models that achieve a good trade-off between computational cost and accuracy. A common approach is to combine a CNN with a vision transformer. However, these methods ignore the contextual information of multiple receptive fields, and they often fail to recover the detailed information lost when multi-scale features are downsampled. To fix these issues, we propose AG Self-Attention, which consists of Enhanced Atrous Self-Attention (EASA) and Gate Attention. AG Self-Attention adds the contextual information of multiple receptive fields to the global semantic feature. Specifically, EASA uses weight-shared atrous convolutions with different atrous rates to capture contextual information under different receptive fields, while Gate Attention introduces a gating mechanism that injects detailed information into the global semantic feature and filters it by producing a “fusion” gate and an “update” gate. To validate this design, we conduct extensive experiments on common semantic segmentation datasets, namely ADE20K, COCO-Stuff, PASCAL Context and Cityscapes, showing that our method achieves state-of-the-art performance and a good trade-off between computational cost and accuracy.
LiteCCLKNet: A lightweight criss-cross large kernel convolutional neural network for hyperspectral image classification
Chengcheng Zhong, Na Gong, Zitong Zhang, Yanan Jiang, Kai Zhang
IET Computer Vision, 17(7), 763–776. Published 24 July 2023. DOI: 10.1049/cvi2.12218

High-performance convolutional neural networks (CNNs) stack many convolutional layers to obtain powerful feature extraction capability, which leads to huge storage and computational costs. The authors focus on lightweight models for hyperspectral image (HSI) classification and propose a novel lightweight criss-cross large kernel convolutional neural network (LiteCCLKNet). Specifically, a lightweight module containing two 1D convolutions with self-attention mechanisms in orthogonal directions is presented. By setting large kernels within the 1D convolutional layers, the proposed module can efficiently aggregate long-range contextual features. In addition, a global receptive field is obtained by stacking only two of the proposed modules. Compared with traditional lightweight CNNs, LiteCCLKNet reduces the number of parameters for easy deployment to resource-limited platforms. Experimental results on three HSI datasets demonstrate that LiteCCLKNet outperforms previous lightweight CNNs and has higher storage efficiency.
Dynamic facial expression recognition with pseudo-label guided multi-modal pre-training
Bing Yin, Shi Yin, Cong Liu, Yanyong Zhang, Changfeng Xi, Baocai Yin, Zhenhua Ling
IET Computer Vision, 18(1), 33–45. Published 21 July 2023. DOI: 10.1049/cvi2.12217

Due to the huge cost of manual annotation, labelled data may not be sufficient to train a dynamic facial expression (DFR) recogniser with good performance. To address this, the authors propose a multi-modal pre-training method with a pseudo-label guidance mechanism that makes full use of unlabelled video data to learn informative representations of facial expressions. First, the authors build a pre-training dataset of videos with aligned vision and audio modalities. Second, the vision and audio feature encoders are trained through an instance discrimination strategy and a cross-modal alignment strategy on the pre-training data. Third, the vision feature encoder is extended into a dynamic expression recogniser and fine-tuned on the labelled training data. Fourth, the fine-tuned recogniser is used to predict pseudo-labels for the pre-training data, and a new pre-training phase is started under the guidance of these pseudo-labels to alleviate the long-tail distribution problem and the instance-class conflict. Fifth, since the representations learnt under pseudo-label guidance are more informative, a further fine-tuning phase is added to boost generalisation on the DFR recognition task. Experimental results on the Dynamic Facial Expression in the Wild dataset demonstrate the superiority of the proposed method.
Position-aware spatio-temporal graph convolutional networks for skeleton-based action recognition
Ping Yang, Qin Wang, Hao Chen, Zizhao Wu
IET Computer Vision, 17(7), 844–854. Published 13 July 2023. DOI: 10.1049/cvi2.12223

Graph Convolutional Networks (GCNs) have been widely used in skeleton-based action recognition. Although significant performance has been achieved, it is still challenging to effectively model the complex dynamics of skeleton sequences. A novel position-aware spatio-temporal GCN for skeleton-based action recognition is proposed, in which positional encoding is investigated to enhance the capacity of typical baselines to comprehend the dynamic characteristics of an action sequence. Specifically, the authors’ method systematically investigates temporal position encoding and spatial position embedding to explicitly capture the sequence-ordering information and the identity information of the nodes used in the graphs. Additionally, to alleviate the redundancy and over-smoothing problems of typical GCNs, the method further investigates a subgraph mask, which mines the prominent subgraph patterns over the underlying graph and makes the model robust to the impact of irrelevant joints. Extensive experiments on three large-scale datasets demonstrate that the model achieves competitive results compared with previous state-of-the-art methods.
A point-image fusion network for event-based frame interpolation
Chushu Zhang, Wei An, Ye Zhang, Miao Li
IET Computer Vision, 18(4), 439–447. Published 10 July 2023. DOI: 10.1049/cvi2.12220

Temporal information in event streams plays a critical role in event-based video frame interpolation, as it provides temporal context cues complementary to images. Most previous event-based methods first transform the unstructured event data into structured data formats through voxelisation and then employ advanced CNNs to extract temporal information. However, voxelisation inevitably leads to information loss, and processing the sparse voxels introduces severe computational redundancy. To address these limitations, this study proposes a point-image fusion network (PIFNet). In PIFNet, rich temporal information from the events can be directly extracted at the point level. A fusion module is then designed to fuse complementary cues from both points and images for frame interpolation. Extensive experiments on both synthetic and real datasets demonstrate that PIFNet achieves state-of-the-art performance with high efficiency.
Enhancing human parsing with region-level learning
Yanghong Zhou, P. Y. Mok
IET Computer Vision, 18(1), 60–71. Published 5 July 2023. DOI: 10.1049/cvi2.12222

Human parsing is very important in a diverse range of industrial applications. Despite the considerable progress that has been achieved, the performance of existing methods is still less than satisfactory, since these methods learn the shared features of various parsing labels at the image level. This limits the representativeness of the learnt features, especially when the distribution of parsing labels is imbalanced or the scale of different labels differs substantially. To address this limitation, a Region-level Parsing Refiner (RPR) is proposed to enhance parsing performance by introducing region-level parsing learning. Region-level parsing focuses specifically on small regions of the body, for example, the head. The proposed RPR is an adaptive module that can be integrated with different existing human parsing models to improve their performance. Extensive experiments are conducted on two benchmark datasets, and the results demonstrate the effectiveness of the RPR model in improving overall parsing performance as well as the parsing of rare labels. The method has been successfully applied in a commercial application for the extraction of human body measurements and is used in various online shopping platforms for clothing size recommendations. The code and dataset are released at https://github.com/applezhouyp/PRP.
CAGAN: Classifier-augmented generative adversarial networks for weakly-supervised COVID-19 lung lesion localisation
Xiaojie Li, Xin Fei, Zhe Yan, Hongping Ren, Canghong Shi, Xian Zhang, Imran Mumtaz, Yong Luo, Xi Wu
IET Computer Vision, 18(1), 1–14. Published 3 July 2023. DOI: 10.1049/cvi2.12216

The Coronavirus Disease 2019 (COVID-19) epidemic has constituted a Public Health Emergency of International Concern. Chest computed tomography (CT) can reveal abnormalities indicative of lung disease at an early stage, so accurate and automatic localisation of lung lesions is particularly important to assist physicians in the rapid diagnosis of COVID-19 patients. The authors propose a classifier-augmented generative adversarial network framework for weakly supervised COVID-19 lung lesion localisation. It consists of an abnormality map generator, a discriminator and a classifier. The generator produces an abnormality feature map M to locate lesion regions and then constructs images of pseudo-healthy subjects by adding M to the input patient images. Besides using the discriminator to constrain the generated healthy-subject images towards the real distribution, a pre-trained classifier is introduced to encourage the generated images to share high-level semantic feature representations with those of real healthy people. Moreover, an attention gate is employed in the generator to reduce the noise effect in the irrelevant regions of M. Experimental results on the COVID-19 CT dataset show that the method captures more lesion areas while generating less noise in unrelated areas, and that it has significant advantages over existing methods in terms of quantitative and qualitative results.
Mirror complementary transformer network for RGB-thermal salient object detection
Xiurong Jiang, Yifan Hou, Hui Tian, Lin Zhu
IET Computer Vision, 18(1), 15–32. Published 28 June 2023. DOI: 10.1049/cvi2.12221

Conventional RGB-T salient object detection (SOD) treats the RGB and thermal modalities equally to locate the common salient regions. However, the authors observed that the rich colour and texture information of the RGB modality makes objects more prominent against the background, whereas the thermal modality records the temperature differences of the scene, so objects usually contain clear and continuous edge information. In this work, a novel mirror-complementary Transformer network (MCNet) is proposed for RGB-T SOD, which supervises the two modalities separately with a complementary set of saliency labels under a symmetrical structure. Moreover, attention-based feature interaction and serial multiscale dilated convolution (SDC)-based feature fusion modules are introduced to let the two modalities complement and adjust each other flexibly. When one modality fails, the proposed model can still accurately segment the salient regions. To demonstrate the robustness of the proposed model in challenging real-world scenes, the authors build a novel RGB-T SOD dataset, VT723, based on a large public semantic segmentation RGB-T dataset used in the autonomous driving domain. Extensive experiments on benchmark datasets and VT723 show that the proposed method outperforms state-of-the-art approaches, including CNN-based and Transformer-based methods. The code and dataset can be found at https://github.com/jxr326/SwinMCNet.