Line segment detectors with deformable attention
Pub Date: 2024-06-07 DOI: 10.1016/j.image.2024.117155
Shoufeng Tang , Shuo Zhou , Xiamin Tong , Jingyuan Gao , Yuhao Wang , Xiaojun Ma
Transformer-based object detectors are advancing rapidly; by contrast, the development of line segment detectors is relatively slow. It is noteworthy that objects and line segments are both 2D targets. In this work, we design a line segment detection algorithm based on deformable attention. Leveraging this algorithm and the line segment loss function, we transform the object detectors Deformable DETR and ViDT into end-to-end line segment detectors named Deformable LETR and ViDTLE, respectively. To adapt the idea of sparse modeling to line segment detection, we propose a new attention mechanism named line segment deformable attention (LSDA). This mechanism focuses on valuable positions under the guidance of a reference line to refine line segments. We design an auxiliary algorithm named line segment iterative refinement for LSDA. With as few modifications as possible, we transform two further object detectors, namely SMCA DETR and PnP DETR, into competitive line segment detectors named SMCA LETR and PnP LETR, respectively. The experimental results show that the proposed methods are efficient.
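To make the idea of line-segment-guided deformable attention concrete, the sketch below shows one way such a module could be written in PyTorch: K sampling points are placed along each reference line, shifted by learned offsets, and the sampled features are combined with learned attention weights. All module and parameter names are assumptions for illustration; this is not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LineSegmentDeformableAttention(nn.Module):
    """Illustrative sketch: attend to K sampling points placed along a
    reference line segment, each shifted by a learned offset (names and
    layout are assumptions, not the paper's code)."""

    def __init__(self, dim, num_points=4):
        super().__init__()
        self.num_points = num_points
        self.offset_head = nn.Linear(dim, num_points * 2)   # per-point (dx, dy)
        self.weight_head = nn.Linear(dim, num_points)       # per-point attention weight
        self.value_proj = nn.Conv2d(dim, dim, 1)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, query, ref_lines, feat):
        # query:     (B, Q, C)  line-segment queries
        # ref_lines: (B, Q, 4)  endpoints (x1, y1, x2, y2) normalized to [0, 1]
        # feat:      (B, C, H, W) image feature map
        B, Q, C = query.shape
        K = self.num_points
        t = torch.linspace(0, 1, K, device=query.device).view(1, 1, K, 1)
        p0 = ref_lines[..., :2].unsqueeze(2)
        p1 = ref_lines[..., 2:].unsqueeze(2)
        base = p0 + t * (p1 - p0)                             # (B, Q, K, 2) points on the line
        offsets = self.offset_head(query).view(B, Q, K, 2).tanh() * 0.1
        loc = (base + offsets).clamp(0, 1) * 2 - 1            # grid_sample expects [-1, 1]
        value = self.value_proj(feat)
        sampled = F.grid_sample(value, loc, align_corners=False)   # (B, C, Q, K)
        attn = self.weight_head(query).softmax(-1)            # (B, Q, K)
        out = (sampled * attn.unsqueeze(1)).sum(-1)           # (B, C, Q)
        return self.out_proj(out.permute(0, 2, 1))            # (B, Q, C)
```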
{"title":"Line segment detectors with deformable attention","authors":"Shoufeng Tang , Shuo Zhou , Xiamin Tong , Jingyuan Gao , Yuhao Wang , Xiaojun Ma","doi":"10.1016/j.image.2024.117155","DOIUrl":"10.1016/j.image.2024.117155","url":null,"abstract":"<div><p>The object detectors based on Transformers are advancing rapidly. On the contrary, the development of line segment detectors is relatively slow. It is noteworthy that the object and line segments are both 2D targets. In this work, we design a line segment detection algorithm based on deformable attention. Leveraging this algorithm and the line segment loss function, we transform the object detectors, Deformable DETR and ViDT, into end-to-end line segment detectors named Deformable LETR and ViDTLE, respectively. In order to adapt the idea of sparse modeling for line segment detection, we propose a new attention mechanism named line segment deformable attention (LSDA). This mechanism focuses on the valuable positions under the guidance of reference line to refine line segments. We design an auxiliary algorithm named line segment iterative refinement for LSDA. With as few modifications as possible, we transform two object detection detectors, namely SMCA DETR and PnP DETR into competitive line segment detectors named SMCA LETR and PnP LETR, respectively. The experimental results show that the performances of the proposed methods are efficient.</p></div>","PeriodicalId":49521,"journal":{"name":"Signal Processing-Image Communication","volume":"128 ","pages":"Article 117155"},"PeriodicalIF":3.4,"publicationDate":"2024-06-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141403728","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Quality evaluation of point cloud compression techniques
Pub Date: 2024-06-07 DOI: 10.1016/j.image.2024.117156
Joao Prazeres , Manuela Pereira , Antonio M.G. Pinheiro
A study on the quality evaluation of point clouds in the presence of coding distortions is presented. For that, four different point cloud coding solutions, namely the standardized MPEG codecs G-PCC and V-PCC, the deep learning-based coding solution RS-DLPCC, and Draco, are compared using a subjective evaluation methodology. Furthermore, several full-reference, reduced-reference, and no-reference point cloud quality metrics are evaluated. Two different point cloud normal computation methods were tested for the metrics that rely on them, namely the CloudCompare quadric fitting method with radii of five, ten, and twenty, and the Meshlab KNN method with K equal to six, ten, and eighteen. To generalize the results, the objective quality metrics were also benchmarked on a public database with mean opinion scores available. To evaluate the statistical differences between the metrics, the Krasula method was employed. The Point Cloud Quality Metric shows the best performance and a very good representation of the subjective results, and it is the metric with the most statistically significant results. It was also revealed that the CloudCompare quadric fitting method with radii of ten and twenty produced the most reliable normals for the metrics that depend on them. Finally, the study revealed that the most commonly used metrics fail to accurately predict compression quality when artifacts generated by deep learning methods are present.
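Since the reliability of several of the benchmarked metrics hinges on the normals they are given, the following sketch illustrates a generic KNN-based normal estimation (PCA over each neighborhood). It is a minimal illustration under assumed defaults, not the CloudCompare or Meshlab code used in the study.

```python
import numpy as np
from scipy.spatial import cKDTree

def knn_normals(points, k=10):
    """Generic KNN normal estimation: fit a plane to each neighborhood by PCA
    and take the direction of least variance (normal orientation is ambiguous)."""
    tree = cKDTree(points)
    _, idx = tree.query(points, k=k + 1)         # +1: each point is its own nearest neighbor
    normals = np.empty_like(points)
    for i, nbrs in enumerate(idx):
        nbr_pts = points[nbrs]
        cov = np.cov(nbr_pts.T)                  # 3x3 covariance of the neighborhood
        eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
        normals[i] = eigvecs[:, 0]               # eigenvector of the smallest eigenvalue
    return normals
```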
{"title":"Quality evaluation of point cloud compression techniques","authors":"Joao Prazeres , Manuela Pereira , Antonio M.G. Pinheiro","doi":"10.1016/j.image.2024.117156","DOIUrl":"https://doi.org/10.1016/j.image.2024.117156","url":null,"abstract":"<div><p>A study on the quality evaluation of point clouds in the presence of coding distortions is presented. For that, four different point cloud coding solutions, notably the standardized MPEG codecs G-PCC and V-PCC, a deep learning-based coding solution RS-DLPCC, and Draco, are compared using a subjective evaluation methodology. Furthermore, several full-reference, reduced-reference and no-reference point cloud quality metrics are evaluated. Two different point cloud normal computation methods were tested for the metrics that rely on them, notably the Cloud Compare quadric fitting method with radius of five, ten, and twenty and Meshlab KNN with K six, ten, and eighteen. To generalize the results, the objective quality metrics were also benchmarked on a public database, with mean opinion scores available. To evaluate the statistical differences between the metrics, the Krasula method was employed. The Point Cloud Quality Metric reveals the best performance and a very good representation of the subjective results, as well as being the metric with the most statistically significant results. It was also revealed that the Cloud Compare quadric fitting method with radius 10 and 20 produced the most reliable normals for the metrics dependent on them. Finally, the study revealed that the most commonly used metrics fail to accurately predict the compression quality when artifacts generated by deep learning methods are present.</p></div>","PeriodicalId":49521,"journal":{"name":"Signal Processing-Image Communication","volume":"128 ","pages":"Article 117156"},"PeriodicalIF":3.5,"publicationDate":"2024-06-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141328978","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
“Sparse + Low-Rank” tensor completion approach for recovering images and videos
Pub Date: 2024-05-24 DOI: 10.1016/j.image.2024.117152
Chenjian Pan , Chen Ling , Hongjin He , Liqun Qi , Yanwei Xu
Recovering color images and videos from highly undersampled data is a fundamental and challenging task in face recognition and computer vision. Exploiting the multi-dimensional nature of color images and videos, in this paper we propose a novel tensor completion approach that efficiently explores the sparsity of tensor data under the discrete cosine transform (DCT). Specifically, we introduce two “sparse + low-rank” tensor completion models as well as two implementable algorithms for finding their solutions. The first is a DCT-based sparse plus weighted-nuclear-norm-induced low-rank minimization model. The second is a DCT-based sparse plus p-shrinking-mapping-induced low-rank optimization model. Moreover, we propose two implementable augmented Lagrangian-based algorithms for solving the underlying optimization models. A series of numerical experiments, including color image inpainting and video data recovery, demonstrates that the proposed approach performs better than many existing state-of-the-art tensor completion methods, especially when the ratio of missing data is high.
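As a rough illustration of the “sparse + low-rank” idea, the sketch below combines DCT-domain soft-thresholding for the sparse component with singular value thresholding on a matrix unfolding for the low-rank component. It is a toy single iteration with assumed thresholds and an assumed equal blending of the two parts, not the paper's augmented Lagrangian algorithms.

```python
import numpy as np
from scipy.fft import dctn, idctn

def soft_threshold(x, tau):
    """Elementwise shrinkage used for the DCT-domain sparse component."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def svd_threshold(mat, tau):
    """Singular value thresholding used for the low-rank component."""
    u, s, vt = np.linalg.svd(mat, full_matrices=False)
    return u @ np.diag(soft_threshold(s, tau)) @ vt

def sparse_plus_lowrank_step(obs, mask, tau_s=0.05, tau_l=1.0):
    # obs:  observed tensor (H, W, C) with zeros at the missing entries
    # mask: boolean array of the same shape, True where entries are observed
    coeff = soft_threshold(dctn(obs, norm='ortho'), tau_s)       # sparsify in the DCT domain
    sparse_part = idctn(coeff, norm='ortho')
    unfold = obs.reshape(obs.shape[0], -1)                       # mode-1 unfolding
    lowrank_part = svd_threshold(unfold, tau_l).reshape(obs.shape)
    est = 0.5 * (sparse_part + lowrank_part)                     # assumed equal blend
    est[mask] = obs[mask]                                        # keep the observed entries
    return est
```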
{"title":"“Sparse + Low-Rank” tensor completion approach for recovering images and videos","authors":"Chenjian Pan , Chen Ling , Hongjin He , Liqun Qi , Yanwei Xu","doi":"10.1016/j.image.2024.117152","DOIUrl":"10.1016/j.image.2024.117152","url":null,"abstract":"<div><p>Recovering color images and videos from highly undersampled data is a fundamental and challenging task in face recognition and computer vision. By the multi-dimensional nature of color images and videos, in this paper, we propose a novel tensor completion approach, which is able to efficiently explore the sparsity of tensor data under the discrete cosine transform (DCT). Specifically, we introduce two “sparse + low-rank” tensor completion models as well as two implementable algorithms for finding their solutions. The first one is a DCT-based sparse plus weighted nuclear norm induced low-rank minimization model. The second one is a DCT-based sparse plus <span><math><mi>p</mi></math></span>-shrinking mapping induced low-rank optimization model. Moreover, we accordingly propose two implementable augmented Lagrangian-based algorithms for solving the underlying optimization models. A series of numerical experiments including color image inpainting and video data recovery demonstrate that our proposed approach performs better than many existing state-of-the-art tensor completion methods, especially for the case when the ratio of missing data is high.</p></div>","PeriodicalId":49521,"journal":{"name":"Signal Processing-Image Communication","volume":"127 ","pages":"Article 117152"},"PeriodicalIF":3.5,"publicationDate":"2024-05-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141192434","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Transformer based Douglas-Rachford unrolling network for compressed sensing
Pub Date: 2024-05-24 DOI: 10.1016/j.image.2024.117153
Yueming Su , Qiusheng Lian , Dan Zhang , Baoshun Shi
Compressed sensing (CS) with a binary sampling matrix is hardware-friendly and memory-saving in the signal processing field. Existing Convolutional Neural Network (CNN)-based CS methods show potential restrictions in exploiting non-local similarity and lack interpretability. In parallel, the emerging Transformer architecture performs well in modelling long-range correlations. To further improve CS reconstruction quality from highly under-sampled CS measurements, a Transformer-based deep unrolling reconstruction network, abbreviated as DR-TransNet, is proposed, whose design is inspired by the traditional iterative Douglas-Rachford algorithm. It combines the structural insights of optimization-based methods with the speed of network-based ones. Therein, a U-type Transformer-based proximal sub-network is designed to reconstruct images in the wavelet domain, with the spatial domain as an auxiliary mode, aiming to explore local informative details and global long-term interactions of the images. Specifically, a single flexible model is trained to address CS reconstruction with different binary CS sampling ratios. Compared with state-of-the-art CS reconstruction methods with binary sampling matrices, the proposed method achieves appealing improvements in terms of Peak Signal to Noise Ratio (PSNR), Structural Similarity Index Measure (SSIM) and visual metrics. Codes are available at https://github.com/svyueming/DR-TransNet.
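A minimal sketch of one unrolled Douglas-Rachford iteration for compressed sensing is given below, with a small residual CNN standing in for the learned proximal operator (the paper uses a U-type Transformer working in the wavelet and spatial domains, which is not reproduced here). The module names and the gradient-step approximation of the data-term proximal are assumptions.

```python
import torch
import torch.nn as nn

class ProxNet(nn.Module):
    """Stand-in for the learned proximal sub-network (a small residual CNN here)."""
    def __init__(self, ch=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, ch, 3, padding=1))

    def forward(self, x):
        return x + self.body(x)                 # residual refinement

class DRIteration(nn.Module):
    """One unrolled Douglas-Rachford step for measurements y = Phi x (sketch)."""
    def __init__(self, ch=1):
        super().__init__()
        self.prox_g = ProxNet(ch)
        self.step = nn.Parameter(torch.tensor(0.5))   # learnable step size

    def forward(self, x, y, Phi):
        # x: current estimate (B, C, H, W); y: measurements (B, M); Phi: (M, N), N = C*H*W
        B, C, H, W = x.shape
        flat = x.reshape(B, -1)
        # proximal of the data term approximated by a gradient step on ||Phi u - y||^2
        u = flat - self.step * (flat @ Phi.t() - y) @ Phi
        u = u.reshape(B, C, H, W)
        z = self.prox_g(2 * u - x)              # learned proximal on the reflected point
        return x + z - u                        # Douglas-Rachford update
```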
{"title":"Transformer based Douglas-Rachford unrolling network for compressed sensing","authors":"Yueming Su , Qiusheng Lian , Dan Zhang , Baoshun Shi","doi":"10.1016/j.image.2024.117153","DOIUrl":"10.1016/j.image.2024.117153","url":null,"abstract":"<div><p>Compressed sensing (CS) with the binary sampling matrix is hardware-friendly and memory-saving in the signal processing field. Existing Convolutional Neural Network (CNN)-based CS methods show potential restrictions in exploiting non-local similarity and lack interpretability. In parallel, the emerging Transformer architecture performs well in modelling long-range correlations. To further improve the CS reconstruction quality from highly under-sampled CS measurements, a Transformer based deep unrolling reconstruction network abbreviated as DR-TransNet is proposed, whose design is inspired by the traditional iterative Douglas-Rachford algorithm. It combines the merits of structure insights of optimization-based methods and the speed of the network-based ones. Therein, a U-type Transformer based proximal sub-network is elaborated to reconstruct images in the wavelet domain and the spatial domain as an auxiliary mode, which aims to explore local informative details and global long-term interaction of the images. Specially, a flexible single model is trained to address the CS reconstruction with different binary CS sampling ratios. Compared with the state-of-the-art CS reconstruction methods with the binary sampling matrix, the proposed method can achieve appealing improvements in terms of Peak Signal to Noise Ratio (PSNR), Structural Similarity Index Measure (SSIM) and visual metrics. Codes are available at <span>https://github.com/svyueming/DR-TransNet</span><svg><path></path></svg>.</p></div>","PeriodicalId":49521,"journal":{"name":"Signal Processing-Image Communication","volume":"127 ","pages":"Article 117153"},"PeriodicalIF":3.5,"publicationDate":"2024-05-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141143653","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Reinforced Res-Unet transformer for underwater image enhancement
Pub Date: 2024-05-22 DOI: 10.1016/j.image.2024.117154
Peitong Li , Jiaying Chen , Chengtao Cai
Light propagating through water is subject to varying degrees of energy loss, causing captured images to exhibit color distortion, reduced contrast, and indistinct details and textures. Data-driven approaches offer significant advantages over traditional algorithms, such as improved accuracy and reduced computational cost. However, challenges such as optimizing network architecture, refining coding techniques, and expanding database resources must be addressed to ensure the generation of high-quality reconstructed images across diverse tasks. In this paper, an underwater image enhancement network based on feature fusion, named RUTUIE, is proposed. It leverages the strengths of both ResNet and the U-shaped architecture, primarily structured around a streamlined up-and-down sampling mechanism. Specifically, the U-shaped structure serves as the ResNet backbone, equipped with two feature transformers at the encoding and decoding ends, which are linked by a single-stage up-and-down sampling structure. This architecture is designed to minimize the omission of minor features during feature scale transformations. Furthermore, the improved Transformer encoder leverages a feature-level attention mechanism and the advantages of CNNs, endowing the network with both local and global perceptual capabilities. We then propose and demonstrate that embedding an adaptive feature selection module at appropriate locations can retain more learned feature representations. Moreover, a previously proposed color transfer method is applied to synthesize underwater images and augment network training. Extensive experiments demonstrate that our work effectively corrects color casts, reconstructs rich texture information in natural scenes, and improves contrast.
{"title":"Reinforced Res-Unet transformer for underwater image enhancement","authors":"Peitong Li , Jiaying Chen , Chengtao Cai","doi":"10.1016/j.image.2024.117154","DOIUrl":"10.1016/j.image.2024.117154","url":null,"abstract":"<div><p>Light propagation through water is subject to varying degrees of energy loss, causing captured images to display characteristics of color distortion, reduced contrast, and indistinct details and textures. The data-driven approach offers significant advantages over traditional algorithms, such as improved accuracy and reduced computational costs. However, challenges such as optimizing network architecture, refining coding techniques, and expanding database resources must be addressed to ensure the generation of high-quality reconstructed images across diverse tasks. In this paper, an underwater image enhancement network based on feature fusion is proposed named RUTUIE, which integrates feature fusion techniques. It leverages the strengths of both Resnet and U-shape architecture, primarily structured around a streamlined up-and-down sampling mechanism. Specifically, the U-shaped structure serves as the backbone of ResNet, equipped with two feature transformers at both the encoding and decoding ends, which are linked by a single-stage up-and-down sampling structure. This architecture is designed to minimize the omission of minor features during feature scale transformations. Furthermore, the improved Transformer encoder leverages a feature-level attention mechanism and the advantages of CNNs, endowing the network with both local and global perceptual capabilities. Then, we propose and demonstrate that embedding an adaptive feature selection module at appropriate locations can retain more learned feature representations. Moreover, the application of a previously proposed color transfer method for synthesizing underwater images and augmenting network training. Extensive experiments demonstrate that our work effectively corrects color casts, reconstructs the rich texture information in natural scenes, and improves the contrast.</p></div>","PeriodicalId":49521,"journal":{"name":"Signal Processing-Image Communication","volume":"127 ","pages":"Article 117154"},"PeriodicalIF":3.5,"publicationDate":"2024-05-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141140727","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Self-supervised 3D vehicle detection based on monocular images
Pub Date: 2024-05-18 DOI: 10.1016/j.image.2024.117149
He Liu, Yi Sun
The deep learning-based 3D object detection literature on monocular images has been dominated by methods that require supervision in the form of 3D bounding box annotations for training. However, obtaining sufficient 3D annotations is expensive, laborious and prone to introducing errors. To address this problem, we propose a monocular self-supervised approach towards 3D object detection relying solely on observed RGB data rather than 3D bounding boxes for training. We leverage differentiable rendering to apply visual alignment to depth maps, instance masks and point clouds for self-supervision. Furthermore, considering the complexity of autonomous driving scenes, we introduce a point cloud filter to reduce noise impact and design an automatic training set pruning strategy suitable for the self-supervised framework to further improve network performance. We provide detailed experiments on the KITTI benchmark and achieve competitive performance with existing self-supervised methods as well as some fully supervised methods.
{"title":"Self-supervised 3D vehicle detection based on monocular images","authors":"He Liu, Yi Sun","doi":"10.1016/j.image.2024.117149","DOIUrl":"10.1016/j.image.2024.117149","url":null,"abstract":"<div><p>The deep learning-based 3D object detection literature on monocular images has been dominated by methods that require supervision in the form of 3D bounding box annotations for training. However, obtaining sufficient 3D annotations is expensive, laborious and prone to introducing errors. To address this problem, we propose a monocular self-supervised approach towards 3D object detection relying solely on observed RGB data rather than 3D bounding boxes for training. We leverage differentiable rendering to apply visual alignment to depth maps, instance masks and point clouds for self-supervision. Furthermore, considering the complexity of autonomous driving scenes, we introduce a point cloud filter to reduce noise impact and design an automatic training set pruning strategy suitable for the self-supervised framework to further improve network performance. We provide detailed experiments on the KITTI benchmark and achieve competitive performance with existing self-supervised methods as well as some fully supervised methods.</p></div>","PeriodicalId":49521,"journal":{"name":"Signal Processing-Image Communication","volume":"127 ","pages":"Article 117149"},"PeriodicalIF":3.5,"publicationDate":"2024-05-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141130565","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
HorSR: High-order spatial interactions and residual global filter for efficient image super-resolution
Pub Date: 2024-05-18 DOI: 10.1016/j.image.2024.117148
Fengsui Wang , Xi Chu
Recent advances in efficient image super-resolution (EISR) rely on convolutional neural networks that adopt distillation and aggregation strategies, with copious channel split and concatenation operations, to fully exploit limited hierarchical features. In contrast, the Transformer network presents a challenge for EISR because multi-head self-attention is computationally demanding. To respond to this challenge, this paper proposes replacing multi-head self-attention in the Transformer network with global filtering and recursive gated convolution. This strategy allows us to design a high-order spatial interaction and residual global filter network for efficient image super-resolution (HorSR), which comprises three components: a shallow feature extraction module, a deep feature extraction module, and a high-quality image-reconstruction module. In particular, the deep feature extraction module comprises residual global filtering and recursive gated convolution blocks. The experimental results show that the HorSR network provides state-of-the-art performance with the lowest FLOPs among existing EISR methods.
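The global filtering component can be illustrated with a frequency-domain layer that multiplies the feature spectrum by a learnable filter, in the spirit of GFNet; the sketch below is an assumption-laden stand-in (fixed feature-map size, simple initialization), not the exact HorSR block.

```python
import torch
import torch.nn as nn

class GlobalFilter(nn.Module):
    """Illustrative global filtering layer: pointwise multiplication of the
    2D feature spectrum by a learnable frequency-domain filter."""
    def __init__(self, channels, height, width):
        super().__init__()
        # rfft2 keeps width // 2 + 1 frequency bins along the last axis;
        # the trailing dimension of 2 stores real and imaginary parts.
        self.filter = nn.Parameter(
            torch.randn(channels, height, width // 2 + 1, 2) * 0.02)

    def forward(self, x):
        # x: (B, C, H, W) real-valued feature map
        spec = torch.fft.rfft2(x, norm='ortho')
        spec = spec * torch.view_as_complex(self.filter)
        return torch.fft.irfft2(spec, s=x.shape[-2:], norm='ortho')
```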
{"title":"HorSR: High-order spatial interactions and residual global filter for efficient image super-resolution","authors":"Fengsui Wang , Xi Chu","doi":"10.1016/j.image.2024.117148","DOIUrl":"10.1016/j.image.2024.117148","url":null,"abstract":"<div><p>Recent advances in efficient image super-resolution (EISR) include convolutional neural networks, which exploit distillation and aggregation strategies with copious channel split and concatenation operations to fully exploit limited hierarchical features. In contrast, the Transformer network presents a challenge for EISR because multiheaded self-attention is a computationally demanding process. To respond to this challenge, this paper proposes replacing multiheaded self-attention in the Transformer network with global filtering and recursive gated convolution. This strategy allows us to design a high-order spatial interaction and residual global filter network for efficient image super-resolution (HorSR), which comprises three components: a shallow feature extraction module, a deep feature extraction module, and a high-quality image-reconstruction module. In particular, the deep feature extraction module comprises residual global filtering and recursive gated convolution blocks. The experimental results show that the HorSR network provides state-of-the-art performance with the lowest FLOPs of existing EISR methods.</p></div>","PeriodicalId":49521,"journal":{"name":"Signal Processing-Image Communication","volume":"127 ","pages":"Article 117148"},"PeriodicalIF":3.5,"publicationDate":"2024-05-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141144376","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Benchmarking deep models on retinal fundus disease diagnosis and a large-scale dataset
Pub Date: 2024-05-14 DOI: 10.1016/j.image.2024.117151
Xue Xia , Ying Li , Guobei Xiao , Kun Zhan , Jinhua Yan , Chao Cai , Yuming Fang , Guofu Huang
Retinal fundus imaging contributes to monitoring patients' vision by providing views of the interior surface of the eye. Machine learning models greatly aid ophthalmologists in detecting retinal disorders from color fundus images. Hence, data quality is pivotal for enhancing diagnosis algorithms, which ultimately benefits vision care and maintenance. To facilitate further research in this domain, we introduce the Eye Disease Diagnosis and Fundus Synthesis (EDDFS) dataset, comprising 28,877 fundus images. These include 15,000 healthy samples and a diverse range of images depicting disorders such as diabetic retinopathy, age-related macular degeneration, glaucoma, pathological myopia, hypertension retinopathy, retinal vein occlusion, and laser photocoagulation. In addition to providing the dataset, we propose a Transformer-joint convolution network for automated eye disease screening. Firstly, a co-attention structure is integrated to capture long-range attention information along with local features. Secondly, a cross-stage feature fusion module is designed to extract multi-level and disease-related information. By leveraging the dataset and our proposed network, we establish benchmarks for disease screening and grading tasks. Our experimental results underscore the network's proficiency in both multi-label and single-label disease diagnosis, while also showcasing the dataset's capability to support fundus synthesis. (The dataset and code will be available on https://github.com/xia-xx-cv/EDDFS_dataset.)
{"title":"Benchmarking deep models on retinal fundus disease diagnosis and a large-scale dataset","authors":"Xue Xia , Ying Li , Guobei Xiao , Kun Zhan , Jinhua Yan , Chao Cai , Yuming Fang , Guofu Huang","doi":"10.1016/j.image.2024.117151","DOIUrl":"10.1016/j.image.2024.117151","url":null,"abstract":"<div><p>Retinal fundus imaging contributes to monitoring the vision of patients by providing views of the interior surface of the eyes. Machine learning models greatly aided ophthalmologists in detecting retinal disorders from color fundus images. Hence, the quality of the data is pivotal for enhancing diagnosis algorithms, which ultimately benefits vision care and maintenance. To facilitate further research in this domain, we introduce the Eye Disease Diagnosis and Fundus Synthesis (EDDFS) dataset, comprising 28,877 fundus images. These include 15,000 healthy samples and a diverse range of images depicting various disorders such as diabetic retinopathy, age-related macular degeneration, glaucoma, pathological myopia, hypertension retinopathy, retinal vein occlusion, and Laser photocoagulation. In addition to providing the dataset, we propose a Transformer-joint convolution network for automated eye disease screening. Firstly, a co-attention structure is integrated to capture long-range attention information along with local features. Secondly, a cross-stage feature fusion module is designed to extract multi-level and disease-related information. By leveraging the dataset and our proposed network, we establish benchmarks for disease screening and grading tasks. Our experimental results underscore the network’s proficiency in both multi-label and single-label disease diagnosis, while also showcasing the dataset’s capability in supporting fundus synthesis. (<em>The dataset and code will be available on</em> <span>https://github.com/xia-xx-cv/EDDFS_dataset</span><svg><path></path></svg>.)</p></div>","PeriodicalId":49521,"journal":{"name":"Signal Processing-Image Communication","volume":"127 ","pages":"Article 117151"},"PeriodicalIF":3.5,"publicationDate":"2024-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0923596524000523/pdfft?md5=243413eb9a59b33a94d00c614393aaf1&pid=1-s2.0-S0923596524000523-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141038978","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Deep multi-scale feature mixture model for image super-resolution with multiple-focal-length degradation
Pub Date: 2024-05-10 DOI: 10.1016/j.image.2024.117139
Jun Xiao , Qian Ye , Rui Zhao , Kin-Man Lam , Kao Wan
Single image super-resolution is a classic problem in computer vision. In recent years, deep learning-based models have achieved unprecedented success on this problem. However, most existing deep super-resolution models unavoidably produce degraded results when applied to real-world images captured by cameras with different focal lengths. The degradation in these images, called multiple-focal-length degradation, is spatially variant and more complicated than bicubic downsampling degradation. To address this challenging issue, we propose a multi-scale feature mixture model in this paper. The proposed model can intensively exploit local patterns from different scales for image super-resolution. To improve performance, we further propose a novel loss function based on the Laplacian pyramid, which guides the model to recover the information of different frequency subbands separately. Comprehensive experiments show that the proposed model better preserves object structure and generates high-quality images, achieving the best performance compared with other state-of-the-art deep single image super-resolution methods.
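A generic Laplacian-pyramid loss of the kind described, supervising each frequency subband separately, could be sketched as follows; the pyramid depth, blur kernel, and equal per-level weighting are assumptions, not necessarily the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def gaussian_kernel():
    # fixed 5x5 binomial approximation of a Gaussian
    k1d = torch.tensor([1., 4., 6., 4., 1.])
    k2d = torch.outer(k1d, k1d)
    return (k2d / k2d.sum()).view(1, 1, 5, 5)

def gaussian_blur(x, kernel):
    # depthwise blur: one copy of the kernel per channel
    c = x.shape[1]
    k = kernel.repeat(c, 1, 1, 1)
    return F.conv2d(F.pad(x, (2, 2, 2, 2), mode='reflect'), k, groups=c)

def laplacian_pyramid(x, kernel, levels=3):
    """Decompose an image into band-pass levels plus a low-frequency residual."""
    pyramid, current = [], x
    for _ in range(levels):
        blurred = gaussian_blur(current, kernel)
        down = F.avg_pool2d(blurred, 2)
        up = F.interpolate(down, size=current.shape[-2:], mode='bilinear',
                           align_corners=False)
        pyramid.append(current - up)     # high-frequency band
        current = down
    pyramid.append(current)              # low-frequency residual
    return pyramid

def laplacian_pyramid_loss(pred, target, kernel, levels=3):
    """L1 distance summed over pyramid levels, so each frequency subband
    is supervised separately (equal level weights are an assumption)."""
    loss = 0.0
    for p, t in zip(laplacian_pyramid(pred, kernel, levels),
                    laplacian_pyramid(target, kernel, levels)):
        loss = loss + F.l1_loss(p, t)
    return loss
```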
{"title":"Deep multi-scale feature mixture model for image super-resolution with multiple-focal-length degradation","authors":"Jun Xiao , Qian Ye , Rui Zhao , Kin-Man Lam , Kao Wan","doi":"10.1016/j.image.2024.117139","DOIUrl":"10.1016/j.image.2024.117139","url":null,"abstract":"<div><p>Single image super-resolution is a classic problem in computer vision. In recent years, deep learning-based models have achieved unprecedented success with this problem. However, most existing deep super-resolution models unavoidably produce degraded results when applied to real-world images captured by cameras with different focal lengths. The degradation in these images is called multiple-focal-length degradation, which is spatially variant and more complicated than the bicubic downsampling degradation. To address such a challenging issue, we propose a multi-scale feature mixture model in this paper. The proposed model can intensively exploit local patterns from different scales for image super-resolution. To improve the performance, we further propose a novel loss function based on the Laplacian pyramid, which guides the model to recover the information separately of different frequency subbands. Comprehensive experiments show that our proposed model has a better ability to preserve the structure of objects and generate high-quality images, leading to the best performance compared with other state-of-the-art deep single image super-resolution methods.</p></div>","PeriodicalId":49521,"journal":{"name":"Signal Processing-Image Communication","volume":"127 ","pages":"Article 117139"},"PeriodicalIF":3.5,"publicationDate":"2024-05-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141028694","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Weakly supervised instance segmentation via class double-activation maps and boundary localization
Pub Date: 2024-05-10 DOI: 10.1016/j.image.2024.117150
Jin Peng, Yongxiong Wang, Zhiqun Pan
Weakly supervised instance segmentation based on image-level class labels has recently gained much attention, in which the key step is to generate pseudo labels based on class activation maps (CAMs). Most methods adopt the binary cross-entropy (BCE) loss to train the classification model. However, since the BCE loss does not enforce mutual exclusivity among classes, class activations occur independently. Thus, not only are foreground classes wrongly activated as background, but incorrect activations among confusing classes also occur in the foreground. To solve this problem, we propose the Class Double-Activation Map, called Double-CAM. Firstly, the vanilla CAM is extracted from the multi-label classifier and then fused with the output feature map of the backbone. The enhanced feature map of each class is fed into a single-label classification branch with softmax cross-entropy (SCE) loss and an entropy minimization module, from which the more accurate Double-CAM is extracted. It refines the vanilla CAM to improve the quality of the pseudo labels. Secondly, to mine object edge cues from Double-CAM, we propose the Boundary Localization (BL) module to synthesize boundary annotations, providing more explicit constraints for label propagation without adding supervision. The quality of the pseudo masks is also improved substantially with the addition of the BL module. Finally, the generated pseudo labels are used to train fully supervised instance segmentation networks. Evaluations on the VOC and COCO datasets show that our method achieves excellent performance, outperforming mainstream weakly supervised segmentation methods at the same supervision level, and even those that depend on stronger supervision.
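For reference, the vanilla CAM that Double-CAM starts from can be computed by weighting the backbone features with the classifier weights of a given class, as in the sketch below; the fusion, softmax single-label branch, and entropy minimization that produce Double-CAM itself are not shown, and the function signature is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

def class_activation_map(features, fc_weight, class_idx, out_size):
    """Vanilla CAM: weight the last conv features by the classifier weights
    of one class, upsample, and normalize to [0, 1]."""
    # features:  (B, C, h, w) output of the backbone
    # fc_weight: (num_classes, C) weights of the final classification layer
    w = fc_weight[class_idx].view(1, -1, 1, 1)            # (1, C, 1, 1)
    cam = (features * w).sum(dim=1, keepdim=True)         # (B, 1, h, w)
    cam = F.relu(cam)
    cam = F.interpolate(cam, size=out_size, mode='bilinear', align_corners=False)
    cam_min = cam.amin(dim=(2, 3), keepdim=True)
    cam_max = cam.amax(dim=(2, 3), keepdim=True)
    return (cam - cam_min) / (cam_max - cam_min + 1e-6)
```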
{"title":"Weakly supervised instance segmentation via class double-activation maps and boundary localization","authors":"Jin Peng, Yongxiong Wang, Zhiqun Pan","doi":"10.1016/j.image.2024.117150","DOIUrl":"10.1016/j.image.2024.117150","url":null,"abstract":"<div><p>Weakly supervised instance segmentation based on image-level class labels has recently gained much attention, in which the primary key step is to generate the pseudo labels based on class activation maps (CAMs). Most methods adopt binary cross-entropy (BCE) loss to train the classification model. However, since BCE loss is not class mutually exclusive, activations among classes occur independently. Thus, not only do foreground classes are wrongly activated as background, but also incorrect activations among confusing classes are occurred in the foreground. To solve this problem, we propose the Class Double-Activation Map, called Double-CAM. Firstly, the vanilla CAM is extracted from the multi-label classifier and then fused with the output feature map of backbone. The enhanced feature map of each class is fed into the single-label classification branch with softmax cross-entropy (SCE) loss and entropy minimization module, from which the more accurate Double-CAM is extracted. It refines the vanilla CAM to improve the quality of pseudo labels. Secondly, to mine object edge cues from Double-CAM, we propose the Boundary Localization (BL) module to synthesize boundary annotations, so as to provide constraints for label propagation more explicitly without adding additional supervision. The quality of pseudo masks is also improved substantially with the addition of BL module. Finally, the generated pseudo labels are used to train fully supervised instance segmentation networks. The evaluations on VOC and COCO datasets show that our method achieves excellent performance, outperforming mainstream weakly supervised segmentation methods at the same supervisory level, even those that depend on stronger supervision.</p></div>","PeriodicalId":49521,"journal":{"name":"Signal Processing-Image Communication","volume":"127 ","pages":"Article 117150"},"PeriodicalIF":3.5,"publicationDate":"2024-05-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141030265","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}