Pub Date: 2025-04-01 | Epub Date: 2025-01-04 | DOI: 10.1016/j.neunet.2024.107112
Yupeng Wang, Yongli Wang, Zaki Ahmad Khan, Anqi Huang, Jianghui Sang
Smoke is a critical indicator of forest fires, often detectable before flames ignite. Accurate smoke identification in remote sensing images is vital for effective forest fire monitoring within Internet of Things (IoT) systems. However, existing detection methods frequently falter in complex real-world scenarios, where variable smoke shapes and sizes, intricate backgrounds, and smoke-like phenomena (e.g., clouds and haze) lead to missed detections and false alarms. To address these challenges, we propose the Multi-level Feature Fusion Network (MFFNet), a novel framework grounded in contrastive learning. MFFNet begins by extracting multi-scale features from remote sensing images using a pre-trained ConvNeXt model, capturing information across different levels of granularity to accommodate variations in smoke appearance. The Attention Feature Enhancement Module further refines these multi-scale features, enhancing fine-grained, discriminative attributes relevant to smoke detection. Subsequently, the Bilinear Feature Fusion Module combines these enriched features, effectively reducing background interference and improving the model's ability to distinguish smoke from visually similar phenomena. Finally, contrastive feature learning is employed to improve robustness against intra-class variations by focusing on unique regions within the smoke patterns. Evaluated on the benchmark dataset USTC_SmokeRS, MFFNet achieves an accuracy of 98.87%. Additionally, our model demonstrates a detection rate of 94.54% on the extended E_SmokeRS dataset, with a low false alarm rate of 3.30%. These results highlight the effectiveness of MFFNet in recognizing smoke in remote sensing images, surpassing existing methodologies. The code is accessible at https://github.com/WangYuPeng1/MFFNet.
Title: Multi-level feature fusion networks for smoke recognition in remote sensing imagery. Journal: Neural Networks, vol. 184, article 107112.
Pub Date: 2025-04-01 | Epub Date: 2025-01-06 | DOI: 10.1016/j.neunet.2024.107096
Xinlei Yu, Ahmed Elazab, Ruiquan Ge, Jichao Zhu, Lingyan Zhang, Gangyong Jia, Qing Wu, Xiang Wan, Lihua Li, Changmiao Wang
Accurately predicting intracerebral hemorrhage (ICH) prognosis is a critical step in the clinical management of patients post-ICH. Recently, integrating artificial intelligence, particularly deep learning, has significantly enhanced prediction accuracy and relieved neurosurgeons of the burden of manual prognosis assessment. However, uni-modal methods have shown suboptimal performance due to the intricate pathophysiology of ICH. On the other hand, existing cross-modal approaches that incorporate tabular data have often failed to effectively extract complementary information and cross-modal features between modalities, thereby limiting their prognostic capabilities. This study introduces a novel cross-modal network, ICH-PRNet, designed to predict ICH prognosis outcomes. Specifically, we propose a joint-attention interaction encoder that effectively integrates computed tomography images and clinical texts within a unified representational space. Additionally, we define a multi-loss function comprising three components to comprehensively optimize cross-modal fusion capabilities. To balance the training process, we employ a self-adaptive dynamic prioritization algorithm that adjusts the weight of each component accordingly. Through these designs, our model establishes robust semantic connections between modalities and uncovers rich, complementary cross-modal information, thereby achieving superior prediction results. Extensive experimental results and comparisons with state-of-the-art methods on both in-house and publicly available datasets demonstrate the superiority and efficacy of the proposed method. Our code is at https://github.com/YU-deep/ICH-PRNet.git.
Title: ICH-PRNet: a cross-modal intracerebral haemorrhage prognostic prediction method using joint-attention interaction mechanism. Journal: Neural Networks, vol. 184, article 107096.
Pub Date: 2025-04-01 | Epub Date: 2024-12-31 | DOI: 10.1016/j.neunet.2024.107098
Zhongyuan Lu, Jin Liu, Miaozhong Xu
Modifying the structure of an existing network is a common way to further improve its performance. However, modifying some layers in a network often results in a mismatch with the pre-trained weights, and the fine-tuning process is time-consuming and resource-inefficient. To address this issue, we propose a novel technique called Identity Model Transformation (IMT), which keeps the output before and after the transformation equal through rigorous algebraic transformations. This approach ensures the preservation of the original model's performance when modifying layers. Additionally, IMT significantly reduces the total training time required to achieve optimal results while further enhancing network performance. IMT establishes a bridge for rapid transformation between model architectures, enabling a model to quickly perform analytic continuation and derive a family of tree-like models with better performance. This model family possesses a greater potential for optimization improvements compared to a single model. Extensive experiments across various object detection tasks validated the effectiveness and efficiency of the proposed IMT solution: it saved 94.76% of the time needed to fine-tune the basic model YOLOv4-Rot on the DOTA 1.5 dataset, and it yielded stable performance improvements of 9.89%, 6.94%, 2.36%, and 4.86% on the four datasets AI-TOD, DOTA 1.5, COCO 2017, and MRSAText, respectively.
Title: Identity Model Transformation for boosting performance and efficiency in object detection network. Journal: Neural Networks, vol. 184, article 107098.
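The idea of a transformation that leaves the model output unchanged can be illustrated with a very small, generic case. The sketch below is not the authors' IMT procedure, whose algebraic rules are not given in the abstract; it only assumes the simplest function-preserving change, namely inserting a new linear layer initialised to the identity after an existing one. The helper names (`linear`, `identity_layer`) and the toy weights are hypothetical.

```python
def linear(W, b, x):
    """Apply y = W x + b using plain Python lists."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def identity_layer(n):
    """Weights of an n-by-n linear layer that maps every input to itself."""
    W = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    b = [0.0] * n
    return W, b

# Toy "pre-trained" layer (values are illustrative assumptions).
W1 = [[0.5, -1.0], [2.0, 0.25]]
b1 = [0.1, -0.2]
x = [1.0, 3.0]

# Output of the original one-layer model.
original = linear(W1, b1, x)

# Modified model: the same layer followed by a newly inserted identity layer.
W_new, b_new = identity_layer(2)
transformed = linear(W_new, b_new, linear(W1, b1, x))
```

Because the inserted layer starts as the identity, the modified model reproduces the original outputs before any further training, which is the property the abstract describes as preserving the original model's performance.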
Pub Date: 2025-04-01 | Epub Date: 2025-01-03 | DOI: 10.1016/j.neunet.2024.107113
Varun Kumar, Somdatta Goswami, Katiana Kontolati, Michael D Shields, George Em Karniadakis
Multi-task learning (MTL) is an inductive transfer mechanism designed to leverage useful information from multiple tasks to improve generalization performance compared to single-task learning. It has been extensively explored in traditional machine learning to address issues such as data sparsity and overfitting in neural networks. In this work, we apply MTL to problems in science and engineering governed by partial differential equations (PDEs). However, implementing MTL in this context is complex, as it requires task-specific modifications to accommodate various scenarios representing different physical processes. To this end, we present a multi-task deep operator network (MT-DeepONet) to learn solutions across various functional forms of source terms in a PDE and multiple geometries in a single concurrent training session. We introduce modifications in the branch network of the vanilla DeepONet to account for various functional forms of a parameterized coefficient in a PDE. Additionally, we handle parameterized geometries by introducing a binary mask in the branch network and incorporating it into the loss term to improve convergence and generalization to new geometry tasks. Our approach is demonstrated on three benchmark problems: (1) learning different functional forms of the source term in the Fisher equation; (2) learning multiple geometries in a 2D Darcy Flow problem and showcasing better transfer learning capabilities to new geometries; and (3) learning 3D parameterized geometries for a heat transfer problem and demonstrating the ability to predict on new but similar geometries. Our MT-DeepONet framework offers a novel approach to solving PDE problems in engineering and science under a unified umbrella based on synergistic learning that reduces the overall training cost for neural operators.
Title: Synergistic learning with multi-task DeepONet for efficient PDE problem solving. Journal: Neural Networks, vol. 184, article 107113.
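To make the role of a binary geometry mask in the loss more concrete, here is a minimal sketch. It assumes the standard DeepONet form, in which the prediction at a query point is the dot product of a branch feature vector and a trunk feature vector, and it assumes that the mask simply restricts a mean-squared error to points inside the geometry. The exact loss used in MT-DeepONet is not stated in the abstract, so the function names and toy numbers below are hypothetical.

```python
def deeponet_output(branch, trunk):
    """Vanilla DeepONet prediction at one query point: <branch, trunk>."""
    return sum(b * t for b, t in zip(branch, trunk))

def masked_mse(pred, target, mask):
    """Mean squared error over the points where mask == 1 (inside the geometry)."""
    inside = sum(mask)
    if inside == 0:
        return 0.0
    return sum(m * (p - t) ** 2 for p, t, m in zip(pred, target, mask)) / inside

# Toy branch features for one input function and trunk features for three points.
branch = [1.0, 2.0]
trunks = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
preds = [deeponet_output(branch, t) for t in trunks]  # [1.0, 2.0, 3.0]

target = [1.0, 0.0, 2.0]
mask = [1, 0, 1]  # the second point lies outside the geometry and is ignored

loss = masked_mse(preds, target, mask)  # only points 1 and 3 contribute
```

In this toy case the large error at the masked-out point does not affect the loss, so the optimisation signal comes only from points that belong to the domain.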
Pub Date: 2025-04-01 | Epub Date: 2024-12-31 | DOI: 10.1016/j.neunet.2024.107071
Azadeh Faroughi, Parham Moradi, Mahdi Jalili
Recommendation systems are vital tools for helping users discover content that suits their interests. Collaborative filtering is one of the techniques used to analyze interactions between users and items, which are typically stored in a sparse matrix. This inherent sparsity poses a challenge because the gaps must be filled accurately and effectively to provide users with meaningful and personalized recommendations. Our solution addresses sparsity in recommendations by incorporating diverse data sources, including trust statements and an imputation graph. The trust graph captures user relationships and trust levels, working in conjunction with an imputation graph, which is constructed by estimating each user's missing ratings from the user-item matrix using the average ratings of the most similar users. An attention mechanism then fine-tunes the influence of these graphs together with the user-item rating graph, resulting in more personalized and effective recommendations. Our method consistently outperforms state-of-the-art recommenders in real-world dataset evaluations, underscoring its potential to strengthen recommendation systems and mitigate sparsity challenges.
Title: Enhancing Recommender Systems through Imputation and Social-Aware Graph Convolutional Neural Network. Journal: Neural Networks, vol. 184, article 107071.
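The imputation step described in the abstract (filling a user's missing ratings with the average ratings of the most similar users) can be sketched as follows. The abstract does not say which similarity measure or neighbourhood size is used, so cosine similarity over co-rated items and the parameter `k` are assumptions made only for illustration, as are the function names and the toy matrix.

```python
import math

def cosine_on_common(a, b):
    """Cosine similarity computed over the items both users have rated."""
    common = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not common:
        return 0.0
    dot = sum(x * y for x, y in common)
    na = math.sqrt(sum(x * x for x, _ in common))
    nb = math.sqrt(sum(y * y for _, y in common))
    return dot / (na * nb) if na and nb else 0.0

def impute(ratings, k=2):
    """Fill each missing rating with the mean rating of the k most similar raters."""
    n = len(ratings)
    filled = [row[:] for row in ratings]
    for u in range(n):
        # Other users, ordered from most to least similar to user u.
        neighbours = sorted(
            (v for v in range(n) if v != u),
            key=lambda v: cosine_on_common(ratings[u], ratings[v]),
            reverse=True,
        )
        for i, r in enumerate(ratings[u]):
            if r is None:
                votes = [ratings[v][i] for v in neighbours
                         if ratings[v][i] is not None][:k]
                if votes:
                    filled[u][i] = sum(votes) / len(votes)
    return filled

# Rows are users, columns are items, None marks a missing rating.
ratings = [
    [5, 4, None],
    [5, 4, 2],
    [1, 2, 5],
]
filled = impute(ratings, k=1)  # user 0 borrows item 2 from its nearest neighbour
```

With `k=1`, user 0's missing rating is taken from user 1, whose rated items match user 0's exactly; a larger `k` would blend in less similar users.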
Pub Date: 2025-03-03 | DOI: 10.1016/j.neunet.2025.107311
Yichu Xu, Di Wang, Lefei Zhang, Liangpei Zhang
Transformers have achieved satisfactory results in the field of hyperspectral image (HSI) classification. However, existing Transformer models face two key challenges when dealing with HSI scenes characterized by diverse land cover types and rich spectral information: (1) a fixed receptive field overlooks the effective contextual scales required by various HSI objects; (2) invalid self-attention features in context fusion affect model performance. To address these limitations, we propose a novel Dual Selective Fusion Transformer Network (DSFormer) for HSI classification. DSFormer achieves joint spatial and spectral contextual modeling by flexibly selecting and fusing features across different receptive fields, effectively reducing unnecessary information interference by focusing on the most relevant spatial–spectral tokens. Specifically, we design a Kernel Selective Fusion Transformer Block (KSFTB) to learn an optimal receptive field by adaptively fusing spatial and spectral features across different scales, enhancing the model’s ability to accurately identify diverse HSI objects. Additionally, we introduce a Token Selective Fusion Transformer Block (TSFTB), which strategically selects and combines essential tokens during the spatial–spectral self-attention fusion process to capture the most crucial contexts. Extensive experiments conducted on four benchmark HSI datasets demonstrate that the proposed DSFormer significantly improves land cover classification accuracy, outperforming existing state-of-the-art methods. Specifically, DSFormer achieves overall accuracies of 96.59%, 97.66%, 95.17%, and 94.59% on the Pavia University, Houston, Indian Pines, and Whu-HongHu datasets, respectively, reflecting improvements of 3.19%, 1.14%, 0.91%, and 2.80% over the previous model. The code will be available online at https://github.com/YichuXu/DSFormer.
Title: Dual selective fusion transformer network for hyperspectral image classification. Journal: Neural Networks, vol. 187, Article 107311.
Pub Date: 2025-03-03 | DOI: 10.1016/j.neunet.2025.107312
Yiyao Liu, Jinyao Li, Yi Yang, Cheng Zhao, Yongtao Zhang, Peng Yang, Lei Dong, Xiaofei Deng, Ting Zhu, Tianfu Wang, Wei Jiang, Baiying Lei
Given the rapid increase in breast cancer incidence, the Automated Breast Volume Scanner (ABVS) was developed to screen breast tumours efficiently and accurately. However, reviewing ABVS images is a challenging task owing to the significant variations in the sizes and shapes of breast tumours. We propose a novel 3D segmentation network (i.e., DST-C) that combines a convolutional neural network (CNN) with a dilated sampling self-attention Transformer (DST). In our network, the global features extracted from the DST branch are guided by the detailed local information provided by the CNN branch, which adapts to the diversity of tumour size and morphology. For medical images, especially ABVS images, the scarcity of annotations makes model training difficult. Therefore, a self-supervised learning method based on a dual-path approach for mask image modelling is introduced to generate valuable representations of images. In addition, a unique postprocessing method is proposed to reduce the false-positive rate and improve the sensitivity simultaneously. The experimental results demonstrate that our model achieves promising 3D segmentation and detection performance on our in-house dataset. Our code is available at: https://github.com/magnetliu/dstc-net.
Title: ABVS breast tumour segmentation via integrating CNN with dilated sampling self-attention and feature interaction Transformer. Journal: Neural Networks, vol. 187, Article 107312.
Pub Date: 2025-03-03 | DOI: 10.1016/j.neunet.2025.107309
Wei Zhou, Kang Lin, Zhijie Zheng, Dihu Chen, Tao Su, Haifeng Hu
The objective of the multi-label image classification (MLIC) task is to simultaneously identify multiple objects present in an image. Several researchers directly flatten 2D feature maps into 1D grid feature sequences and use a Transformer encoder to capture the correlations of grid features and learn object relationships. Although they obtain promising results, these Transformer-based methods lose spatial information. In addition, current attention-based models often focus only on salient feature regions and ignore other potentially useful features that contribute to the MLIC task. To tackle these problems, we present a novel Dual Relation Transformer Network (DRTN) for the MLIC task, which can be trained in an end-to-end manner. Concretely, to compensate for the loss of spatial information of grid features resulting from the flattening operation, we adopt a grid aggregation scheme to generate pseudo-region features, which does not require additional expensive annotations to train an object detector. Then, a new dual relation enhancement (DRE) module is proposed to capture correlations between objects using two different visual features, thereby combining the advantages of both grid and pseudo-region features. After that, we design a new feature enhancement and erasure (FEE) module to learn discriminative features and mine additional potentially valuable features. By using an attention mechanism to discover the most salient feature regions and removing them with a region-level erasure strategy, our FEE module is able to mine other potentially useful features from the remaining parts. Further, we devise a novel contrastive learning (CL) module to encourage the foregrounds of salient and potential features to be closer, while pushing their foregrounds further away from background features. This approach compels our model to learn discriminative and valuable features more comprehensively. Extensive experiments demonstrate that the DRTN method surpasses current MLIC models on three challenging benchmarks, i.e., the MS-COCO 2014, PASCAL VOC 2007, and NUS-WIDE datasets.
{"title":"DRTN: Dual Relation Transformer Network with feature erasure and contrastive learning for multi-label image classification","authors":"Wei Zhou, Kang Lin, Zhijie Zheng, Dihu Chen, Tao Su, Haifeng Hu","doi":"10.1016/j.neunet.2025.107309","DOIUrl":"10.1016/j.neunet.2025.107309","url":null,"abstract":"<div><div>The objective of multi-label image classification (MLIC) task is to simultaneously identify multiple objects present in an image. Several researchers directly flatten 2D feature maps into 1D grid feature sequences, and utilize Transformer encoder to capture the correlations of grid features to learn object relationships. Although obtaining promising results, these Transformer-based methods lose spatial information. In addition, current attention-based models often focus only on salient feature regions, but ignore other potential useful features that contribute to MLIC task. To tackle these problems, we present a novel <strong>D</strong>ual <strong>R</strong>elation <strong>T</strong>ransformer <strong>N</strong>etwork (<strong>DRTN</strong>) for MLIC task, which can be trained in an end-to-end manner. Concretely, to compensate for the loss of spatial information of grid features resulting from the flattening operation, we adopt a grid aggregation scheme to generate pseudo-region features, which does not need to make additional expensive annotations to train object detector. Then, a new dual relation enhancement (DRE) module is proposed to capture correlations between objects using two different visual features, thereby complementing the advantages provided by both grid and pseudo-region features. After that, we design a new feature enhancement and erasure (FEE) module to learn discriminative features and mine additional potential valuable features. 
By using attention mechanism to discover the most salient feature regions and removing them with region-level erasure strategy, our FEE module is able to mine other potential useful features from the remaining parts. Further, we devise a novel contrastive learning (CL) module to encourage the foregrounds of salient and potential features to be closer, while pushing their foregrounds further away from background features. This manner compels our model to learn discriminative and valuable features more comprehensively. Extensive experiments demonstrate that DRTN method surpasses current MLIC models on three challenging benchmarks, <em>i.e.</em>, MS-COCO 2014, PASCAL VOC 2007, and NUS-WIDE datasets.</div></div>","PeriodicalId":49763,"journal":{"name":"Neural Networks","volume":"187 ","pages":"Article 107309"},"PeriodicalIF":6.0,"publicationDate":"2025-03-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143552993","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
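As a rough illustration of two ideas in the DRTN abstract, the sketch below shows grid aggregation into pseudo-region features and attention-guided region erasure. The abstract does not specify either operation in detail, so everything here is an assumption: the function names `aggregate_grid` and `erase_most_salient` are hypothetical, aggregation is taken to be average pooling over non-overlapping windows, and erasure is taken to zero a square around the attention peak.

```python
def aggregate_grid(grid, window):
    """Average-pool an H x W grid of feature vectors over non-overlapping
    window x window blocks, giving one pseudo-region feature per block.
    (Assumed aggregation rule; the paper's scheme may differ.)"""
    h, w = len(grid), len(grid[0])
    dim = len(grid[0][0])
    regions = []
    for top in range(0, h, window):
        for left in range(0, w, window):
            cells = [grid[r][c]
                     for r in range(top, min(top + window, h))
                     for c in range(left, min(left + window, w))]
            regions.append([sum(cell[k] for cell in cells) / len(cells)
                            for k in range(dim)])
    return regions


def erase_most_salient(grid, attention, radius):
    """Return a copy of the grid with features zeroed in a square of the
    given radius centred on the highest-attention cell."""
    h, w = len(grid), len(grid[0])
    peak_r, peak_c = max(((r, c) for r in range(h) for c in range(w)),
                         key=lambda rc: attention[rc[0]][rc[1]])
    erased = [[list(cell) for cell in row] for row in grid]
    for r in range(max(0, peak_r - radius), min(h, peak_r + radius + 1)):
        for c in range(max(0, peak_c - radius), min(w, peak_c + radius + 1)):
            erased[r][c] = [0.0] * len(grid[r][c])
    return erased


# Toy 2 x 2 grid with one-dimensional features.
grid = [[[1.0], [3.0]],
        [[5.0], [7.0]]]
attention = [[0.1, 0.9],
             [0.2, 0.3]]
pseudo_regions = aggregate_grid(grid, window=2)
remaining = erase_most_salient(grid, attention, radius=0)
```

On this toy input the single pseudo-region is the mean of all four cells, and the erased copy keeps every cell except the one with the highest attention score, which is what lets a later stage look for features outside the most salient region.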
Learning with neural networks from a continuous stream of visual information presents several challenges due to the non-i.i.d. nature of the data. However, it also offers novel opportunities to develop representations that are consistent with the information flow. In this paper we investigate the case of unsupervised continual learning of pixel-wise features subject to multiple motion-induced constraints, which we therefore call motion-conjugated feature representations. Unlike in existing approaches, motion is not a given signal (either ground-truth or estimated by external modules), but is the outcome of a progressive and autonomous learning process, occurring at various levels of the feature hierarchy. Multiple motion flows are estimated with neural networks and characterized by different levels of abstraction, spanning from traditional optical flow to other latent signals originating from higher-level features, hence called higher-order motions. Continuously learning to develop consistent multi-order flows and representations is prone to trivial solutions, which we counteract by introducing a self-supervised contrastive loss that is spatially aware and based on flow-induced similarity. We assess our model on photorealistic synthetic streams and real-world videos, comparing it against pre-trained state-of-the-art feature extractors (also based on Transformers) and recent unsupervised learning models, and it significantly outperforms these alternatives.
{"title":"Continual learning of conjugated visual representations through higher-order motion flows","authors":"Simone Marullo , Matteo Tiezzi , Marco Gori , Stefano Melacci","doi":"10.1016/j.neunet.2025.107296","DOIUrl":"10.1016/j.neunet.2025.107296","url":null,"abstract":"<div><div>Learning with neural networks from a continuous stream of visual information presents several challenges due to the non-i.i.d. nature of the data. However, it also offers novel opportunities to develop representations that are consistent with the information flow. In this paper we investigate the case of unsupervised continual learning of pixel-wise features subject to multiple motion-induced constraints, therefore named <em>motion-conjugated feature representations</em>. Differently from existing approaches, motion is not a given signal (either ground-truth or estimated by external modules), but is the outcome of a progressive and autonomous learning process, occurring at various levels of the feature hierarchy. Multiple motion flows are estimated with neural networks and characterized by different levels of abstractions, spanning from traditional optical flow to other latent signals originating from higher-level features, hence called higher-order motions. Continuously learning to develop consistent multi-order flows and representations is prone to trivial solutions, which we counteract by introducing a self-supervised contrastive loss, spatially-aware and based on flow-induced similarity. 
We assess our model on photorealistic synthetic streams and real-world videos, comparing to pre-trained state-of-the art feature extractors (also based on Transformers) and to recent unsupervised learning models, significantly outperforming these alternatives.</div></div>","PeriodicalId":49763,"journal":{"name":"Neural Networks","volume":"187 ","pages":"Article 107296"},"PeriodicalIF":6.0,"publicationDate":"2025-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143563701","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
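To make the idea of a contrastive loss driven by flow-induced similarity more concrete, here is a minimal sketch. The abstract of the motion-conjugated representations paper does not give the loss formula, so this uses a generic InfoNCE-style objective as a stand-in: the positive for a pixel at time t is the pixel its flow vector points to in the next frame, and all other pixels in that frame act as negatives. The names `cosine` and `flow_contrastive_loss`, the dictionary layout, integer displacements, and the temperature value are all assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)


def flow_contrastive_loss(feat_t, feat_next, flow, temperature=0.1):
    """InfoNCE-style loss over pixels.

    feat_t, feat_next: {(row, col): feature vector} for two frames.
    flow: {(row, col): (d_row, d_col)} integer displacement per pixel.
    """
    losses = []
    for (r, c), feature in feat_t.items():
        d_row, d_col = flow[(r, c)]
        target = (r + d_row, c + d_col)
        if target not in feat_next:
            continue  # the pixel moved out of the frame
        logits = {pos: cosine(feature, other) / temperature
                  for pos, other in feat_next.items()}
        log_denominator = math.log(sum(math.exp(v) for v in logits.values()))
        losses.append(log_denominator - logits[target])
    return sum(losses) / len(losses) if losses else 0.0


# Static two-pixel scene: zero flow is correct, swapped flow is wrong.
frame = {(0, 0): [1.0, 0.0], (0, 1): [0.0, 1.0]}
correct_flow = {(0, 0): (0, 0), (0, 1): (0, 0)}
wrong_flow = {(0, 0): (0, 1), (0, 1): (0, -1)}
loss_correct = flow_contrastive_loss(frame, frame, correct_flow)
loss_wrong = flow_contrastive_loss(frame, frame, wrong_flow)
```

With features held fixed, the loss is low only when the flow pairs each pixel with a similar feature in the next frame. Because negatives must stay dissimilar, mapping every pixel to the same feature vector does not minimize it, which is the kind of trivial solution the abstract says the contrastive term is meant to counteract.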
Pub Date : 2025-03-01 DOI: 10.1016/j.neunet.2025.107307
Zhi Pang, Lina Wang, Fangchao Yu, Kai Zhao, Bo Zeng, Shuwang Xu
The marriage of deep neural networks (DNNs) and secure 2-party computation (2PC) enables private inference (PI) on encrypted client-side data and server-side models with both privacy and accuracy guarantees, at the cost of orders-of-magnitude communication and latency penalties. Prior works on designing PI-friendly network architectures are confined to mitigating the overheads associated with non-linear (e.g., ReLU) operations, assuming that linear computations are free. Recent works have shown that linear convolutions can no longer be ignored and are responsible for the majority of communication in PI protocols. In this work, we present PrivCore, a framework that jointly optimizes the alternating linear and non-linear DNN operators via a careful co-design of sparse Winograd convolution and fine-grained activation reduction, making ciphertext computation more efficient without degrading inference accuracy. Specifically, being aware of the incompatibility between spatial pruning and Winograd convolution, we propose a two-tiered Winograd-aware structured pruning method that removes spatial filters and Winograd vectors in a coarse-to-fine manner to reduce multiplications, both of which are specifically optimized for Winograd convolution in a structured pattern. PrivCore further develops a novel sensitivity-based differentiable activation approximation to automate the selection of ineffectual ReLUs and polynomial options. PrivCore also supports the dynamic determination of coefficient-adaptive polynomial replacement to mitigate the accuracy degradation. Extensive experiments on various models and datasets consistently validate the effectiveness of PrivCore, achieving 2.2× communication reduction with 1.8% higher accuracy compared with SENet (ICLR 2023) on CIFAR-100, and 2.0× total communication reduction with iso-accuracy compared with CoPriv (NeurIPS 2023) on ImageNet.
{"title":"PrivCore: Multiplication-activation co-reduction for efficient private inference","authors":"Zhi Pang, Lina Wang, Fangchao Yu, Kai Zhao, Bo Zeng, Shuwang Xu","doi":"10.1016/j.neunet.2025.107307","DOIUrl":"10.1016/j.neunet.2025.107307","url":null,"abstract":"<div><div>The marriage of deep neural network (DNN) and secure 2-party computation (2PC) enables private inference (PI) on the encrypted client-side data and server-side models with both privacy and accuracy guarantees, coming at the cost of orders of magnitude communication and latency penalties. Prior works on designing PI-friendly network architectures are confined to mitigating the overheads associated with non-linear (e.g., ReLU) operations, assuming other linear computations are free. Recent works have shown that linear convolutions can no longer be ignored and are responsible for the majority of communication in PI protocols. In this work, we present <span>PrivCore</span>, a framework that jointly optimizes the alternating linear and non-linear DNN operators via a careful co-design of sparse Winograd convolution and fine-grained activation reduction, to improve high-efficiency ciphertext computation without impacting the inference precision. Specifically, being aware of the incompatibility between the spatial pruning and Winograd convolution, we propose a two-tiered Winograd-aware structured pruning method that removes spatial filters and Winograd vectors from coarse to fine-grained for multiplication reduction, both of which are specifically optimized for Winograd convolution in a structured pattern. <span>PrivCore</span> further develops a novel sensitivity-based differentiable activation approximation to automate the selection of ineffectual ReLUs and polynomial options. <span>PrivCore</span> also supports the dynamic determination of coefficient-adaptive polynomial replacement to mitigate the accuracy degradation. 
Extensive experiments on various models and datasets consistently validate the effectiveness of <span>PrivCore</span>, achieving <span><math><mrow><mn>2</mn><mo>.</mo><mn>2</mn><mo>×</mo></mrow></math></span> communication reduction with 1.8% higher accuracy compared with SENet (ICLR 2023) on CIFAR-100, and <span><math><mrow><mn>2</mn><mo>.</mo><mn>0</mn><mo>×</mo></mrow></math></span> total communication reduction with iso-accuracy compared with CoPriv (NeurIPS 2023) on ImageNet.</div></div>","PeriodicalId":49763,"journal":{"name":"Neural Networks","volume":"187 ","pages":"Article 107307"},"PeriodicalIF":6.0,"publicationDate":"2025-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143563700","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
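The PrivCore abstract builds on Winograd convolution, which trades multiplications for additions. The sketch below shows the standard one-dimensional F(2,3) case: two outputs of a 3-tap filter computed with 4 multiplications instead of 6. The `keep` mask is an assumption added purely to illustrate why zeroing entries in the Winograd domain removes multiplications; it is not PrivCore's pruning method, and the function names are hypothetical.

```python
def direct_conv_f23(d, g):
    """Two outputs of a 3-tap filter over four inputs: 6 multiplications."""
    return [d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
            d[1] * g[0] + d[2] * g[1] + d[3] * g[2]]


def winograd_f23(d, g, keep=(True, True, True, True)):
    """Winograd F(2,3): the same two outputs with 4 multiplications.

    keep[i] = False skips the i-th Winograd-domain product, mimicking a
    pruned (zeroed) transformed weight. The result is then approximate.
    """
    # Filter transform; it depends only on g, so it can be precomputed.
    u = [g[0], (g[0] + g[1] + g[2]) / 2, (g[0] - g[1] + g[2]) / 2, g[2]]
    # Input transform uses only additions and subtractions.
    v = [d[0] - d[2], d[1] + d[2], d[2] - d[1], d[1] - d[3]]
    # Element-wise products are the only input-dependent multiplications.
    m = [u[i] * v[i] if keep[i] else 0.0 for i in range(4)]
    # Output transform, again additions and subtractions only.
    return [m[0] + m[1] + m[2], m[1] - m[2] - m[3]]


signal = [1, 2, 3, 4]
taps = [2, 4, 6]
exact = winograd_f23(signal, taps)
pruned = winograd_f23(signal, taps, keep=(True, True, False, True))
```

Here `exact` matches `direct_conv_f23(signal, taps)`, while `pruned` drops one product and deviates slightly from it. In a 2PC setting each remaining multiplication costs communication, which is the motivation the abstract gives for reducing them; how to choose which Winograd-domain entries to remove without hurting accuracy is the part addressed by the paper and not by this sketch.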