S2CA: Shared Concept Prototypes and Concept-level Alignment for text–video retrieval
Pub Date: 2024-11-08 | DOI: 10.1016/j.neucom.2024.128851
Yuxiao Li, Yu Xin, Jiangbo Qian, Yihong Dong
Text–video retrieval, as a fundamental task of cross-modal learning, relies on effectively establishing the semantic association between text and video. Mainstream semantic alignment methods currently adopt instance-level alignment strategies, ignoring fine-grained concept associations and the "concept-level alignment" characteristic of text–video pairs. To address this, we propose Shared Concept Prototypes and Concept-level Alignment (S2CA). Specifically, we utilize a text–video Shared Concept Prototypes mechanism to bridge the correspondence between text and video. On this basis, we use cross-attention and Gumbel-softmax to obtain Discrete Concept Allocation Matrices and assign text and video tokens to the corresponding concept prototypes. In this way, texts and videos are decoupled into multiple Conceptual Aggregated Features, thereby achieving Concept-level Alignment. In addition, we use CLIP as the teacher model and adopt an Align-Transform-Reconstruct distillation framework to strengthen multimodal semantic learning. Extensive experiments on MSR-VTT, DiDeMo, ActivityNet and MSVD demonstrate the effectiveness of our method.
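As a rough illustration of the allocation step described above, the sketch below assigns token features to K shared prototypes using cross-attention logits and a straight-through Gumbel-softmax. The tensor shapes, scaling, and mean-pooling aggregation are assumptions for illustration; the abstract does not specify these details.

```python
import torch
import torch.nn.functional as F

def allocate_to_prototypes(tokens: torch.Tensor, prototypes: torch.Tensor, tau: float = 1.0):
    # tokens: (N, D) text or video token features; prototypes: (K, D) shared concept prototypes.
    d = prototypes.shape[-1]
    logits = tokens @ prototypes.t() / d ** 0.5              # (N, K) cross-attention logits
    # Straight-through Gumbel-softmax: discrete one-hot allocation in the forward
    # pass, differentiable soft assignment in the backward pass.
    alloc = F.gumbel_softmax(logits, tau=tau, hard=True)     # (N, K) allocation matrix
    counts = alloc.sum(dim=0).clamp(min=1.0).unsqueeze(-1)   # (K, 1) tokens per prototype
    concept_feats = alloc.t() @ tokens / counts              # (K, D) aggregated features
    return alloc, concept_feats
```

With text and video tokens both decoupled this way, concept-level alignment would amount to comparing the per-prototype aggregated features of the two modalities, e.g. with cosine similarity.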
{"title":"S2CA: Shared Concept Prototypes and Concept-level Alignment for text–video retrieval","authors":"Yuxiao Li, Yu Xin, Jiangbo Qian, Yihong Dong","doi":"10.1016/j.neucom.2024.128851","DOIUrl":"10.1016/j.neucom.2024.128851","url":null,"abstract":"<div><div>Text–video retrieval, as a fundamental task of cross-modal learning, relies on effectively establishing the semantic association between text and video. At present, mainstream semantic alignment methods for text–video adopt instance-level alignment strategies, ignoring the fine-grained concept association and the “concept-level alignment” characteristics of text–video. In this regard, we propose <strong>S</strong>hared <strong>C</strong>oncept Prototypes and <strong>C</strong>oncept-level <strong>A</strong>lignment (<strong>S2CA</strong>) to achieve concept-level alignment. Specifically, we utilize the text–video <strong>Shared Concept Prototypes</strong> mechanism to bridge the correspondence between text and video. On this basis, we use cross-attention and Gumbel-softmax to obtain <strong>Discrete Concept Allocation Matrices</strong> and then assign text and video tokens to corresponding concept prototypes. In this way, texts and videos are decoupled into multiple <strong>Conceptual Aggregated Features</strong>, thereby achieving <strong>Concept-level Alignment</strong>. In addition, we use CLIP as the teacher model and adopt the Align-Transform-Reconstruct distillation framework to strengthen the multimodal semantic learning ability. The extensive experiments on MSR-VTT, DiDeMo, ActivityNet and MSVD prove the effectiveness of our method.</div></div>","PeriodicalId":19268,"journal":{"name":"Neurocomputing","volume":"614 ","pages":"Article 128851"},"PeriodicalIF":5.5,"publicationDate":"2024-11-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142660761","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An analysis of pre-trained stable diffusion models through a semantic lens
Pub Date: 2024-11-08 | DOI: 10.1016/j.neucom.2024.128846
Simone Bonechi, Paolo Andreini, Barbara Toniella Corradini, Franco Scarselli
Recently, generative models for images have garnered remarkable attention due to their effective generalization ability and their capability to generate highly detailed and realistic content. Indeed, the success of generative networks (e.g., BigGAN, StyleGAN, Diffusion Models) has driven researchers to develop increasingly powerful models. As a result, we have observed an unprecedented improvement in terms of both image resolution and realism, making generated images indistinguishable from real ones. In this work, we focus on a family of generative models known as Stable Diffusion Models (SDMs), which have recently emerged due to their ability to generate images in a multimodal setup (i.e., from a textual prompt) and have outperformed adversarial networks by learning to reverse a diffusion process. Given the complexity of these models, which makes them hard to retrain, researchers have started to exploit pre-trained SDMs for downstream tasks (e.g., classification and segmentation), where semantics plays a fundamental role. In this context, understanding how well the model preserves semantic information may be crucial to improving its performance.
This paper presents an approach aimed at providing insights into the properties of a pre-trained SDM through a semantic lens. In particular, we analyze the features extracted by the U-Net within an SDM to explore whether and how the semantic information of an image is preserved in its internal representation. For this purpose, different distance measures are compared, and an ablation study is performed to select the layer (or combination of layers) of the U-Net that best preserves the semantic information. We also seek to understand whether semantics are preserved when the image undergoes simple transformations (e.g., rotation, flip, scale, padding, crop, and shift) and for different numbers of diffusion denoising steps. To evaluate these properties, we consider popular benchmarks for semantic segmentation tasks (e.g., COCO and Pascal-VOC). Our experiments suggest that the first encoder layer at 16×16 resolution effectively preserves semantic information. However, increasing inference steps (even for a minimal amount of noise) and applying various image transformations can affect the diffusion U-Net’s internal feature representation. Additionally, we present examples taken from a video benchmark (the DAVIS dataset), where we investigate whether an object instance within a video preserves its internal representation across several frames. Our findings suggest that the internal object representation remains consistent across multiple frames in a video, as long as the configuration changes are not excessive.
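A minimal sketch of the distance comparison is shown below, assuming feature maps have already been extracted from a chosen U-Net layer; which layer (or combination) to read is precisely what the paper's ablation selects. The two measures are illustrative choices, not the paper's exact set.

```python
import torch
import torch.nn.functional as F

def semantic_distance(feat_a: torch.Tensor, feat_b: torch.Tensor, measure: str = "cosine"):
    # feat_a, feat_b: (C, H, W) U-Net feature maps of an image and a transformed copy.
    a, b = feat_a.flatten(), feat_b.flatten()
    if measure == "cosine":
        return 1.0 - F.cosine_similarity(a, b, dim=0)     # 0 when features match exactly
    if measure == "l2":
        return torch.dist(a, b, p=2) / a.numel() ** 0.5   # RMS-normalized L2 distance
    raise ValueError(f"unknown measure: {measure}")
```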
{"title":"An analysis of pre-trained stable diffusion models through a semantic lens","authors":"Simone Bonechi , Paolo Andreini , Barbara Toniella Corradini , Franco Scarselli","doi":"10.1016/j.neucom.2024.128846","DOIUrl":"10.1016/j.neucom.2024.128846","url":null,"abstract":"<div><div>Recently, generative models for images have garnered remarkable attention, due to their effective generalization ability and their capability to generate highly detailed and realistic content. Indeed, the success of generative networks (<em>e.g.</em>, BigGAN, StyleGAN, Diffusion Models) has driven researchers to develop increasingly powerful models. As a result, we have observed an unprecedented improvement in terms of both image resolution and realism, making generated images indistinguishable from real ones. In this work, we focus on a family of generative models known as Stable Diffusion Models (SDMs), which have recently emerged due to their ability to generate images in a multimodal setup (<em>i.e.</em>, from a textual prompt) and have outperformed adversarial networks by learning to reverse a diffusion process. Given the complexity of these models that makes it hard to retrain them, researchers started to exploit pre-trained SDMs to perform downstream tasks (<em>e.g.</em>, classification and segmentation), where semantics plays a fundamental role. In this context, <em>understanding how well the model preserves semantic information may be crucial to improve its performance.</em></div><div>This paper presents an approach aimed at providing insights into the properties of a pre-trained SDM through the semantic lens. In particular, we analyze the features extracted by the U-Net within a SDM to explore whether and how the semantic information of an image is preserved in its internal representation. For this purpose, different distance measures are compared, and an ablation study is performed to select the layer (or combination of layers) of the U-Net that best preserves the semantic information. We also seek to understand whether semantics are preserved when the image undergoes simple transformations (<em>e.g.</em>, rotation, flip, scale, padding, crop, and shift) and for a different number of diffusion denoising steps. To evaluate these properties, we consider popular benchmarks for semantic segmentation tasks (<em>e.g.</em>, COCO, and Pascal-VOC). Our experiments suggest that the first encoder layer at <span><math><mrow><mn>16</mn><mi>×</mi><mn>16</mn></mrow></math></span> resolution effectively preserves semantic information. However, increasing inference steps (even for a minimal amount of noise) and applying various image transformations can affect the diffusion U-Net’s internal feature representation. Additionally, we propose some examples taken from a video benchmark (DAVIS dataset), where we investigate if an object instance within a video preserves its internal representation even after several frames. 
Our findings suggest that the internal object representation remains consistent across multiple frames in a video, as long as the configuration changes are not excessive.</div></div>","PeriodicalId":19268,"journal":{"name":"Neurocomputing","volume":"614 ","pages":"Article 128846"},"PeriodicalIF":5.5,"publicationDate":"2024-11-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142660577","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fixed-time event-triggered pinning synchronization of complex network via aperiodically intermittent control
Pub Date: 2024-11-08 | DOI: 10.1016/j.neucom.2024.128818
Junru Zhang, Jian-An Wang, Jie Zhang, Mingjie Li, Zhicheng Zhao, Xinyu Wen
This paper studies the fixed-time event-triggered pinning synchronization problem for complex networks using aperiodically intermittent control. A novel fixed-time stability lemma with an aperiodic intermittent characteristic is first proposed. By designing an appropriate event-triggered aperiodically intermittent pinning controller (ETAIPC) based on the average control rate, several conditions are derived to ensure fixed-time synchronization. The upper bound on the settling time is independent of initial values and depends only on the design parameters, network size, and node dimension. A simple-to-execute selection algorithm is adopted to update the pinning node set. Zeno behavior is also excluded through a rigorous theoretical analysis. Simulation examples demonstrate the efficacy of the proposed method.
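For context, the classical fixed-time stability bound that such lemmas extend (here, to the aperiodically intermittent setting) is the standard Polyakov-type result below; this is background, not the paper's specific lemma.

```latex
% If a Lyapunov function satisfies
\dot{V}(x) \le -a\,V^{p}(x) - b\,V^{q}(x), \quad a,b>0,\; 0<p<1<q,
% then the origin is fixed-time stable with settling time bounded independently
% of the initial condition:
T(x_0) \;\le\; T_{\max} \;=\; \frac{1}{a\,(1-p)} \;+\; \frac{1}{b\,(q-1)}.
```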
{"title":"Fixed-time event-triggered pinning synchronization of complex network via aperiodically intermittent control","authors":"Junru Zhang, Jian-An Wang, Jie Zhang, Mingjie Li, Zhicheng Zhao, Xinyu Wen","doi":"10.1016/j.neucom.2024.128818","DOIUrl":"10.1016/j.neucom.2024.128818","url":null,"abstract":"<div><div>This paper studies the fixed-time event-triggered pinning synchronization problem for complex network using aperiodic intermittent control. A novel fixed-time stability lemma with aperiodic intermittent characteristic is first proposed. By designing appropriate event-triggered aperiodic intermittent pinning controller (ETAIPC) based on the average control rate, several conditions are derived to ensure the fixed-time synchronization. The upper bound of setting-time is independent of any initial values and only concerns with design parameters, network size and node dimension. A simple to execute selection algorithm is adopted to renew the pinning node set. The Zeno behavior is also excluded through a rigorous theoretical analysis. Simulation examples are employed to demonstrate the efficacy of the obtained method.</div></div>","PeriodicalId":19268,"journal":{"name":"Neurocomputing","volume":"614 ","pages":"Article 128818"},"PeriodicalIF":5.5,"publicationDate":"2024-11-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142660584","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Minimum control of cluster synchronization effort in diffusion coupled nonlinear networks
Pub Date: 2024-11-08 | DOI: 10.1016/j.neucom.2024.128841
Jinkui Zhang, Shidong Zhai, Wei Zhu
This paper studies the design of minimal control effort for cluster synchronization (CS) in a diffusion-coupled nonlinear network over a directed graph. Under the conditions that the directed graph satisfies the cluster input equivalence condition and the system possesses a bounded Jacobian matrix, we obtain CS of diffusion-coupled nonlinear networks with non-diagonal coupling matrices. Based on the matrix measure and the balancing theorem, we obtain local minimization controllers for the minimal control effort of CS. Finally, the theoretical results are validated through a numerical example involving a network of coupled FitzHugh–Nagumo neurons with a general interaction topology.
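For reference, the matrix measure (logarithmic norm) that underpins this kind of analysis is the standard definition below, stated for an induced matrix norm:

```latex
\mu(A) \;=\; \lim_{h \to 0^{+}} \frac{\lVert I + hA \rVert - 1}{h}
```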
{"title":"Minimum control of cluster synchronization effort in diffusion coupled nonlinear networks","authors":"Jinkui Zhang , Shidong Zhai , Wei Zhu","doi":"10.1016/j.neucom.2024.128841","DOIUrl":"10.1016/j.neucom.2024.128841","url":null,"abstract":"<div><div>This paper studies the design of minimal control effort for cluster synchronization (CS) in a diffusion-coupled nonlinear network under directed graph. Under the conditions that the directed graph satisfies the cluster input equivalence condition and the system possesses a bounded Jacobian matrix, we obtain CS of diffusion-coupled nonlinear network with non-diagonal coupling matrix. Based on matrix measure and balancing theorem, we obtain the local minimization controllers for the minimal control effort of CS. Finally, the theoretical results are validated through a numerical example involving a network of coupled FitzHugh–Nagumo neurons with a general topology of interactions.</div></div>","PeriodicalId":19268,"journal":{"name":"Neurocomputing","volume":"614 ","pages":"Article 128841"},"PeriodicalIF":5.5,"publicationDate":"2024-11-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142660758","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Aircraft trajectory prediction in terminal airspace with intentions derived from local history
Pub Date: 2024-11-08 | DOI: 10.1016/j.neucom.2024.128843
Yifang Yin, Sheng Zhang, Yicheng Zhang, Yi Zhang, Shili Xiang
Aircraft trajectory prediction aims to estimate the future movements of aircraft in a scene, a crucial step for intelligent air traffic management tasks such as capacity estimation and conflict detection. Current approaches primarily rely on absolute locations as input, which improves prediction accuracy but limits the model’s ability to generalize to unseen environments. To bridge this gap, we propose instead to learn an aircraft’s intentions from a repository of historical trajectories. Based on the observation that aircraft traveling through the same airspace may exhibit comparable behaviors, we utilize a location-adaptive threshold to identify nearby neighbors of a given query aircraft within the repository. The retrieved candidates are then filtered based on contextual information, such as landing time and landing direction, to eliminate less relevant components. The resulting set of nearby candidates is referred to as the local history, which emphasizes the modeling of the aircraft’s local behavior. Moreover, an attention-based local history encoder is presented to aggregate information from all nearby candidates into a latent feature that captures the aircraft’s intention. This latent feature is robust for input trajectories normalized relative to the current location of the target aircraft, thus improving the model’s generalization capability to unseen areas. Our intention modeling method is model-agnostic and can be leveraged as an additional condition by any trajectory prediction model for improved robustness and accuracy. For evaluation, we integrate the intention modeling component into our previous diffusion-based aircraft trajectory prediction framework. We conduct experiments on two real-world aircraft trajectory datasets covering both towered and non-towered terminal airspace. The experimental results show that our method captures various maneuvering patterns effectively, outperforming existing methods by a large margin in terms of both ADE and FDE.
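A hypothetical sketch of the retrieval step follows; the threshold-adaptation rule and the metadata field (`landing_dir`) are invented for illustration and stand in for the paper's actual choices.

```python
import numpy as np

def retrieve_local_history(query_pos, query_meta, repo_pos, repo_meta, base_radius=2.0):
    # Location-adaptive threshold: here the radius simply widens with distance
    # from the airfield origin -- an assumed rule, not the paper's formula.
    radius = base_radius * (1.0 + 0.1 * np.linalg.norm(query_pos[:2]))
    dists = np.linalg.norm(repo_pos - query_pos, axis=1)   # distance to each stored point
    nearby = np.where(dists < radius)[0]
    # Context filtering: keep only candidates with a compatible landing direction.
    return [i for i in nearby
            if repo_meta[i]["landing_dir"] == query_meta["landing_dir"]]
```

The surviving candidates (the "local history") would then be fed to the attention-based encoder to produce the intention feature.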
{"title":"Aircraft trajectory prediction in terminal airspace with intentions derived from local history","authors":"Yifang Yin, Sheng Zhang, Yicheng Zhang, Yi Zhang, Shili Xiang","doi":"10.1016/j.neucom.2024.128843","DOIUrl":"10.1016/j.neucom.2024.128843","url":null,"abstract":"<div><div>Aircraft trajectory prediction aims to estimate the future movements of aircraft in a scene, which is a crucial step for intelligent air traffic management such as capacity estimation and conflict detection. Current approaches primarily rely on inputting absolute locations, which improves the prediction accuracy but limits the model’s generalization ability to unseen environments. To bridge the gap, we propose to alternatively learn aircraft’s intentions from a repository of historical trajectories. Based on the observation that aircraft traveling through the same airspace may exhibit comparable behaviors, we utilize a location-adaptive threshold to identify nearby neighbors for a given query aircraft within the repository. The retrieved candidates are next filtered based on contextual information, such as landing time and landing direction, to eliminate less relevant components. The resulting set of nearby candidates are referred to as the local history, which emphasizes the modeling of aircraft’s local behavior. Moreover, an attention-based local history encoder is presented to aggregate information from all nearby candidates to generate a latent feature for capturing the aircraft’s intention. This latent feature is robust to normalized input trajectories, relative to the current location of the target aircraft, thus improving the model’s generalization capability to unseen areas. Our proposed intention modeling method is model-agnostic, which can be leveraged as an additional condition by any trajectory prediction model for improved robustness and accuracy. For evaluation, we integrate the intention modeling component into our previous diffusion-based aircraft trajectory prediction framework. We conduct experiments on two real-world aircraft trajectory datasets in both towered and non-towered terminal airspace. The experimental results show that our method captures various maneuvering patterns effectively, outperforming existing methods by a large margin in terms of both ADE and FDE.</div></div>","PeriodicalId":19268,"journal":{"name":"Neurocomputing","volume":"615 ","pages":"Article 128843"},"PeriodicalIF":5.5,"publicationDate":"2024-11-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142652694","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Adversarial contrastive representation training with external knowledge injection for zero-shot stance detection
Pub Date: 2024-11-08 | DOI: 10.1016/j.neucom.2024.128849
Yifan Ding, Ying Lei, Anqi Wang, Xiangrun Liu, Tuanfei Zhu, Yizhou Li
Zero-shot stance detection (ZSSD) involves identifying the author’s perspective on specific issues in text when the target topic has not been encountered during model training, in order to address rapidly evolving topics on social media. This paper introduces a ZSSD framework named KEL-CA. To enable the model to utilize transferable stance features more effectively for representing unseen targets, the framework incorporates a multi-layer contrastive learning and adversarial domain transfer module. Unlike traditional contrastive or adversarial learning, our framework captures both the correlations and the distinctions between invariant and specific features, as well as between different stance labels, and enhances the generalization ability and robustness of the features. To address the problem of insufficient information about the target context, we design a dual external knowledge injection module that uses a large language model (LLM) to extract external knowledge from a Wikipedia-based local knowledge base and a Chain-of-Thought (CoT) process to ensure the timeliness and relevance of the knowledge when inferring the stances of unseen targets. Experimental results demonstrate that our approach outperforms existing models on two benchmark datasets, validating its efficacy on ZSSD tasks.
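The contrastive component can be grounded with a standard InfoNCE-style loss; the sketch below is a generic building block of such objectives, not KEL-CA's actual multi-layer formulation.

```python
import torch
import torch.nn.functional as F

def info_nce(anchors: torch.Tensor, positives: torch.Tensor, temperature: float = 0.07):
    # anchors, positives: (B, D); row i of `positives` is the positive for row i of `anchors`.
    a = F.normalize(anchors, dim=-1)
    p = F.normalize(positives, dim=-1)
    logits = a @ p.t() / temperature                   # (B, B) similarity matrix
    labels = torch.arange(a.size(0), device=a.device)  # matched pairs lie on the diagonal
    return F.cross_entropy(logits, labels)             # pull positives together, push others apart
```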
{"title":"Adversarial contrastive representation training with external knowledge injection for zero-shot stance detection","authors":"Yifan Ding , Ying Lei , Anqi Wang , Xiangrun Liu , Tuanfei Zhu , Yizhou Li","doi":"10.1016/j.neucom.2024.128849","DOIUrl":"10.1016/j.neucom.2024.128849","url":null,"abstract":"<div><div>Zero-shot stance detection (ZSSD) is a task that involves identifying the author’s perspective on specific issues in text, particularly when the target topic has not been encountered during the model training process, to address rapidly evolving topics on social media. This paper introduces a ZSSD framework named KEL-CA. To enable the model to more effectively utilize transferable stance features for representing unseen targets, the framework incorporates a multi-layer contrastive learning and adversarial domain transfer module. Unlike traditional contrastive or adversarial learning, our framework captures both correlations and distinctions between invariant and specific features, as well as between different stance labels, and enhances the generalization ability and robustness of the features. Subsequently, to address the problem of insufficient information about the target context, we designed a dual external knowledge injection module that uses a large language model (LLM) to extract external knowledge from a Wikipedia-based local knowledge base and a Chain-of-Thought (COT) process to ensure the timeliness and relevance of the knowledge to infer the stances of unseen targets. Experimental results demonstrate that our approach outperforms existing models on two benchmark datasets, thereby validating its efficacy in ZSSD tasks.</div></div>","PeriodicalId":19268,"journal":{"name":"Neurocomputing","volume":"614 ","pages":"Article 128849"},"PeriodicalIF":5.5,"publicationDate":"2024-11-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142660760","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Deep neural networks for knowledge-enhanced molecular modeling
Pub Date: 2024-11-08 | DOI: 10.1016/j.neucom.2024.128838
Siyu Long, Jianyu Wu, Yi Zhou, Fan Sha, Xinyu Dai
Designing neural networks for molecular modeling is a crucial task in the field of artificial intelligence. The goal is to utilize neural networks to understand and design molecules, which has significant implications for drug development and other real-world applications. Recently, with the advancement of deep learning, molecular modeling has made considerable progress. However, current methods are primarily data-driven, overlooking the role of domain knowledge, such as molecular shapes, in the modeling process. In this paper, we systematically investigate how incorporating molecular shape knowledge can enhance molecular modeling. Specifically, we design two deep neural networks, ShapePred and ShapeGen, to utilize molecular shapes in molecule prediction and generation. Experimental results demonstrate that integrating shape knowledge can significantly improve model performance. Notably, ShapePred exhibits strong performance across 11 molecule prediction datasets, while ShapeGen can more efficiently generate high-quality drug molecules based on given target proteins.
{"title":"Deep neural networks for knowledge-enhanced molecular modeling","authors":"Siyu Long , Jianyu Wu , Yi Zhou , Fan Sha , Xinyu Dai","doi":"10.1016/j.neucom.2024.128838","DOIUrl":"10.1016/j.neucom.2024.128838","url":null,"abstract":"<div><div>Designing neural networks for molecular modeling is a crucial task in the field of artificial intelligence. The goal is to utilize neural networks to understand and design molecules, which has significant implications for drug development and other real-world applications. Recently, with the advancement of deep learning, molecular modeling has made considerable progress. However, current methods are primarily data-driven, overlooking the role of domain knowledge, such as molecular shapes, in the modeling process. In this paper, we systematically investigate how incorporating molecular shape knowledge can enhance molecular modeling. Specifically, we design two deep neural networks, <span>ShapePred</span> and <span>ShapeGen</span>, to utilize molecular shapes in molecule prediction and generation. Experimental results demonstrate that integrating shape knowledge can significantly improve model performance. Notably, <span>ShapePred</span> exhibits strong performance across 11 molecule prediction datasets, while <span>ShapeGen</span> can more efficiently generate high-quality drug molecules based on given target proteins.</div></div>","PeriodicalId":19268,"journal":{"name":"Neurocomputing","volume":"614 ","pages":"Article 128838"},"PeriodicalIF":5.5,"publicationDate":"2024-11-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142660666","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Invisible and robust watermarking model based on hierarchical residual fusion multi-scale convolution
Pub Date: 2024-11-08 | DOI: 10.1016/j.neucom.2024.128834
Jun-Zhuo Zou, Ming-Xuan Chen, Li-Hua Gong
In current deep learning based watermarking technologies, it remains challenging to fully integrate the features of the watermark and the cover image. Most watermarking models with fixed-size convolution kernels exhibit restricted feature extraction ability, leading to incomplete feature fusion. To address this issue, a hierarchical residual fusion multi-scale convolution (HRFMS) module is designed. The method extracts image features from various receptive fields and implements feature interaction through residual connections. To produce watermarked images with high visual quality and attack resistance, a watermarking model based on the HRFMS is devised to achieve multi-scale feature fusion. Moreover, to minimize the image distortion caused by the watermark information, an attention mask layer is designed to guide the distribution of the watermark information. The experimental results demonstrate that the invisibility and robustness of the HRFMSNet are excellent: the watermarked images are nearly visually indistinguishable from the cover images, their average peak signal-to-noise ratio is 37.13 dB, and most bit error rates of the decoded messages are below 0.02.
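The two reported metrics, PSNR and bit error rate, follow their standard definitions; a short sketch is given below, where the value range of the images (`max_val`) and the tensor layout are assumptions.

```python
import torch

def psnr(cover: torch.Tensor, marked: torch.Tensor, max_val: float = 1.0) -> torch.Tensor:
    # Peak signal-to-noise ratio between the cover and watermarked images.
    mse = torch.mean((cover - marked) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)

def bit_error_rate(message: torch.Tensor, decoded: torch.Tensor) -> torch.Tensor:
    # Fraction of watermark bits that were decoded incorrectly.
    return (message != decoded).float().mean()
```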
{"title":"Invisible and robust watermarking model based on hierarchical residual fusion multi-scale convolution","authors":"Jun-Zhuo Zou , Ming-Xuan Chen , Li-Hua Gong","doi":"10.1016/j.neucom.2024.128834","DOIUrl":"10.1016/j.neucom.2024.128834","url":null,"abstract":"<div><div>In current deep learning based watermarking technologies, it remains challenging to fully integrate the features of watermark and cover image. Most watermarking models with fixed-size kernel convolution exhibit restricted feature extraction ability, leading to incomplete feature fusion. To address this issue, a hierarchical residual fusion multi-scale convolution (HRFMS) module is designed. The method extracts image features from various receptive fields and implements feature interaction by residual connection. To produce watermarked image with high visual quality and attack resistance, a watermarking model based on the HRFMS is devised to achieve multi-scale feature fusion. Moreover, to minimize image distortion caused by watermark information, an attention mask layer is designed to guide the distribution of watermark information. The experimental results demonstrate that the invisibility and the robustness of the HRFMSNet are excellent. The watermarked images generated by the HRFMSNet are nearly visually indistinguishable from the cover images. The average peak signal-to-noise ratio of the watermarked images is 37.13 dB, and most of the bit error rates of the decoded messages are below 0.02.</div></div>","PeriodicalId":19268,"journal":{"name":"Neurocomputing","volume":"614 ","pages":"Article 128834"},"PeriodicalIF":5.5,"publicationDate":"2024-11-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142660752","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multi-attribute dynamic attenuation learning improved spiking actor network
Pub Date: 2024-11-08 | DOI: 10.1016/j.neucom.2024.128819
Rong Xiao, Zhiyuan Hu, Jie Zhang, Chenwei Tang, Jiancheng Lv
Deep reinforcement learning (DRL) has shown promising results in solving robotic control and decision tasks, as it can learn from high-dimensional state and action information well. Despite these successes, conventional neural-based DRL models are criticized for low energy efficiency, which hinders their wide application in low-power electronics. With more biologically plausible plasticity principles, spiking neural networks (SNNs) are now considered an energy-efficient and robust alternative. However, most existing dynamics and learning paradigms for spiking neurons, built on the common Leaky Integrate-and-Fire (LIF) neuron model, often result in relatively low efficiency and poor robustness. To address these limitations, we propose a multi-attribute dynamic attenuation learning improved spiking actor network (MADA-SAN) for reinforcement learning to achieve effective decision-making. The resistance, membrane voltage, and membrane current of the spiking neurons are updated with dynamic attenuation rather than fixed values. By enhancing the temporal dependencies among neurons, the model can learn the spatio-temporal relevance of complex continuous information well. Extensive experimental results show that MADA-SAN performs better than its counterpart deep actor network on six continuous control tasks from OpenAI Gym. Besides, we further validate that the proposed MADA-LIF achieves comparable performance with other state-of-the-art algorithms on the MNIST and DVS-Gesture recognition tasks.
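To make "dynamic attenuation" concrete, the sketch below shows a LIF neuron whose current and voltage decay factors are learnable parameters instead of fixed constants. This is an assumed simplification: MADA-LIF also adapts the resistance, and its exact attenuation dynamics are not specified in the abstract.

```python
import torch
import torch.nn as nn

class DynamicLIF(nn.Module):
    """LIF neuron with learnable (dynamic) attenuation of current and voltage."""

    def __init__(self, n: int, v_th: float = 1.0):
        super().__init__()
        self.tau_v = nn.Parameter(torch.full((n,), 0.9))  # voltage decay, learned
        self.tau_i = nn.Parameter(torch.full((n,), 0.8))  # current decay, learned
        self.v_th = v_th

    def forward(self, x, v, i):
        # Sigmoid keeps each decay factor in (0, 1) while it is being learned.
        i = self.tau_i.sigmoid() * i + x    # membrane current update
        v = self.tau_v.sigmoid() * v + i    # membrane voltage update
        spike = (v >= self.v_th).float()    # fire where threshold is crossed
        v = v * (1.0 - spike)               # hard reset after a spike
        return spike, v, i
```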
{"title":"Multi-attribute dynamic attenuation learning improved spiking actor network","authors":"Rong Xiao, Zhiyuan Hu, Jie Zhang, Chenwei Tang, Jiancheng Lv","doi":"10.1016/j.neucom.2024.128819","DOIUrl":"10.1016/j.neucom.2024.128819","url":null,"abstract":"<div><div>Deep reinforcement learning (DRL) has shown promising results in solving robotic control and decision tasks, which can learn the high-dimensional state and action information well. Despite their successes, conventional neural-based DRL models are criticized for low energy efficiency, making them laborious to be widely applied in low-power electronics. With more biologically plausible plasticity principles, spiking neural networks (SNNs) are now considered an energy-efficient and robust alternative. The most existing dynamics and learning paradigms for spiking neurons with a common Leaky Integrate-and-Fire (LIF) neuron model often result in relatively low efficiency and poor robustness. To address these limitations, we propose a multi-attribute dynamic attenuation learning improved spiking actor network (MADA-SAN) for reinforcement learning to achieve effective decision-making. The resistance, membrane voltage and membrane current of spiking neurons are updated from a fixed value into dynamic attenuation. By enhancing the temporal relation dependencies in neurons, this model can learn the spatio-temporal relevance of complex continuous information well. Extensive experimental results show MADA-SAN performs better than its counterpart deep actor network on six continuous control tasks from OpenAI gym. Besides, we further validated the proposed MADA-LIF can achieve comparable performance with other state-of-the-art algorithms on MNIST and DVS-gesture recognition tasks.</div></div>","PeriodicalId":19268,"journal":{"name":"Neurocomputing","volume":"614 ","pages":"Article 128819"},"PeriodicalIF":5.5,"publicationDate":"2024-11-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142660667","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Token Embeddings Augmentation benefits Parameter-Efficient Fine-Tuning under long-tailed distribution
Pub Date: 2024-11-08 | DOI: 10.1016/j.neucom.2024.128853
Weiqiu Wang, Zining Chen, Zhicheng Zhao, Fei Su
Pre-trained vision-language models, particularly those built on CLIP, have advanced various visual tasks. Parameter-Efficient Fine-Tuning (PEFT) of such models is a mainstream trend for downstream tasks. Despite these advances, long-tailed distributions still hamper image recognition performance in current PEFT schemes. This paper therefore proposes Token Embeddings Augmentation (TEA) to tackle long-tailed learning under the PEFT paradigm. Based on patch-token semantic mining, TEA uncovers category-specific semantic details within patch tokens to enhance token embeddings, a step named Patch-based Embeddings Augmentation (PEA). A Probability Gate (PG) strategy is then designed to effectively enrich the semantic information of tail categories using the enhanced embeddings. A Token Embeddings Consistency (TEC) loss is further introduced to prioritize category semantic information within tokens. Extensive experiments on multiple long-tailed distribution datasets show that our method improves the performance of various PEFT methods with different classification loss functions, especially for tail categories. Our optimal approach achieves state-of-the-art results on multiple datasets with negligible additional parameters and inference latency, enhancing the practicality of PEFT under long-tailed distributions.
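One plausible reading of the Probability Gate is sketched below, under the assumption that the gating probability grows as a category becomes rarer; the paper's exact rule is not given in the abstract, so every detail here is hypothetical.

```python
import torch

def probability_gate(token_emb: torch.Tensor, aug_emb: torch.Tensor,
                     class_freq: float, freq_max: float) -> torch.Tensor:
    # Rarer (tail) categories get a higher chance of receiving the augmented,
    # semantically enriched embedding; head categories mostly keep the original.
    p_aug = 1.0 - class_freq / freq_max     # assumed frequency-based gating rule
    gate = (torch.rand(()) < p_aug).float()
    return gate * aug_emb + (1.0 - gate) * token_emb
```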
{"title":"Token Embeddings Augmentation benefits Parameter-Efficient Fine-Tuning under long-tailed distribution","authors":"Weiqiu Wang , Zining Chen , Zhicheng Zhao , Fei Su","doi":"10.1016/j.neucom.2024.128853","DOIUrl":"10.1016/j.neucom.2024.128853","url":null,"abstract":"<div><div>Pre-trained vision-language models, particularly those utilizing CLIP, have advanced various visual tasks. Parameter-Efficient Fine-Tuning (PEFT) on such models is a mainstream trend for downstream tasks. Despite advancements, long-tailed distribution still hampers image recognition performance in current PEFT schemes. Therefore, this paper proposes Token Embeddings Augmentation (TEA) to tackle long-tailed learning under PEFT paradigm. Based on patch token semantic mining, TEA uncovers category-specific semantic details within patch tokens to enhance token embeddings, named Patch-based Embeddings Augmentation (PEA). Then, a Probability Gate (PG) strategy is designed to effectively enrich semantic information of tail categories using enhanced embeddings. A Token Embeddings Consistency (TEC) loss is further introduced to prioritize category semantic information within tokens. Extensive experiments on multiple long-tailed distribution datasets show that our method improves the performance of various PEFT methods with different classification loss functions, especially for tail categories. Our optimal approach achieves the state-of-the-art results on multiple datasets with negligible parameters or inference latency, thus enhancing the practicality of PEFT in long-tailed distributions.</div></div>","PeriodicalId":19268,"journal":{"name":"Neurocomputing","volume":"615 ","pages":"Article 128853"},"PeriodicalIF":5.5,"publicationDate":"2024-11-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142652692","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}