Pub Date: 2025-01-21. DOI: 10.1007/s10489-025-06239-1
Yukun Liu, Daming Shi
Few-shot learning has achieved great success in recent years, thanks to its requirement of only a limited number of labeled examples. However, most state-of-the-art few-shot learning techniques employ transfer learning, which still requires massive labeled data for training. To simulate the human learning mechanism, a deep few-shot model is proposed to learn from one, or a few, examples. We first analyze representative semi-supervised few-shot learning methods and note that they suffer from two problems: getting stuck in local optima and prototype bias. To address these challenges, we propose a new semi-supervised few-shot learning method with a convex Kullback-Leibler divergence and critical-descriptor prototypes, hereafter referred to as CKL. Specifically, CKL optimizes the joint probability density via KL divergence, deriving a strictly convex function that facilitates global optimization in semi-supervised clustering. In addition, by incorporating dictionary learning, the critical descriptor facilitates the extraction of more prototypical features, thereby capturing more distinctive feature information and avoiding the prototype bias caused by limited labeled samples. Extensive experiments on three popular benchmark datasets show that this method significantly improves few-shot classification and achieves state-of-the-art performance. In the future, we will explore additional methods that can be integrated with deep learning to further uncover essential features within samples.
Title: A convex Kullback-Leibler divergence and critical-descriptor prototypes for semi-supervised few-shot learning (Applied Intelligence, vol. 55, no. 5)
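As a rough illustration of the semi-supervised prototype idea, the sketch below refines class prototypes with soft assignments of unlabeled features. This is a generic soft k-means-style refinement, not the paper's convex KL objective or critical-descriptor construction, which the abstract does not specify; the temperature parameter and update rule are assumptions.

```python
import numpy as np

def refine_prototypes(support, support_labels, unlabeled, n_classes,
                      n_iters=5, temp=1.0):
    """Illustrative semi-supervised prototype refinement (soft k-means style).

    NOTE: a generic sketch, not the CKL objective from the paper; the
    convex KL-divergence formulation is not reproduced here.
    """
    # Initial prototypes: mean of labeled support features per class.
    protos = np.stack([support[support_labels == c].mean(axis=0)
                       for c in range(n_classes)])
    for _ in range(n_iters):
        # Soft-assign unlabeled points to prototypes by negative squared distance.
        d2 = ((unlabeled[:, None, :] - protos[None, :, :]) ** 2).sum(-1)
        w = np.exp(-d2 / temp)
        w /= w.sum(axis=1, keepdims=True)
        # Update each prototype with labeled sums plus weighted unlabeled points.
        for c in range(n_classes):
            num = support[support_labels == c].sum(axis=0) \
                + (w[:, c:c + 1] * unlabeled).sum(axis=0)
            den = (support_labels == c).sum() + w[:, c].sum()
            protos[c] = num / den
    return protos
```

With well-separated clusters, the refined prototypes stay near their respective cluster centers while absorbing nearby unlabeled points.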
Pub Date: 2025-01-21. DOI: 10.1007/s10489-025-06255-1
Shenglei Pei, Qinghao Han, Zepu Hao, Hong Zhao
Deep subspace clustering networks (DSC-Nets), which combine deep autoencoders and self-expressive modules, have garnered widespread attention due to their outstanding performance. Within these networks, the autoencoder captures the latent representations of data by reconstructing the input data, while the self-expressive layer learns an affinity matrix based on these latent representations. This matrix guides spectral clustering, ultimately completing the clustering task. However, the latent representations learned solely through self-reconstruction by the autoencoder lack discriminative power. The quality of these latent representations directly affects the performance of the affinity matrix, which inevitably limits the clustering performance. To address this issue, we propose learning dissimilar relationships between samples using a classification module, and similar relationships using the self-expressive module. We integrate the information from both modules to construct a graph based on learned similarities, which is then embedded into the autoencoder network. Furthermore, we introduce a pseudo-label supervision module to guide the learning of higher-level similarities in the latent representations, thus achieving more discriminative latent features. Additionally, to enhance the quality of the affinity matrix, we employ an entropy norm constraint to improve connectivity within the subspaces. Experimental results on four public datasets demonstrate that our method achieves superior performance compared to other popular subspace clustering approaches.
Title: Deep subspace clustering via latent representation learning
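The self-expressive idea underlying such networks can be shown in closed form: each sample is reconstructed as a combination of the others, and the coefficients define the affinity matrix passed to spectral clustering. The sketch below uses a linear ridge-regularized stand-in; the actual network learns these coefficients jointly with the autoencoder, which is not reproduced here.

```python
import numpy as np

def self_expressive_affinity(Z, lam=0.1):
    """Closed-form ridge self-expression: min_C ||Z - C Z||^2 + lam ||C||^2.

    A linear stand-in for a learned self-expressive layer; solution is
    C = Z Z^T (Z Z^T + lam I)^{-1} for row-wise samples Z (n x d).
    """
    n = Z.shape[0]
    G = Z @ Z.T
    C = G @ np.linalg.inv(G + lam * np.eye(n))
    np.fill_diagonal(C, 0.0)           # a point should not explain itself
    A = (np.abs(C) + np.abs(C).T) / 2  # symmetric affinity for spectral clustering
    return A
```

For data drawn from independent subspaces, the resulting affinity is (near) block diagonal, which is exactly what spectral clustering exploits.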
Pub Date: 2025-01-21. DOI: 10.1007/s10489-025-06285-9
El houssaine Hssayni, Ali Boufssasse, Nour-Eddine Joudar, Mohamed Ettaouil
Graph Neural Networks (GNNs) have emerged as a powerful tool for analyzing structured data represented as graphs. They offer significant contributions across various domains due to their ability to effectively capture and process complex relational information. However, most existing GNNs still suffer from undesirable phenomena such as non-robustness, overfitting, and over-smoothing. These challenges have raised significant interest among researchers. In this context, this work addresses these issues by proposing a new Dropout variant named A-DropEdge. First, it applies a message-passing layer to preserve connections between nodes and avoid dropping edges directly at the input. Then, the information propagates through multiple branches with different random configurations to enhance the aggregation process. Moreover, consistency regularization is adopted to perform self-supervised learning. Experimental results on three graph datasets, Cora, Citeseer, and PubMed, show the robustness and performance of the proposed approach in mitigating the over-smoothing problem.
Title: Novel dropout approach for mitigating over-smoothing in graph neural networks
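For reference, the baseline edge-dropping operation that A-DropEdge builds on can be sketched in a few lines on a dense adjacency matrix. This is plain DropEdge only; the paper's message-passing pre-layer, multi-branch propagation, and consistency regularization are not shown.

```python
import numpy as np

def drop_edge(adj, p, rng):
    """Vanilla DropEdge: randomly remove a fraction p of undirected edges.

    A baseline sketch only -- A-DropEdge adds a message-passing layer
    before dropping and multiple random branches, omitted here.
    """
    # Keep each upper-triangular edge with probability 1 - p.
    mask = np.triu(rng.random(adj.shape) >= p, k=1)
    # Mirror the mask so the perturbed graph stays undirected.
    kept = adj * (mask | mask.T)
    return kept
```

At training time a fresh mask is drawn per epoch (or per branch), which acts as a data augmentation on the graph structure.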
Pub Date: 2025-01-21. DOI: 10.1007/s10489-025-06234-6
Leilei Yan, Feihong He, Xiaohan Zheng, Li Zhang, Yiqi Zhang, Jiangzhen He, Weidong Du, Yansong Wang, Fanzhang Li
Metric-based few-shot image classification methods generally perform classification by comparing the distances between query sample features and the prototypes of each class. These methods often focus on constructing prototype representations for each class or on learning a metric, while neglecting the significance of the feature space itself. In this paper, we redirect the focus to feature space construction, with the goal of building a discriminative feature space for few-shot image classification. To this end, we design a contrastive prototype loss that incorporates the distribution of query samples with respect to class prototypes in the feature space, emphasizing intra-class compactness and inter-class separability and thereby guiding the model to learn a more discriminative feature space. Based on this loss, we propose a contrastive prototype loss based discriminative feature network (CPL-DFNet) for few-shot image classification. CPL-DFNet enhances sample utilization by fully leveraging the distance relationships between query samples and class prototypes, creating more favorable conditions for few-shot classification and significantly improving performance. We conducted extensive experiments on both general and fine-grained few-shot image classification benchmarks to validate the effectiveness of the proposed CPL-DFNet method. The experimental results show that CPL-DFNet performs few-shot image classification effectively and outperforms many existing methods across various task scenarios, demonstrating significant performance advantages.
Title: Contrastive prototype loss based discriminative feature network for few-shot learning
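The basic distance-to-prototype classification objective that such losses extend can be written as a cross-entropy over negative squared distances (the ProtoNet formulation). The exact CPL-DFNet loss is not given in the abstract; this shows only the standard baseline it builds on.

```python
import numpy as np

def prototype_softmax_loss(queries, labels, protos, temp=1.0):
    """Cross-entropy over negative query-prototype distances (ProtoNet-style).

    Pulls each query toward its class prototype and away from the others --
    the intra-class compactness / inter-class separability idea; the exact
    CPL-DFNet contrastive prototype loss is not reproduced here.
    """
    # Squared Euclidean distance from every query to every prototype.
    d2 = ((queries[:, None, :] - protos[None, :, :]) ** 2).sum(-1)
    logits = -d2 / temp
    # Negative log-likelihood of the true class under the softmax.
    logz = np.log(np.exp(logits).sum(axis=1))
    nll = logz - logits[np.arange(len(labels)), labels]
    return nll.mean()
```

A query sitting on its own prototype incurs near-zero loss; labeling it with a distant prototype makes the loss large, which is the gradient signal that shapes the feature space.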
Pub Date: 2025-01-20. DOI: 10.1007/s10489-024-06037-1
Yuan Chao, Huaiyang Zhu, Hengyu Lu
Multi-object tracking aims to estimate object bounding boxes and identities in videos. Most tracking methods combine a detector with a Kalman filter, using the IoU distance as a similarity metric to associate previous trajectories with current detection boxes. These methods usually suffer from ID switches and fragmented trajectories in crowded and frequently occluded scenarios. To solve this problem, a simple and effective association method is proposed in this study. First, a bottom-edge cost matrix is introduced to exploit depth information, improving data association and increasing robustness under occlusion. Second, an asymmetric trajectory classification mechanism is proposed to distinguish false-positive trajectories, and an activated trajectory matching strategy is introduced to reduce the interference of noise and transient objects in tracking. Finally, the trajectory deletion strategy is improved by tracking the number of trajectory state switches, so that trajectories caused by spurious high-scoring detection boxes are deleted in real time, which also reduces the number of fragmented trajectories. These innovations achieve excellent performance on various benchmarks, including MOT17, MOT20, and especially DanceTrack, where interactions and occlusions are frequent and severe. The code and models are available at https://github.com/djdodsjsjx/BAM-SORT/.
Title: BAM-SORT: border-guided activated matching for online multi-object tracking
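One plausible reading of a bottom-edge cost is the distance between the bottom-edge midpoints of track and detection boxes, since the bottom edge is a rough ground-plane depth cue (lower in the image usually means closer to the camera). The exact BAM-SORT cost definition may differ; the function below is an illustrative assumption.

```python
import numpy as np

def bottom_edge_cost(tracks, dets):
    """Pairwise cost from bottom-edge midpoints of (x1, y1, x2, y2) boxes.

    Illustrative interpretation of a bottom-edge cue only; the paper's
    cost matrix construction is not reproduced from the abstract.
    """
    def bottom_mid(boxes):
        # Midpoint of the bottom edge: (center-x, y2).
        return np.stack([(boxes[:, 0] + boxes[:, 2]) / 2, boxes[:, 3]], axis=1)

    t, d = bottom_mid(tracks), bottom_mid(dets)
    # Euclidean distance between every track/detection pair.
    return np.linalg.norm(t[:, None, :] - d[None, :, :], axis=-1)
```

Such a matrix would be fed to a Hungarian-style assignment step alongside (or instead of) the IoU cost.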
Pub Date: 2025-01-20. DOI: 10.1007/s10489-025-06242-6
Ziqiang Wei, Deng Chen, Yanduo Zhang, Dawei Wen, Xin Nie, Liang Xie
In the field of artificial intelligence, the portfolio management problem has received widespread attention. Portfolio models based on deep reinforcement learning enable intelligent investment decision-making. However, most models only consider modeling the temporal information of stocks, neglecting the correlation between stocks and the impact of overall market risk. Moreover, their trading strategies are often singular and fail to adapt to dynamic changes in the trading market. To address these issues, this paper proposes a Deep Reinforcement Learning Portfolio Model based on a Mixture of Experts (MoEDRLPM). First, a spatio-temporal adaptive embedding matrix is designed, and temporal and spatial self-attention mechanisms are employed to extract the temporal information and correlations of stocks. Second, a router dynamically selects the current optimal expert from the mixed expert pool; the selected expert makes decisions, which are aggregated to derive the portfolio weights. Next, market index data are utilized to model current market risk and determine investment capital ratios. Finally, deep reinforcement learning is employed to optimize the portfolio strategy. This approach generates diverse trading strategies in response to dynamic changes in the market environment. The proposed model is tested on the SSE50 and CSI300 datasets. Results show that the total returns of this model increase by 12% and 8%, respectively, while the Sharpe Ratios improve by 64% and 51%.
Title: Deep reinforcement learning portfolio model based on mixture of experts
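The router-selects-expert step can be sketched as top-1 mixture-of-experts routing: the router scores the experts on the current state, and the winning expert emits asset logits that are normalized into portfolio weights. The linear gate, shapes, and top-1 choice are illustrative assumptions, not the MoEDRLPM architecture itself.

```python
import numpy as np

def route_top1(state, gate_w, experts):
    """Top-1 MoE routing sketch: pick the best-scoring expert, then turn
    its asset logits into portfolio weights via softmax.

    Illustrative only -- the paper's router, expert networks, and
    aggregation scheme are assumptions here.
    """
    scores = state @ gate_w           # router logits, one per expert
    k = int(np.argmax(scores))        # select the current optimal expert
    logits = experts[k](state)        # expert produces per-asset logits
    w = np.exp(logits - logits.max()) # numerically stable softmax
    return k, w / w.sum()             # portfolio weights sum to 1
```

A soft-routing variant would instead mix all experts' outputs weighted by `softmax(scores)`, trading sparsity for smoother gradients.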
Pub Date: 2025-01-20. DOI: 10.1007/s10489-025-06264-0
Junchao Ren, Qiao Zhang, Bingbing Kang, Yuxi Zhong, Min He, Yanliang Ge, Hongbo Bi
Camouflaged object detection (COD) aims to detect objects that blend in with their surroundings and is a challenging task in computer vision. High-level semantic information and low-level spatial information play important roles in localizing camouflaged objects and reinforcing spatial cues. However, current COD methods directly connect high-level features with low-level features, ignoring the importance of the respective features. In this paper, we design a Semantic-spatial guided Context Propagation Network (SCPNet) to efficiently mine semantic and spatial features while enhancing their feature representations. First, we design a twin positioning module (TPM) that explores semantic cues to accurately locate camouflaged objects. Afterward, we introduce a spatial awareness module (SAM) to deeply mine spatial cues in shallow features. Finally, we develop a context propagation module (CPM) to assign semantic and spatial cues to multi-level features and enhance their feature representations. Experimental results show that our SCPNet outperforms state-of-the-art methods on three challenging datasets. Codes will be made available at https://github.com/RJC0608/SCPNet.
Title: Semantic-spatial guided context propagation network for camouflaged object detection
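The general guidance idea, high-level semantic features steering low-level spatial features rather than being naively concatenated, can be illustrated with a toy gated fusion. This is a generic encoder-decoder sketch, not SCPNet's TPM/SAM/CPM modules.

```python
import numpy as np

def guided_fusion(high_sem, low_spat):
    """Toy semantic-spatial guidance: upsample high-level features and use
    them as a sigmoid gate on low-level spatial features before merging.

    A generic fusion sketch under assumed 2x resolution gap; not the
    paper's modules.
    """
    # Nearest-neighbor 2x upsampling of the coarse semantic map.
    up = np.repeat(np.repeat(high_sem, 2, axis=0), 2, axis=1)
    gate = 1.0 / (1.0 + np.exp(-up))   # sigmoid guidance map in (0, 1)
    return gate * low_spat + up        # gated spatial detail plus semantics
```

Compared with plain concatenation, the gate lets semantically confident regions pass more spatial detail through, suppressing background clutter.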
Pub Date: 2025-01-20. DOI: 10.1007/s10489-024-06187-2
Khalid Alattas, Qun Wu
Title: Retraction Note: A framework to evaluate the barriers for adopting the internet of medical things using the extended generalized TODIM method under the hesitant fuzzy environment
Pub Date: 2025-01-20. DOI: 10.1007/s10489-024-06214-2
Fengzhen Sun, Weidong Jin
This paper addresses predictive learning, i.e., generating future frames given previous images. Because of the vanishing gradient problem, existing methods based on RNNs and CNNs cannot capture long-term dependencies effectively. To overcome this dilemma, we present MastNet, a spatiotemporal framework for long-term predictive learning. We design a Transformer-based encoder-decoder with a hierarchical structure. In the transformer block, we adopt spatiotemporal window-based self-attention to reduce computational complexity, together with a spatiotemporal shifted-window partitioning approach. More importantly, we build a spatiotemporal autoencoder with a random clip mask strategy, which leads to better mining of temporal dependencies and spatial correlations. Furthermore, we insert an auxiliary prediction head that helps our model generate higher-quality frames. Experimental results show that the proposed MastNet achieves the best accuracy and long-term prediction on two spatiotemporal datasets compared with state-of-the-art models.
Title: A masked autoencoder network for spatiotemporal predictive learning
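The masking side of such an autoencoder can be sketched as zeroing out a random subset of (t, h, w) patches in a clip. This is a generic masked-autoencoder corruption; MastNet's actual patch size, mask ratio, and sampling scheme are assumptions here.

```python
import numpy as np

def random_clip_mask(video, patch, ratio, rng):
    """Zero out a random fraction of (t, h, w) patches in a video clip.

    Generic masked-autoencoder corruption sketch; the paper's random
    clip mask details are not reproduced.
    """
    T, H, W = video.shape
    pt, ph, pw = patch
    # Number of non-overlapping patches along each axis.
    nh, nw = H // ph, W // pw
    grid = (T // pt) * nh * nw
    n_mask = int(grid * ratio)
    ids = rng.choice(grid, size=n_mask, replace=False)
    out = video.copy()
    for i in ids:
        t, r = divmod(int(i), nh * nw)   # decode flat index to (t, h, w)
        h, w = divmod(r, nw)
        out[t * pt:(t + 1) * pt, h * ph:(h + 1) * ph, w * pw:(w + 1) * pw] = 0.0
    return out
```

The reconstruction target is the original clip; the encoder only ever sees the corrupted version, which forces it to model temporal and spatial correlations.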
Pub Date: 2025-01-20. DOI: 10.1007/s10489-025-06235-5
Krutika Verma, Abyayananda Maiti
Deep learning networks are commonly trained with first-order methods. These methods often converge more quickly when combined with an adaptive step size, but they tend to settle at suboptimal points, especially when learning occurs in a large output space. When first-order methods are used with a constant step size, they oscillate near the zero-gradient region, which leads to slow convergence. These issues are exacerbated under nonconvexity, which can significantly diminish the performance of first-order methods. In this work, we propose a novel Boltzmann Probability Weighted Sine with Cosine distance-based Adaptive Gradient (BSCAGrad) method. The step size in this method is carefully designed to mitigate slow convergence. Furthermore, it facilitates escape from suboptimal points, enabling the optimization process to progress more efficiently toward local minima. This is achieved by combining a Boltzmann probability-weighted sine function with a cosine distance to calculate the step size. The Boltzmann probability-weighted sine function acts when the gradient vanishes and the cooling parameter remains moderate, a condition typically observed near suboptimal points. Moreover, applying the sine function to the exponential moving average of the weight parameters leverages geometric information from the data. The cosine distance prevents the step size from reaching zero. Together, these components accelerate convergence, improve stability, and guide the algorithm toward a better optimum. A theoretical analysis of the convergence rate under both convexity and nonconvexity is provided to substantiate the findings. Experimental results from language modeling, object detection, machine translation, and image classification tasks on real-world benchmark datasets, including CIFAR-10, CIFAR-100, Penn Treebank, PASCAL VOC, and WMT 2014, demonstrate that the proposed step size outperforms traditional baseline methods.
Title: Sine and cosine based learning rate for gradient descent method
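The abstract names the ingredients but not the formula, so the sketch below only illustrates how the pieces might combine: a Boltzmann weight that grows as the gradient vanishes, a sine of the weight EMA norm, and a cosine distance floor that keeps the step size away from zero. Every combination rule here is an assumption for illustration, not the BSCAGrad update.

```python
import numpy as np

def bsca_like_step(lr, grad, ema_w, w, temp=1.0, eps=1e-8):
    """Illustrative step-size modulation in the spirit of BSCAGrad.

    ASSUMED structure: exp(-|g|/temp) scales sin(|ema_w|) so the step
    stays lively where the gradient vanishes, and the cosine distance
    between w and its EMA keeps the step bounded away from zero.
    """
    g = np.linalg.norm(grad)
    boltz = np.exp(-g / temp)                     # large near flat regions
    sine = np.abs(np.sin(np.linalg.norm(ema_w)))  # geometry of averaged weights
    cos_sim = (w @ ema_w) / (np.linalg.norm(w) * np.linalg.norm(ema_w) + eps)
    cos_dist = 1.0 - cos_sim                      # zero only when w aligns with its EMA
    step = lr * (boltz * sine + cos_dist + eps)
    return w - step * grad
```

The point of the sketch is the qualitative behavior: at a vanishing gradient the Boltzmann factor is maximal, so the sine term (rather than the gradient magnitude) governs the effective step, which is the claimed escape mechanism.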