
Latest Publications in Pattern Recognition

Scene-enhanced multi-scale temporal aware network for video moment retrieval
IF 7.5 · CAS Tier 1 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2025-04-09 · DOI: 10.1016/j.patcog.2025.111642
Di Wang, Yousheng Yu, Shaofeng Li, Haodi Zhong, Xiao Liang, Lin Zhao
Video moment retrieval aims to locate the target moment in an untrimmed video using a natural language query. Current methods for moment retrieval are typically tailored to scenarios where the temporal localization information is simple. However, these methods overlook videos that contain complex localization information, making it difficult to achieve precise retrieval across videos that encompass both complex and simple temporal localization information. To address this limitation, we propose a novel Scene-enhanced Multi-scale Temporal Aware Network (SMTAN) designed to adaptively extract the different temporal localization information in different videos. Our method comprehensively processes video moments across fine-grained multiple scales and uses prior knowledge of the scene to enhance localization information. This facilitates the construction of multi-scale temporal feature maps, enabling the extraction of both complex and simple temporal localization information in different videos. Extensive experiments on two benchmark datasets demonstrate that our proposed network surpasses state-of-the-art methods and achieves more accurate retrieval of different localization information across videos.
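For intuition, the sketch below builds moment representations at several temporal scales by pooling clip features over sliding windows of different lengths. It is a hedged toy illustration of the multi-scale idea, not the authors' implementation; the function name, scale set, and pooling choice are our assumptions.

```python
# Toy sketch of multi-scale temporal feature maps (illustrative assumptions,
# not the SMTAN code): clip features are pooled over candidate moments of
# several window lengths, so short (simple) and long (complex) moments each
# get a representation.
import torch
import torch.nn.functional as F

def multi_scale_moment_maps(clip_feats: torch.Tensor, scales=(1, 2, 4, 8)):
    """clip_feats: (T, D) features of T uniformly sampled clips."""
    x = clip_feats.t().unsqueeze(0)                        # (1, D, T)
    maps = {}
    for s in scales:
        # average-pool every length-s window (stride 1): one feature
        # per candidate moment [i, i+s)
        pooled = F.avg_pool1d(x, kernel_size=s, stride=1)  # (1, D, T-s+1)
        maps[s] = pooled.squeeze(0).t()                    # (T-s+1, D)
    return maps

feats = torch.randn(64, 256)                               # 64 clips, 256-D
for s, m in multi_scale_moment_maps(feats).items():
    print(f"scale {s}: {tuple(m.shape)}")
```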
{"title":"Scene-enhanced multi-scale temporal aware network for video moment retrieval","authors":"Di Wang ,&nbsp;Yousheng Yu ,&nbsp;Shaofeng Li ,&nbsp;Haodi Zhong ,&nbsp;Xiao Liang ,&nbsp;Lin Zhao","doi":"10.1016/j.patcog.2025.111642","DOIUrl":"10.1016/j.patcog.2025.111642","url":null,"abstract":"<div><div>Video moment retrieval aims to locate the target moment in an untrimmed video using a natural language query. Current methods to moment retrieval are typically tailored for scenarios where temporal localization information is often simple. Nevertheless, these methods overlook the scenarios where a video includes complex localization information, which makes it difficult to achieve precise retrieval across videos that encompass both complex and simple temporal localization information. To address this limitation, we propose a novel Scene-enhanced Multi-scale Temporal Aware Network (SMTAN) designed to adaptively extract different temporal localization information in different videos. Our method involves the comprehensive processing of video moments across fine-grained multiply scales and uses a prior knowledge of the scene for localization information enhancement. This method facilitates the construction of multi-scale temporal feature maps, enabling extraction of both complex and simple temporal localization information in different videos. Extensive experiments on two benchmark datasets demonstrate that our proposed network surpasses the state-of-the-art methods and achieves more accurate retrieval of different localization information across videos.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"165 ","pages":"Article 111642"},"PeriodicalIF":7.5,"publicationDate":"2025-04-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143807156","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Beyond mask: Rethinking guidance types in few-shot segmentation
IF 7.5 · CAS Tier 1 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2025-04-07 · DOI: 10.1016/j.patcog.2025.111635
Shijie Chang, Youwei Pang, Xiaoqi Zhao, Huchuan Lu, Lihe Zhang
Existing few-shot segmentation (FSS) methods mainly focus on prototype feature generation and the query-support matching mechanism. As a crucial prompt for generating prototype features, the image-mask pair in the support set has become the default setting. However, various guidance types such as image, text, box, and mask can all provide valuable information about an object's context, class, localization, and shape appearance. Existing work focuses on specific combinations of guidance, splitting FSS into separate research branches. Rethinking guidance types in FSS is expected to yield an efficient joint representation of the coupling between the support set and query set, giving rise to research on weakly or strongly annotated guidance that meets the customized requirements of practical users. In this work, we provide generalized FSS with seven guidance paradigms and develop a universal vision–language framework (UniFSS) to integrate prompts from text, mask, box, and image. Leveraging the advantages of large-scale pre-trained vision–language models in textual and visual embeddings, UniFSS proposes high-level spatial correction and embedding interactive units to overcome the semantic ambiguity typically encountered by pure visual matching methods in the face of intra-class appearance diversity. Extensive experiments show that UniFSS significantly outperforms state-of-the-art methods. Notably, the weakly annotated class-aware box paradigm even surpasses the finely annotated mask paradigm.
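As a hedged illustration of integrating heterogeneous guidance, the toy module below encodes whatever subset of text, image, box, and mask prompts is available into a shared token space. Every name, dimension, and encoder choice here is an assumption of ours, not the UniFSS architecture.

```python
# Toy unified prompt encoder (assumed design, not UniFSS): each guidance
# type gets its own projection into a common embedding space, and the
# available prompts are stacked as tokens for downstream matching.
import torch
import torch.nn as nn

class UnifiedPromptEncoder(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.text_proj = nn.Linear(512, dim)    # e.g. a text embedding -> dim
        self.image_proj = nn.Linear(768, dim)   # e.g. a ViT [CLS] feature -> dim
        self.box_proj = nn.Linear(4, dim)       # normalized (x1, y1, x2, y2)
        self.mask_conv = nn.Conv2d(1, dim, kernel_size=16, stride=16)

    def forward(self, text=None, image=None, box=None, mask=None):
        tokens = []
        if text is not None:
            tokens.append(self.text_proj(text))
        if image is not None:
            tokens.append(self.image_proj(image))
        if box is not None:
            tokens.append(self.box_proj(box))
        if mask is not None:
            m = self.mask_conv(mask)                 # (B, dim, h, w)
            tokens.append(m.flatten(2).mean(-1))     # pool to one token
        assert tokens, "at least one guidance type must be given"
        return torch.stack(tokens, dim=1)            # (B, num_prompts, dim)
```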
{"title":"Beyond mask: Rethinking guidance types in few-shot segmentation","authors":"Shijie Chang,&nbsp;Youwei Pang,&nbsp;Xiaoqi Zhao,&nbsp;Huchuan Lu,&nbsp;Lihe Zhang","doi":"10.1016/j.patcog.2025.111635","DOIUrl":"10.1016/j.patcog.2025.111635","url":null,"abstract":"<div><div>Existing few-shot segmentation (FSS) methods mainly focus on prototype feature generation and the query-support matching mechanism. As a crucial prompt for generating prototype features, the pair of image-mask types in the support set has become the default setting. However, various types such as image, text, box, and mask all can provide valuable information regarding the objects in context, class, localization, and shape appearance. Existing work focuses on specific combinations of guidance, leading FSS into different research branches. Rethinking guidance types in FSS is expected to explore the efficient joint representation of the coupling between the support set and query set, giving rise to research trends in the weakly or strongly annotated guidance to meet the customized requirements of practical users. In this work, we provide the generalized FSS with seven guidance paradigms and develop a universal vision–language framework (UniFSS) to integrate prompts from text, mask, box, and image. Leveraging the advantages of large-scale pre-training vision–language models in textual and visual embeddings, UniFSS proposes high-level spatial correction and embedding interactive units to overcome the semantic ambiguity drawbacks typically encountered by pure visual matching methods when facing intra-class appearance diversities. Extensive experiments show that UniFSS significantly outperforms the state-of-the-art methods. Notably, the weakly annotated class-aware box paradigm even surpasses the finely annotated mask paradigm.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"165 ","pages":"Article 111635"},"PeriodicalIF":7.5,"publicationDate":"2025-04-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143799054","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Scaled robust linear embedding with adaptive neighbors preserving
IF 7.5 · CAS Tier 1 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2025-04-06 · DOI: 10.1016/j.patcog.2025.111625
Yunlong Gao, Qinting Wu, Xinjing Wang, Tingting Lin, Jinyan Pan, Chao Cao, Guifang Shao, Qingyuan Zhu, Feiping Nie
Manifold learning studies the invariance of geometry under continuous deformation. In recent years, feature space learning methods have usually extracted and preserved the essential structure of manifold data by preserving the affinity relationship between sample points in the embedded space, namely, the invariant property. However, in this paper, we find that considering only the affinity relationship cannot effectively extract and preserve the essential structure of data in the embedded space. Additionally, to solve the out-of-sample problem, manifold learning uses linear embedding instead of nonlinear embedding to preserve the manifold structure of data. However, linear embedding assumes that manifold data form a globally linear manifold; thus the coupling of different local regions and the diversity in the spatial scales of different regions further distort the original data and impair the efficiency of linear embedding in preserving the essential structure of data. To solve this problem, this paper proposes scaled robust linear embedding with adaptive neighbors preserving (SLE), which introduces adaptive weighting based on local statistical characteristics to achieve flexible embedding for manifold data, where the adaptive weights can be regarded as the elastic deformation coefficients of local manifold structures. Due to the adaptive elastic deformation, SLE reduces the gap between nonlinear and linear embedding, thus improving the ability of linear embedding to preserve the essential structure of data. Moreover, SLE integrates the learning of elastic deformation coefficients, similarity learning, and subspace learning into a unified framework; therefore, the joint optimality of these three variables is guaranteed. An efficient alternating optimization algorithm is proposed to solve the challenging optimization problem, and a theoretical analysis of its computational complexity and convergence is also provided. Finally, SLE is extensively evaluated on both artificial and real-world datasets and compared with current state-of-the-art algorithms. The experimental results indicate that SLE has a strong ability to uncover and preserve the essential structure of data in a linear embedding space.
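For orientation, a generic adaptive-neighbors linear-embedding objective of the kind SLE builds on can be written as follows (an illustrative template; the paper's actual model adds the scaled robust terms):

$$\min_{W,\,S,\,\theta}\;\sum_{i,j}\theta_i\,s_{ij}\,\bigl\|W^{\top}x_i - W^{\top}x_j\bigr\|_2^2 \;+\;\gamma\,\|S\|_F^2 \qquad \text{s.t.}\quad W^{\top}W=I,\;\;\textstyle\sum_j s_{ij}=1,\;\;s_{ij}\ge 0,$$

where $W$ spans the linear subspace, $S$ is the adaptively learned neighbor similarity, and the weights $\theta_i$ play the role of the local elastic deformation coefficients; alternating over the three variables mirrors the unified framework described above.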
{"title":"Scaled robust linear embedding with adaptive neighbors preserving","authors":"Yunlong Gao ,&nbsp;Qinting Wu ,&nbsp;Xinjing Wang ,&nbsp;Tingting Lin ,&nbsp;Jinyan Pan ,&nbsp;Chao Cao ,&nbsp;Guifang Shao ,&nbsp;Qingyuan Zhu ,&nbsp;Feiping Nie","doi":"10.1016/j.patcog.2025.111625","DOIUrl":"10.1016/j.patcog.2025.111625","url":null,"abstract":"<div><div>Manifold learning studies the invariability of geometry in continuous deformation. In recent years, feature space learning methods usually extract and preserve the essential structure of manifold data by preserving the affinity relationship between sample points in the embedded space, namely, the invariant property. However, in this paper, we find that only considering the affinity relationship cannot effectively extract and preserve the essential structure of data in the embedded space. Additionally, to solve the out-of-samples problem, manifold learning uses linear embedding instead of nonlinear embedding to preserve the manifold structure of data. However, linear embedding assumes that manifold data are global linear manifolds, thus the coupling of different local regions and the diversity in the spatial scales of different regions will further distort the original data and impair the efficiency of linear embedding for preserving the essential structure of data. To solve this problem, this paper proposes scaled robust linear embedding with adaptive neighbors preserving (SLE), which introduces the adaptive weighting based on local statistical characteristics to achieve flexible embedding for manifold data, where the adaptive weights can be regarded as the elastic deformation coefficients of local manifold structures of data. Due to the adaptive elastic deformation, SLE can reduces the gap between nonlinear embedding and linear embedding, thus improving the ability of linear embedding to preserve the essential structure of data. Moreover, SLE integrates the learning of elastic deformation coefficients, similarity learning, and subspace learning into a unified framework, therefore, the combination optimality of these three variables is guaranteed. An efficient alternative optimization algorithm is proposed to solve the challenging optimization problem, the theoretical analysis of its computational complexity and convergence is also performed in this paper. Eventually, SLE has been extensively experimented on both artificial and real-world datasets and compared with current state-of-the-art algorithms. The experimental results indicate that SLE has a strong ability in uncovering and preserving the essential structure of data in linear embedding space.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"165 ","pages":"Article 111625"},"PeriodicalIF":7.5,"publicationDate":"2025-04-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143807264","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
FaTNET: Feature-alignment transformer network for human pose transfer
IF 7.5 · CAS Tier 1 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2025-04-05 · DOI: 10.1016/j.patcog.2025.111626
Yu Luo, Chengzhi Yuan, Lin Gao, Weiwei Xu, Xiaosong Yang, Pengjie Wang
Pose-guided person image generation involves converting an image of a person from a source pose to a target pose. This task presents significant challenges due to extensive variability and occlusion. Existing methods rely heavily on CNN-based architectures, which are constrained by their local receptive fields and often struggle to preserve details of style and shape. To address this problem, we propose a novel transformer-based framework for human pose transfer, which can exploit global dependencies while keeping local features. The proposed framework consists of a transformer encoder, a feature alignment network, and a transformer synthesis network, enabling the generation of realistic person images with desired poses. The core idea of our framework is to obtain a novel prior image aligned with the target image through the feature alignment network in the embedded and disentangled feature space, and then synthesize the final fine image through the transformer synthesis network by recurrently warping the result of the previous stage with the correlation matrix between aligned features and source images. In contrast to previous convolutional and non-local methods, ours employs a global receptive field while preserving detail features. The results of qualitative and quantitative experiments demonstrate the superiority of our model in human pose transfer.
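A minimal sketch of the correlation-matrix warping step follows; it is our hedged reading of the mechanism, with flattened feature maps and names that are assumptions, not the released code.

```python
# Toy correlation-matrix warping (illustrative, not FaTNET itself): aligned
# (prior) features act as queries, source features as keys/values, and a
# row-softmaxed correlation matrix resamples the source toward the target pose.
import torch
import torch.nn.functional as F

def correlation_warp(aligned: torch.Tensor, source: torch.Tensor):
    """aligned: (B, N, D) features aligned to the target pose;
       source:  (B, M, D) flattened source-image features."""
    corr = torch.einsum("bnd,bmd->bnm", aligned, source)  # correlation matrix
    corr = corr / source.shape[-1] ** 0.5                 # scale for stability
    attn = F.softmax(corr, dim=-1)                        # each row sums to 1
    return torch.einsum("bnm,bmd->bnd", attn, source)     # warped features
```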
{"title":"FaTNET: Feature-alignment transformer network for human pose transfer","authors":"Yu Luo ,&nbsp;Chengzhi Yuan ,&nbsp;Lin Gao ,&nbsp;Weiwei Xu ,&nbsp;Xiaosong Yang ,&nbsp;Pengjie Wang","doi":"10.1016/j.patcog.2025.111626","DOIUrl":"10.1016/j.patcog.2025.111626","url":null,"abstract":"<div><div>Pose-guided person image generation involves converting an image of a person from a source pose to a target pose. This task presents significant challenges due to the extensive variability and occlusion. Existing methods heavily rely on CNN-based architectures, which are constrained by their local receptive fields and often struggle to preserve the details of style and shape. To address this problem, we propose a novel framework for human pose transfer with transformers, which can employ global dependencies and keep local features as well. The proposed framework consists of transformer encoder, feature alignment network and transformer synthetic network, enabling the generation of realistic person images with desired poses. The core idea of our framework is to obtain a novel prior image aligned with the target image through the feature alignment network in the embedded and disentangled feature space, and then synthesize the final fine image through the transformer synthetic network by recurrently warping the result of previous stage with the correlation matrix between aligned features and source images. In contrast to previous convolution and non-local methods, ours can employ the global receptive field and preserve detail features as well. The results of qualitative and quantitative experiments demonstrate the superiority of our model in human pose transfer.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"165 ","pages":"Article 111626"},"PeriodicalIF":7.5,"publicationDate":"2025-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143792212","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Generative compositor for few-shot visual information extraction
IF 7.5 · CAS Tier 1 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2025-04-05 · DOI: 10.1016/j.patcog.2025.111624
Zhibo Yang, Wei Hua, Sibo Song, Cong Yao, Yingying Zhu, Wenqing Cheng, Xiang Bai
Visual Information Extraction (VIE), aiming at extracting structured information from visually rich document images, plays a pivotal role in document processing. Considering various layouts, semantic scopes, and languages, VIE encompasses an extensive range of types, potentially numbering in the thousands. However, many of these types suffer from a lack of training data, which poses significant challenges. In this paper, we propose a novel generative model, named Generative Compositor, to address the challenge of few-shot VIE. The Generative Compositor is a hybrid pointer-generator network that emulates the operations of a compositor by retrieving words from the source text and assembling them based on the provided prompts. Furthermore, three pre-training strategies are employed to enhance the model's perception of spatial context information. In addition, a prompt-aware resampler is designed to enable efficient matching by leveraging the entity-semantic prior contained in prompts. The prompt-based retrieval mechanism and the pre-training strategies enable the model to acquire more effective spatial and semantic clues from limited training samples. Experiments demonstrate that the proposed method achieves highly competitive results in full-sample training, while notably outperforming the baseline in the 1-shot, 5-shot, and 10-shot settings.
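The hybrid pointer-generator mechanism can be summarized by the standard copy/generate mixture (a textbook form; the paper's exact heads and conditioning may differ):

$$P(w) \;=\; p_{\text{gen}}\,P_{\text{vocab}}(w)\;+\;\bigl(1-p_{\text{gen}}\bigr)\sum_{i\,:\,w_i=w} a_i,$$

where $a_i$ is the attention weight on the $i$-th source-document token and the learned gate $p_{\text{gen}}\in[0,1]$ trades off generating a word from the vocabulary against copying (retrieving) it from the source text, which is what lets rare entity strings be reproduced verbatim from few examples.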
{"title":"Generative compositor for few-shot visual information extraction","authors":"Zhibo Yang ,&nbsp;Wei Hua ,&nbsp;Sibo Song ,&nbsp;Cong Yao ,&nbsp;Yingying Zhu ,&nbsp;Wenqing Cheng ,&nbsp;Xiang Bai","doi":"10.1016/j.patcog.2025.111624","DOIUrl":"10.1016/j.patcog.2025.111624","url":null,"abstract":"<div><div>Visual Information Extraction (VIE), aiming at extracting structured information from visually rich document images, plays a pivotal role in document processing. Considering various layouts, semantic scopes, and languages, VIE encompasses an extensive range of types, potentially numbering in the thousands. However, many of these types suffer from a lack of training data, which poses significant challenges. In this paper, we propose a novel generative model, named Generative Compositor, to address the challenge of few-shot VIE. The Generative Compositor is a hybrid pointer-generator network that emulates the operations of a compositor by retrieving words from the source text and assembling them based on the provided prompts. Furthermore, three pre-training strategies are employed to enhance the model’s perception of spatial context information. Besides, a prompt-aware resampler is specially designed to enable efficient matching by leveraging the entity-semantic prior contained in prompts. The introduction of the prompt-based retrieval mechanism and the pre-training strategies enable the model to acquire more effective spatial and semantic clues with limited training samples. Experiments demonstrate that the proposed method achieves highly competitive results in the full-sample training, while notably outperforms the baseline in the 1-shot, 5-shot, and 10-shot settings.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"165 ","pages":"Article 111624"},"PeriodicalIF":7.5,"publicationDate":"2025-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143792211","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Infrared small target detection based on hypergraph and asymmetric penalty function
IF 7.5 · CAS Tier 1 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2025-04-05 · DOI: 10.1016/j.patcog.2025.111634
Yuan Luo, Xiaorun Li, Shuhan Chen
Recently, the infrared (IR) small target detection problem has attracted increasing attention. Component analysis-based techniques have been widely utilized, but they face challenges in low-rank background estimation, sparse target estimation, and model construction. In this paper, an IR small target detection model with hypergraph Laplacian regularization and asymmetric penalty function-based regularization (HGLAPR) is proposed. Specifically, a spatial–temporal tensor is constructed. Then, we construct a hypergraph structure and design a hypergraph Laplacian regularization as well as a Laplace-based tensor nuclear norm for low-rank background estimation. Additionally, an asymmetric penalty function-based sparsity regularization is introduced for more accurate target estimation. To efficiently solve this model, we design an alternating direction method of multipliers (ADMM)-based optimization scheme. Extensive experiments conducted on six real IR sequences with complex scenarios illustrate the superiority of HGLAPR over ten state-of-the-art competitive methods in terms of target detectability, background suppressibility, and overall performance.
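Methods in this family start from a low-rank-plus-sparse decomposition of the spatial–temporal tensor (a generic template, not the paper's exact model, which substitutes the Laplace-based tensor nuclear norm, the hypergraph Laplacian regularizer, and the asymmetric penalty):

$$\min_{\mathcal{B},\,\mathcal{T}}\;\|\mathcal{B}\|_{*}\;+\;\lambda\,\|\mathcal{T}\|_{1}\qquad\text{s.t.}\quad \mathcal{D}=\mathcal{B}+\mathcal{T},$$

where $\mathcal{D}$ is the observed tensor, $\mathcal{B}$ the low-rank background, and $\mathcal{T}$ the sparse target component; the linear equality constraint is what makes ADMM a natural solver.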
{"title":"Infrared small target detection based on hypergraph and asymmetric penalty function","authors":"Yuan Luo,&nbsp;Xiaorun Li,&nbsp;Shuhan Chen","doi":"10.1016/j.patcog.2025.111634","DOIUrl":"10.1016/j.patcog.2025.111634","url":null,"abstract":"<div><div>Recently, infrared (IR) small target detection problem has attracted increasing attention. Component analysis-based techniques have been widely utilized, while they are faced with challenges such as low-rank background and sparse target estimation, and model construction. In this paper, an IR small target detection model with hypergraph Laplacian regularization and asymmetric penalty function-based regularization (HGLAPR) is proposed. Specifically, a spatial–temporal tensor is constructed. Then, we construct a hypergraph structure and design a hypergraph Laplacian regularization as well as a Laplace-based tensor nuclear norm for low-rank background estimation. Additionally, an asymmetric penalty function-based sparsity regularization is introduced for more accurate target estimation. To efficiently solve this model, we design an alternating direction method of multipliers (ADMM)-based optimization scheme. Extensive experiments conducted on six real IR sequences with complex scenarios illustrate the superiority of HGLAPR over ten state-of-the-art competitive methods in terms of target detectability, background suppressibility and overall performance.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"165 ","pages":"Article 111634"},"PeriodicalIF":7.5,"publicationDate":"2025-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143799052","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
CSA: Cross-scale alignment with adaptive semantic aggregation and filter for image–text retrieval
IF 7.5 · CAS Tier 1 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2025-04-05 · DOI: 10.1016/j.patcog.2025.111647
Zheng Liu, Junhao Xu, Shanshan Gao, Zhumin Chen
Due to the inconsistency in feature representations between different modalities, known as the "heterogeneous gap", image–text retrieval (ITR) is a challenging task. To bridge this gap, establishing semantic associations between the visual and textual parts of images and texts has proven to be an effective strategy for ITR. However, existing ITR methods establish fixed-scale semantic associations by aligning visual and textual parts at fixed scales, namely, fixed-scale alignment (FSA). To overcome the limitations of FSA, cross-scale semantic associations, which exist between visual and textual parts at unfixed scales, should be sufficiently captured. Therefore, to improve the performance of current image–text retrieval systems by introducing cross-scale alignment without scale constraints, we propose a novel cross-scale alignment (CSA) framework that strengthens connections between images and texts by thoroughly exploring cross-scale semantic associations. Firstly, to construct scale-adaptable semantic units, an adaptive semantic aggregation algorithm is developed, which generates both position-aware and co-occurrence-aware subsequences and then adaptively merges them according to IoU values. Secondly, to filter out weak semantic associations in both the scale-balanced and scale-unbalanced alignment tasks, an adaptive semantic filter algorithm is presented, which learns two types of mask matrices by adaptively determining boundaries in probability density distributions. Thirdly, to learn accurate image–text similarity, a semantic unit alignment strategy is proposed to freely align visual and textual semantic units across various unfixed scales. Extensive experiments demonstrate the superiority of CSA over state-of-the-art ITR methods. Code available at: https://github.com/xjh0805/CSA.
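As a toy illustration of the IoU-driven merging inside adaptive semantic aggregation, the sketch below treats subsequences as index spans and greedily merges overlapping ones; the span representation, threshold, and greedy strategy are our assumptions, not the released CSA code.

```python
# Toy IoU-based span merging (assumed details, not the CSA implementation):
# subsequences are (start, end) index spans, merged when their IoU exceeds
# a threshold, yielding scale-adaptable semantic units.
def span_iou(a, b):
    """a, b: (start, end) index spans, end exclusive."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def merge_spans(spans, thresh=0.5):
    merged = []
    for s in sorted(spans):
        if merged and span_iou(merged[-1], s) >= thresh:
            last = merged.pop()
            merged.append((min(last[0], s[0]), max(last[1], s[1])))
        else:
            merged.append(s)
    return merged

print(merge_spans([(0, 4), (1, 5), (8, 10)]))  # -> [(0, 5), (8, 10)]
```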
{"title":"CSA: Cross-scale alignment with adaptive semantic aggregation and filter for image–text retrieval","authors":"Zheng Liu ,&nbsp;Junhao Xu ,&nbsp;Shanshan Gao ,&nbsp;Zhumin Chen","doi":"10.1016/j.patcog.2025.111647","DOIUrl":"10.1016/j.patcog.2025.111647","url":null,"abstract":"<div><div>Due to the inconsistency in feature representations between different modalities, known as the “Heterogeneous gap”, image–text retrieval (ITR) is a challenging task. To bridge this gap, establishing semantic associations between visual and textual parts of images and texts has been proven to be an effective strategy for the ITR task. However, existing ITR methods focus on establishing fixed-scale semantic associations by aligning visual and textual parts at fixed scales, namely, fixed-scale alignment (FSA). To overcome the limitations of FSA, cross-scale semantic associations, which exist between visual and textual parts at unfixed scales, should be sufficiently captured. Therefore, to achieve the objective of improving the performance of current image–text retrieval systems by introducing cross-scale alignment without scale constraints, we propose a novel cross-scale alignment (CSA) framework to strengthen connections between images and texts via thoroughly exploring cross-scale semantic associations. Firstly, to construct scale-adaptable semantic units, an adaptive semantic aggregation algorithm is developed, which generates both position-aware and co-occurrence-aware subsequences, and then adaptively merges them according to IoU values. Secondly, to filter out weak semantic associations in both the scale-balanced and scale-unbalanced alignment tasks, an adaptive semantic filter algorithm is presented, which learns two types of mask matrices by adaptively determining boundaries in probability density distributions. Thirdly, to learn accurate image–text similarity, a semantic unit alignment strategy is proposed to freely align visual and textual semantic units across various unfixed scales. Extensive experiments demonstrate the superiority of CSA over state-of-the-art ITR methods. Code available at: <span><span>https://github.com/xjh0805/CSA</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"165 ","pages":"Article 111647"},"PeriodicalIF":7.5,"publicationDate":"2025-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143792228","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Stochastic limited memory bundle algorithm for clustering in big data
IF 7.5 · CAS Tier 1 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2025-04-05 · DOI: 10.1016/j.patcog.2025.111654
Napsu Karmitsa, Ville-Pekka Eronen, Marko M. Mäkelä, Tapio Pahikkala, Antti Airola
Clustering is a crucial task in data mining and machine learning. In this paper, we propose an efficient algorithm, Big-Clust, for solving minimum sum-of-squares clustering problems in large and big datasets. We first develop a novel stochastic limited memory bundle algorithm (SLMBA) for large-scale nonsmooth finite-sum optimization problems and then formulate the clustering problem accordingly. The Big-Clust algorithm, a stochastic adaptation of the incremental clustering methodology, aims to find the global or a high-quality local solution of the clustering problem. It detects good starting points, i.e., initial cluster centers, for SLMBA, which is applied as the underlying solver. We evaluate Big-Clust on several real-world datasets with numerous data points and features, comparing its performance with other clustering algorithms designed for large and big data. Numerical results demonstrate the efficiency of the proposed algorithm and the high quality of the found solutions, on par with the best existing methods.
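The minimum sum-of-squares clustering problem that Big-Clust targets can be stated as the nonsmooth finite-sum program

$$\min_{c_1,\dots,c_k\in\mathbb{R}^d}\;\frac{1}{n}\sum_{i=1}^{n}\,\min_{j=1,\dots,k}\bigl\|x_i-c_j\bigr\|_2^2,$$

where the inner pointwise minimum makes every summand nonsmooth, and sampling mini-batches of the outer sum yields the stochastic finite-sum structure that SLMBA is built to exploit.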
{"title":"Stochastic limited memory bundle algorithm for clustering in big data","authors":"Napsu Karmitsa ,&nbsp;Ville-Pekka Eronen ,&nbsp;Marko M. Mäkelä ,&nbsp;Tapio Pahikkala ,&nbsp;Antti Airola","doi":"10.1016/j.patcog.2025.111654","DOIUrl":"10.1016/j.patcog.2025.111654","url":null,"abstract":"<div><div>Clustering is a crucial task in data mining and machine learning. In this paper, we propose an efficient algorithm, <span>Big-Clust</span>, for solving minimum sum-of-squares clustering problems in large and big datasets. We first develop a novel stochastic limited memory bundle algorithm (<span>SLMBA</span>) for large-scale nonsmooth finite-sum optimization problems and then formulate the clustering problem accordingly. The <span>Big-Clust</span>algorithm — a stochastic adaptation of the incremental clustering methodology — aims to find the global or a high-quality local solution for the clustering problem. It detects good starting points, i.e., initial cluster centers, for the <span>SLMBA</span>, applied as an underlying solver. We evaluate <span>Big-Clust</span>on several real-world datasets with numerous data points and features, comparing its performance with other clustering algorithms designed for large and big data. Numerical results demonstrate the efficiency of the proposed algorithm and the high quality of the found solutions on par with the best existing methods.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"165 ","pages":"Article 111654"},"PeriodicalIF":7.5,"publicationDate":"2025-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143807260","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Boundary-aware and cross-modal fusion network for enhanced multi-modal brain tumor segmentation
IF 7.5 · CAS Tier 1 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2025-04-04 · DOI: 10.1016/j.patcog.2025.111637
Tongxue Zhou
In recent years, brain tumor segmentation has emerged as a critical area of focus in medical image analysis. Accurate tumor delineation is essential for effective treatment planning and patient monitoring. Many existing algorithms struggle with accurately delineating complex tumor boundaries, particularly in cases where tumors exhibit heterogeneous features or blend with surrounding healthy tissues. In this paper, I propose a novel boundary-aware multi-modal brain tumor segmentation network, which integrates four key contributions to improve segmentation accuracy. First, I introduce a Boundary Extraction Module (BEM) to capture essential boundary information for segmentation. Second, I present a Boundary Guidance Module (BGM) to guide the segmentation process by incorporating boundary-specific information. Third, I design a Boundary Supervision Module (BSM) to enhance segmentation accuracy by providing multi-level boundary supervision. Lastly, I propose a Cross-feature Fusion (CFF) that integrates complementary information from different MRI modalities to enhance overall segmentation performance. Experimental results demonstrate that the proposed model outperforms state-of-the-art methods, achieving superior tumor segmentation accuracy across brain tumor segmentation datasets, thereby indicating its potential for clinical applications in neuroimaging.
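One common way to obtain boundary supervision of the kind modules like BEM and BSM rely on is the morphological gradient of the ground-truth mask. The sketch below is a hedged illustration of that generic recipe, not the paper's module; the window size and pooling-based morphology are our assumptions.

```python
# Toy boundary-map extraction via the morphological gradient (dilation minus
# erosion): leaves a thin contour that can supervise a boundary head at
# several decoder levels. Illustrative only, not the paper's BEM.
import torch
import torch.nn.functional as F

def boundary_map(mask: torch.Tensor, width: int = 3) -> torch.Tensor:
    """mask: (B, 1, H, W) binary ground truth; returns a thin boundary map."""
    pad = width // 2
    dilated = F.max_pool2d(mask, width, stride=1, padding=pad)
    eroded = -F.max_pool2d(-mask, width, stride=1, padding=pad)
    return (dilated - eroded).clamp(0, 1)

gt = torch.zeros(1, 1, 64, 64)
gt[..., 16:48, 16:48] = 1.0
print(boundary_map(gt).sum())  # only pixels near the square's contour fire
```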
Citations: 0
Implicit Image-to-Image Schrödinger Bridge for image restoration
IF 7.5 · CAS Tier 1 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2025-04-04 · DOI: 10.1016/j.patcog.2025.111627
Yuang Wang, Siyeop Yoon, Pengfei Jin, Matthew Tivnan, Sifan Song, Zhennong Chen, Rui Hu, Li Zhang, Quanzheng Li, Zhiqiang Chen, Dufan Wu
Diffusion-based models have demonstrated remarkable effectiveness in image restoration tasks; however, their iterative denoising process, which starts from Gaussian noise, often leads to slow inference. The Image-to-Image Schrödinger Bridge (I2SB) offers a promising alternative by initializing the generative process from corrupted images while leveraging training techniques from score-based diffusion models. In this paper, we introduce the Implicit Image-to-Image Schrödinger Bridge (I3SB) to further accelerate the generative process of I2SB. I3SB restructures the generative process into a non-Markovian framework by incorporating the initial corrupted image at each generative step, effectively preserving and utilizing its information. To enable direct use of pretrained I2SB models without additional training, we ensure consistency in marginal distributions. Extensive experiments across many image corruptions (noise, low resolution, JPEG compression, and sparse sampling) and multiple image modalities (natural, human face, and medical images) demonstrate the acceleration benefits of I3SB. Compared to I2SB, I3SB achieves the same perceptual quality with fewer generative steps, while maintaining or improving fidelity to the ground truth.
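To make the non-Markovian step concrete, one illustrative way to write such an update (a hedged sketch in the spirit of DDIM-style samplers; the paper's exact coefficients and parameterization may differ) is

$$x_{t-1}\;=\;\alpha_t\,\hat{x}_0(x_t,t)\;+\;\beta_t\,x_N\;+\;\gamma_t\,\varepsilon_t,\qquad \varepsilon_t\sim\mathcal{N}(0,I),$$

where $\hat{x}_0$ is the network's clean-image estimate, $x_N$ is the initial corrupted image that re-enters every step, and $(\alpha_t,\beta_t,\gamma_t)$ are chosen so that the marginals of $x_{t-1}$ match those of I2SB, the consistency condition that lets pretrained I2SB models be reused without retraining.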
{"title":"Implicit Image-to-Image Schrödinger Bridge for image restoration","authors":"Yuang Wang ,&nbsp;Siyeop Yoon ,&nbsp;Pengfei Jin ,&nbsp;Matthew Tivnan ,&nbsp;Sifan Song ,&nbsp;Zhennong Chen ,&nbsp;Rui Hu ,&nbsp;Li Zhang ,&nbsp;Quanzheng Li ,&nbsp;Zhiqiang Chen ,&nbsp;Dufan Wu","doi":"10.1016/j.patcog.2025.111627","DOIUrl":"10.1016/j.patcog.2025.111627","url":null,"abstract":"<div><div>Diffusion-based models have demonstrated remarkable effectiveness in image restoration tasks; however, their iterative denoising process, which starts from Gaussian noise, often leads to slow inference speeds. The Image-to-Image Schrödinger Bridge (I<span><math><msup><mrow></mrow><mrow><mn>2</mn></mrow></msup></math></span>SB) offers a promising alternative by initializing the generative process from corrupted images while leveraging training techniques from score-based diffusion models. In this paper, we introduce the Implicit Image-to-Image Schrödinger Bridge (I<span><math><msup><mrow></mrow><mrow><mn>3</mn></mrow></msup></math></span>SB) to further accelerate the generative process of I<span><math><msup><mrow></mrow><mrow><mn>2</mn></mrow></msup></math></span>SB. I<span><math><msup><mrow></mrow><mrow><mn>3</mn></mrow></msup></math></span>SB restructures the generative process into a non-Markovian framework by incorporating the initial corrupted image at each generative step, effectively preserving and utilizing its information. To enable direct use of pretrained I<span><math><msup><mrow></mrow><mrow><mn>2</mn></mrow></msup></math></span>SB models without additional training, we ensure consistency in marginal distributions. Extensive experiments across many image corruptions—including noise, low resolution, JPEG compression, and sparse sampling—and multiple image modalities—such as natural, human face, and medical images— demonstrate the acceleration benefits of I<span><math><msup><mrow></mrow><mrow><mn>3</mn></mrow></msup></math></span>SB. Compared to I<span><math><msup><mrow></mrow><mrow><mn>2</mn></mrow></msup></math></span>SB, I<span><math><msup><mrow></mrow><mrow><mn>3</mn></mrow></msup></math></span>SB achieves the same perceptual quality with fewer generative steps, while maintaining or improving fidelity to the ground truth.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"165 ","pages":"Article 111627"},"PeriodicalIF":7.5,"publicationDate":"2025-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143792226","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0