Multimodal knowledge graph completion (MKGC) has been a popular research topic in recent years. However, existing methods rarely consider the alignment of different entity modalities during multimodal fusion and often pay insufficient attention to the semantic information conveyed by relations, resulting in unsatisfactory completion performance. To address these two issues, we propose a new MKGC model called C2RS. This model first designs a cross-modal consistency contrastive learning task to align different entity modalities for accurate entity representation. Then, C2RS develops a relation semantic encoding module based on the distributions of knowledge graph (KG) triples to extract the semantic information of relations for comprehensive relation representation. Finally, we encode the candidate triples with a triple encoder and identify the correct entities through a scoring function to complete the multimodal KG. Extensive experiments on three public MKGC datasets show that C2RS clearly outperforms the baseline methods.
{"title":"C2RS: Multimodal Knowledge Graph Completion With Cross-Modal Consistency and Relation Semantics","authors":"Yulou Shu;Wengen Li;Jiaqi Wang;Yichao Zhang;Jihong Guan;Shuigeng Zhou","doi":"10.1109/TAI.2025.3548621","DOIUrl":"https://doi.org/10.1109/TAI.2025.3548621","url":null,"abstract":"Multimodal knowledge graph completion (MKGC) has been a popular research topic in recent years. However, existing methods rarely consider the alignment of different entity modalities in the process of multimodal fusion, and often lack sufficient attention to the semantic information conveyed by relations, thus resulting in unsatisfactory completion performance. To address these two issues, we propose a new MKGC model called C<sup>2</sup>RS. This model first designs a cross-modal consistency contrastive learning task to align different entity modalities for accurate entity representation. Then, C<sup>2</sup>RS develops a relation semantic encoding module based on the distributions of knowledge graph (KG) triples to extract the semantic information of relations for comprehensive relation representation. Finally, we encode the candidate triples with a triple encoder and identify the correct entities through a scoring function to complete the multimodal KG. According to the extensive experiments on three public MKGC datasets, C<sup>2</sup>RS obviously outperforms the baseline methods.","PeriodicalId":73305,"journal":{"name":"IEEE transactions on artificial intelligence","volume":"6 11","pages":"2940-2952"},"PeriodicalIF":0.0,"publicationDate":"2025-03-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145456009","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Retinal diseases, such as glaucoma, age-related macular degeneration, and high myopia, are major contributors to global vision loss, emphasizing the need for early detection and intervention. Current deep learning approaches for diagnosing retinal diseases using fundus images primarily focus on single-disease classification due to the scarcity and expense of diverse datasets. This limitation restricts their generalization across multiple ocular diseases and impedes transfer learning to untrained disease types. In this article, we introduce a novel model-agnostic meta-learning framework, called mirror meta-learning (MML), which incorporates an autoencoder module to supervise the backpropagation path in few-shot learning, enhancing model initialization and adaptation. MML’s effectiveness is validated using four publicly available retinal disease binary classification datasets and a proprietary high myopia dataset. In addition, MML demonstrates robustness when tested on three well-established few-shot learning datasets. Our results show the proposed model’s superiority in terms of performance and generalizability in ocular disease classification tasks.
{"title":"Learn to Learn: A Mirror Meta-Learning Method for Retinal Disease Diagnosis on Fundus Images","authors":"Haoran Peng;Jianqiang Li;Wenxiu Cheng;Linna Zhao;Yu Guan;Zhaosheng Li;Li Li;Xi Xu","doi":"10.1109/TAI.2025.3566082","DOIUrl":"https://doi.org/10.1109/TAI.2025.3566082","url":null,"abstract":"Retinal diseases, such as glaucoma, age-related macular degeneration, and high myopia, are major contributors to global vision loss, emphasizing the need for early detection and intervention. Current deep learning approaches for diagnosing retinal diseases using fundus images primarily focus on single-disease classification due to the scarcity and expense of diverse datasets. This limitation restricts their generalization across multiple ocular diseases and impedes transfer learning to untrained disease types. In this article, we introduce a novel model-agnostic meta-learning framework, called mirror meta-learning (MML), which incorporates an autoencoder module to supervise the backpropagation path in few-shot learning, enhancing model initialization and adaptation. MML’s effectiveness is validated using four publicly available retinal disease binary classification datasets and a proprietary high myopia dataset. In addition, MML demonstrates robustness when tested on three well-established few-shot learning datasets. Our results show the proposed model’s superiority in terms of performance and generalizability in ocular disease classification tasks.","PeriodicalId":73305,"journal":{"name":"IEEE transactions on artificial intelligence","volume":"6 12","pages":"3391-3405"},"PeriodicalIF":0.0,"publicationDate":"2025-03-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145612208","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-03-05 | DOI: 10.1109/TAI.2025.3566923
Yange Li;Fan Yu;Qun Chen
In public spaces, high pedestrian concentrations usually lead to congestion and may even pose trampling risks. Once a specific density threshold is reached, control measures become necessary to regulate pedestrian inflows. Therefore, detecting and identifying crowded areas is crucial for pedestrian flow control, and crowd counting is a key technique for achieving this goal. Recently, researchers have dedicated significant effort to designing convolutional neural networks with various architectures for this problem. However, existing models demand high computing power, especially in extreme situations, making them difficult to run on edge devices such as surveillance computers. In this article, we propose a lightweight crowd counting model with a dynamic convolutional kernel. The model is built on an encoder–decoder structure. The encoder extracts high-quality features through inverted residual layers implemented via MobileNetV2, in which the standard convolutions are replaced by a dynamic convolutional kernel. The decoder generates a density map through upsampling and linear layers. A skip connection structure facilitates information exchange between the encoder and decoder and reduces information loss. Moreover, a training strategy based on curriculum reinforcement learning is presented. This strategy facilitates the integration of samples from diverse datasets, with the difficulty level of each sampling step dynamically adjusted by a reinforcement learning model. In addition, this strategy organizes the training sequence in each iteration on the basis of sample complexity, thereby improving training stability and model performance. Comprehensive experiments demonstrate that our model produces superior results to those of competing methods across several benchmark datasets.
{"title":"Lightweight Dynamic Convolutional Network for Crowd Counting Based on Curriculum Reinforcement Learning","authors":"Yange Li;Fan Yu;Qun Chen","doi":"10.1109/TAI.2025.3566923","DOIUrl":"https://doi.org/10.1109/TAI.2025.3566923","url":null,"abstract":"In public spaces, high pedestrian concentrations usually lead to congestion and may even pose trampling risks. Upon reaching a specific density threshold, implementing control measures becomes necessary to regulate pedestrian inflows. Therefore, detecting and identifying crowded areas is crucial for pedestrian flow control. Crowd counting is a key technique for achieving this goal. Recently, researchers have dedicated significant efforts to designing convolutional neural networks with various architectures for solving this problem. However, the existing models have structures with high computing power requirements for extreme situations, making it difficult for them to run on edge devices such as surveillance computers. In this article, we propose a lightweight crowd counting model with a dynamic convolutional kernel for the crowd counting task. The model is built via an encoder–decoder structure. The encoder extracts high-quality features through inverted residual layers implemented via MobileNetV2, which are replaced by a dynamic convolutional kernel. The decoder generates a density map through upsampling and linear layers. A skip connection structure is added to facilitate information exchange between the codecs and reduce the loss of information. Moreover, a training strategy based on curriculum reinforcement learning is presented. This strategy facilitates the integration of samples from diverse datasets, and the difficulty level of each sampling step is dynamically adjusted with a reinforcement learning model. In addition, this strategy can be used to organize the training sequence in each iteration on the basis of sample complexity, thereby achieving enhanced training stability and improved model performance. Comprehensive experimental evidence demonstrates that our model produces superior outcomes to those of competing methods across several benchmark datasets.","PeriodicalId":73305,"journal":{"name":"IEEE transactions on artificial intelligence","volume":"6 11","pages":"3115-3131"},"PeriodicalIF":0.0,"publicationDate":"2025-03-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145455735","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-03-05 | DOI: 10.1109/TAI.2025.3566925
Yi Zheng;Xuanbin Ding;Xiang Zhao;Xiaoqin Pan;Lei Zhou
The K-nearest neighbors (kNN) algorithm, a cornerstone of supervised learning, relies on similarity measures constrained by real-number-based distance metrics. A critical limitation of traditional kNN research lies in its confinement to the real-number domain, which inherently restricts its ability to model nonlinear feature interactions in high-dimensional data and amplifies sensitivity to feature redundancy and class imbalance. These limitations arise from the inherent linearity and unidimensional nature of real-number representations, which restrict their ability to model complex feature interdependencies. To transcend these limitations, this article proposes ordered pairs of normalized real numbers (OPNs)-kNN, a novel framework grounded in OPNs. Departing from the conventional real-number paradigm, OPNs-kNN constructs feature pairs as multidimensional OPN tuples and employs a generalized OPN-valued metric to explicitly model nonlinear relationships, thereby addressing the inherent shortcomings of real-number-based kNN. Extensive experiments on nine University of California, Irvine (UCI) benchmark datasets (e.g., glass, wines, and seeds) demonstrate that OPNs-kNN achieves statistically significant improvements in classification accuracy, precision, recall, and F1-score compared with traditional kNN and its enhanced variants. This work pioneers a nonreal-number computational framework, showing that moving beyond real-number constraints enables more expressive representations of data relationships and opening new directions for designing robust machine learning models in complex domains.
{"title":"K-Nearest Neighbor Algorithm Based on the Framework of Ordered Pair of Normalized Real Numbers","authors":"Yi Zheng;Xuanbin Ding;Xiang Zhao;Xiaoqin Pan;Lei Zhou","doi":"10.1109/TAI.2025.3566925","DOIUrl":"https://doi.org/10.1109/TAI.2025.3566925","url":null,"abstract":"The K-nearest neighbors (kNNs) algorithm, a cornerstone of supervised learning, relies on similarity measures constrained by real-number-based distance metrics. A critical limitation of traditional kNN research lies in its confinement to the real-number domain, which inherently restricts its ability to model nonlinear feature interactions in high-dimensional data and amplifies sensitivity to feature redundancy and class imbalance. These limitations arise from the inherent linearity and unidimensional nature of real-number representations, which restrict their ability to model complex feature interdependencies. To transcend these limitations, this article proposes ordered pairs of normalized real numbers (OPNs)-kNN, a novel framework grounded in OPNs. Departing from the conventional real-number paradigm, OPNs-kNN constructs feature pairs as multidimensional OPNs tuples and employs a generalized OPNs-valued metric to explicitly model nonlinear relationships, thereby addressing the inherent shortcomings of real-number-based kNN. Extensive experiments on nine University of California, Irvine (UCI) benchmark datasets (e.g., glass, wines, and seeds) demonstrate that OPNs-kNN achieves statistically significant improvements in classification accuracy, precision, recall, and F1-score compared with traditional kNN and its enhanced variants. This work pioneers a nonreal-number computational framework, proving that moving beyond real-number constraints enables more expressive representations of data relationships, opening new directions for designing robust machine learning models in complex domains.","PeriodicalId":73305,"journal":{"name":"IEEE transactions on artificial intelligence","volume":"6 11","pages":"3132-3147"},"PeriodicalIF":0.0,"publicationDate":"2025-03-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145428944","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-03-04 | DOI: 10.1109/TAI.2025.3547878
Mohsen Saffari;Mahdi Khodayar;Mohammad E. Khodayar;Seyed Saeed Fazlhashemi
Accurate fault classification and location are critical to ensure the reliability and resilience of large-scale power distribution systems (PDSs). Existing data-driven works in this area struggle to capture essential space-time correlations of PDS measurements and often rely on deterministic and shallow neural architectures. Furthermore, they encounter challenges such as over-smoothing and the inability to capture deep correlations. To overcome these limitations, a novel deep space-time generative graph convolutional autoencoder (SGGCA) is proposed. First, the PDS is modeled as a space-time graph whose nodes and edges represent the bus measurements and line impedance values, respectively. The proposed SGGCA's encoder captures deep correlations of the space-time graph using a new graph convolution with early connections and identity transformations to mitigate over-smoothing. Our encoder incorporates a new recurrent method to adjust graph convolution parameters without relying on node embeddings along the temporal dimension. Additionally, it incorporates generative modeling by capturing the probability distribution function of the latent representation through a conditional normalizing flow model. The extracted generative space-time features are enhanced by a multi-head attention mechanism to better capture task-relevant characteristics of the PDS measurements. The extracted features are fed to sparse decoders to classify and locate the faults in the PDS. The feature sparsity of the decoders ensures high generalization capacity and avoids overfitting. The proposed method is evaluated on the IEEE 69-bus and 123-bus systems. Compared with state-of-the-art models, it improves fault classification accuracy by 3.33% and 6.26% and fault location accuracy by 6.33% and 5.73% on the respective systems.
{"title":"Deep Graph Convolutional Autoencoder With Conditional Normalizing Flow for Power Distribution Systems Fault Classification and Location","authors":"Mohsen Saffari;Mahdi Khodayar;Mohammad E. Khodayar;Seyed Saeed Fazlhashemi","doi":"10.1109/TAI.2025.3547878","DOIUrl":"https://doi.org/10.1109/TAI.2025.3547878","url":null,"abstract":"Accurate fault classification and location are critical to ensure the reliability and resilience of large-scale power distribution systems (PDSs). The existing data-driven works in this area struggle to capture essential space-time correlations of PDS measurements and often rely on deterministic and shallow neural architectures. Furthermore, they encounter challenges such as over-smoothing and the inability to capture deep correlations. To overcome these limitations, a novel deep space-time generative graph convolutional autoencoder (SGGCA) is proposed. First, the PDS is modeled as a space-time graph where the nodes and edges show the bus measurements and line impedance values, respectively. The proposed SGGCA's encoder captures deep correlations of the space-time graph using a new graph convolution with early connections and identity transformations to mitigate the over-smoothing. Our encoder encompasses a new recurrent method to adjust graph convolution parameters without relying on node embeddings on the temporal dimension. Additionally, it incorporates generative modeling by capturing the probability distribution function of the latent representation through a conditional normalizing flow model. The extracted generative space-time features are enhanced by a multi-head attention mechanism to better capture task-relevant characteristics of the PDS measurements. The extracted features are fed to sparse decoders to classify and locate the faults in the PDS. The feature sparsity of decoders ensures a high generalization capacity and avoids overfitting. The proposed method is evaluated on the IEEE 69-bus and 123-bus systems. It achieves substantial improvements in fault classification accuracy by 3.33% and 6.26% and enhances fault location accuracy by 6.33% and 5.73% for the respective PDSs compared with state-of-the-art models.","PeriodicalId":73305,"journal":{"name":"IEEE transactions on artificial intelligence","volume":"6 9","pages":"2448-2463"},"PeriodicalIF":0.0,"publicationDate":"2025-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144926898","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Intelligent optimization of a solar power tower heliostat field (SPTHF) is critical for harnessing solar energy in various scenarios. However, existing SPTHF optimization methods are typically based on specific geometric layout constraints and assume that each heliostat has the same size and height. As a result, these methods are not flexible or practical in many real-world SPTHF application scenarios. Therefore, this article proposes a novel flexible SPTHF (FSPTHF) model that is more practical and involves fewer assumptions. This model enables the use of different layouts and simultaneous optimization of the parameters of each heliostat. As an FSPTHF can involve hundreds or even thousands of heliostats, optimizing the parameters of all heliostats results in a challenging large-scale optimization problem. To efficiently solve this problem, this article proposes a matrix-based differential evolution algorithm, called HMDE, for large-scale heliostat design. The HMDE uses a matrix-based encoding and representation method to improve optimization accuracy and convergence speed, incorporating two novel designs. First, a dual elite-based mutation method is proposed to enhance the convergence speed of HMDE by learning from multiple elite individuals. Second, a multi-level crossover method is proposed to improve the optimization accuracy and convergence speed by integrating element-level and vector-level crossover based on matrix representation. Extensive experiments were conducted on 30 problem instances based on real-world data with three different layouts and problem dimensions up to 12 000, where state-of-the-art algorithms were used for comparison. The experimental results show that the proposed HMDE can effectively solve large-scale FSPTHF optimization problems.
{"title":"Large-Scale Heliostat Field Optimization for Solar Power Tower System Using Matrix-Based Differential Evolution","authors":"Dan-Ting Duan;Jian-Yu Li;Bing Sun;Xiao-Fang Liu;Qiang Yang;Qi-Jia Jiang;Zhi-Hui Zhan;Sam Kwong;Jun Zhang","doi":"10.1109/TAI.2025.3545813","DOIUrl":"https://doi.org/10.1109/TAI.2025.3545813","url":null,"abstract":"Intelligent optimization of a solar power tower heliostat field (SPTHF) is critical for harnessing solar energy in various scenarios. However, existing SPTHF optimization methods are typically based on specific geometric layout constraints and assume that each heliostat has the same size and height. As a result, these methods are not flexible or practical in many real-world SPTHF application scenarios. Therefore, this article proposes a novel flexible SPTHF (FSPTHF) model that is more practical and involves fewer assumptions. This model enables the use of different layouts and simultaneous optimization of the parameters of each heliostat. As an FSPTHF can involve hundreds or even thousands of heliostats, optimizing the parameters of all heliostats results in a challenging large-scale optimization problem. To efficiently solve this problem, this article proposes a matrix-based differential evolution algorithm, called HMDE, for large-scale heliostat design. The HMDE uses a matrix-based encoding and representation method to improve optimization accuracy and convergence speed, incorporating two novel designs. First, a dual elite-based mutation method is proposed to enhance the convergence speed of HMDE by learning from multiple elite individuals. Second, a multi-level crossover method is proposed to improve the optimization accuracy and convergence speed by integrating element-level and vector-level crossover based on matrix representation. Extensive experiments were conducted on 30 problem instances based on real-world data with three different layouts and problem dimensions up to 12 000, where state-of-the-art algorithms were used for comparison. The experimental results show that the proposed HMDE can effectively solve large-scale FSPTHF optimization problems.","PeriodicalId":73305,"journal":{"name":"IEEE transactions on artificial intelligence","volume":"6 9","pages":"2422-2436"},"PeriodicalIF":0.0,"publicationDate":"2025-03-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10908719","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144926905","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Diverse training partners in multiagent tasks are crucial for training a robust and adaptable cooperative agent. Prior methods often rely on state-action information to diversify partners’ behaviors, but this can lead to minor variations instead of diverse behaviors and solutions. We address this limitation by introducing a novel training objective based on “policy compatibility.” Our method learns diverse behaviors by encouraging agents within a team to be compatible with each other while being incompatible with agents from other teams. We theoretically prove that incompatible policies are inherently dissimilar, allowing us to use policy compatibility as a proxy for diversity. We call this method learning incompatible policies for $n$-player cooperative games ($n$-LIPO). We propose to further diversify individual policies by incorporating a mutual information objective using state-action information. We empirically demonstrate that $n$-LIPO effectively generates diverse joint policies in various two-player and multi-player cooperative environments. In a complex cooperative task, two-player multi-recipe Overcooked, we find that $n$-LIPO generates a population of behaviorally diverse partners. These populations are then used to train robust generalist agents that can generalize better than using baseline populations. Finally, we demonstrate that $n$-LIPO can be applied to a high-dimensional StarCraft multiagent challenge (SMAC) multiplayer cooperative environment to discover diverse winning strategies when only a single goal exists. Additional visualization can also be accessed at https://sites.google.com/view/n-lipo/home.
{"title":"$n$-LIPO: Framework for Diverse Cooperative Agent Generation Using Policy Compatibility","authors":"Rujikorn Charakorn;Poramate Manoonpong;Nat Dilokthanakul","doi":"10.1109/TAI.2025.3566067","DOIUrl":"https://doi.org/10.1109/TAI.2025.3566067","url":null,"abstract":"Diverse training partners in multiagent tasks are crucial for training a robust and adaptable cooperative agent. Prior methods often rely on state-action information to diversify partners’ behaviors, but this can lead to minor variations instead of diverse behaviors and solutions. We address this limitation by introducing a novel training objective based on “policy compatibility.” Our method learns diverse behaviors by encouraging agents within a team to be compatible with each other while being incompatible with agents from other teams. We theoretically prove that incompatible policies are inherently dissimilar, allowing us to use policy compatibility as a proxy for diversity. We call this method <italic>learning incompatible policies for</i> <inline-formula><tex-math>$n$</tex-math></inline-formula> <italic>-player cooperative games</i> (<inline-formula><tex-math>$n$</tex-math></inline-formula>-LIPO). We propose to further diversify individual policies by incorporating a mutual information objective using state-action information. We empirically demonstrate that <inline-formula><tex-math>$n$</tex-math></inline-formula>-LIPO effectively generates diverse joint policies in various two-player and multi-player cooperative environments. In a complex cooperative task, two-player multi-recipe Overcooked, we find that <inline-formula><tex-math>$n$</tex-math></inline-formula>-LIPO generates a population of behaviorally diverse partners. These populations are then used to train robust generalist agents that can generalize better than using baseline populations. Finally, we demonstrate that <inline-formula><tex-math>$n$</tex-math></inline-formula>-LIPO can be applied to a high-dimensional StarCraft multiagent challenge (SMAC) multiplayer cooperative environment to discover diverse winning strategies when only a single goal exists. Additional visualization can also be accessed at <uri>https://sites.google.com/view/n-lipo/home</uri>.","PeriodicalId":73305,"journal":{"name":"IEEE transactions on artificial intelligence","volume":"6 11","pages":"3100-3114"},"PeriodicalIF":0.0,"publicationDate":"2025-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145455994","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-02-28 | DOI: 10.1109/TAI.2025.3545791
Ling Xiao;Toshihiko Yamasaki
This article addresses fine-grained fashion image retrieval (FIR), which aims at the detailed and precise retrieval of fashion items from extensive databases. Conventional fine-grained FIR methods design complex attention modules to enhance attribute-aware feature discrimination. However, they often ignore the multiview characteristics of real-world fashion data, leading to diminished model accuracy. Furthermore, our empirical analysis revealed that the straightforward application of standard contrastive learning methods to fine-grained FIR often yields suboptimal results. To alleviate this issue, we propose a novel weak geometrical distortion-based contrastive learning (GeoDCL) strategy. Specifically, GeoDCL incorporates both a novel positive pair design and a novel contrastive loss. GeoDCL can be seamlessly integrated into state-of-the-art (SOTA) fine-grained FIR methods during the training stage to enhance performance during inference. When GeoDCL is applied, the model structures of SOTA methods require no modifications. Additionally, GeoDCL is not utilized during inference, ensuring no increase in inference time. Experiments on the FashionAI, DeepFashion, and Zappos50K datasets verified GeoDCL's effectiveness in consistently improving SOTA models. In particular, GeoDCL drastically improved ASENet_V2 from 60.76% to 66.48% in mAP on the FashionAI dataset.
{"title":"GeoDCL: Weak Geometrical Distortion-Based Contrastive Learning for Fine-Grained Fashion Image Retrieval","authors":"Ling Xiao;Toshihiko Yamasaki","doi":"10.1109/TAI.2025.3545791","DOIUrl":"https://doi.org/10.1109/TAI.2025.3545791","url":null,"abstract":"This article addresses fine-grained fashion image retrieval (FIR), which aims at the detailed and precise retrieval of fashion items from extensive databases. Conventional fine-grained FIR methods design complex attention modules to enhance attribute-aware feature discrimination. However, they often ignore the multiview characteristics of real-world fashion data, leading to diminished model accuracy. Furthermore, our empirical analysis revealed that the straightforward application of standard contrastive learning methods to fine-grained FIR often yields suboptimal results. To alleviate this issue, we propose a novel weak geometrical distortion-based contrastive learning (GeoDCL) strategy. Specifically, GeoDCL incorporates both a novel positive pair design and a novel contrastive loss. GeoDCL can be seamlessly integrated into state-of-the-art (SOTA) fine-grained FIR methods during the training stage to enhance performance during inference. When GeoDCL is applied, the model structures of SOTA methods require no modifications. Additionally, GeoDCL is not utilized during inference, ensuring no increase in inference time. Experiments on the FashionAI, DeepFashion, and Zappos50K datasets verified GeoDCL's effectiveness in consistently improving SOTA models. In particular, GeoDCL drastically improved ASENet_V2 from 60.76% to 66.48% in mAP on the FashionAI dataset.","PeriodicalId":73305,"journal":{"name":"IEEE transactions on artificial intelligence","volume":"6 9","pages":"2409-2421"},"PeriodicalIF":0.0,"publicationDate":"2025-02-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144926894","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}