Similarity measure for complex non-linear Diophantine fuzzy hypersoft set and its application in pattern recognition
Pub Date: 2024-10-28 | DOI: 10.1016/j.ins.2024.121591
AN. Surya, J. Vimala
As a hybrid fuzzy extension of the complex non-linear Diophantine fuzzy set, the complex non-linear Diophantine fuzzy hypersoft set was developed by fusing it with the hypersoft set. To address multi-sub-attributed real-world similarity problems in complex non-linear Diophantine fuzzy environments, this study proposes distance measures and five novel similarity measures for the complex non-linear Diophantine fuzzy hypersoft set: the Jaccard similarity measure, the exponential similarity measure, the cosine similarity measure, a similarity measure based on the cosine function, and a similarity measure based on the cotangent function. Building on the proposed similarity measures, an effective algorithm is provided for handling decision-making problems in pattern recognition, along with an illustrative example of mineral identification. Finally, a detailed comparative study with discussion demonstrates the validity, reliability, robustness, and superiority of the proposed notions and algorithm.
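To give a concrete feel for the kind of measure involved, here is a minimal numpy sketch of a cosine-style similarity between complex-valued grade vectors and its use in a nearest-pattern assignment. The vectors, pattern names, and the reduction of each object to a single membership vector are illustrative assumptions; the paper's full structure also carries non-membership grades, reference parameters, and sub-attribute decompositions.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two complex-valued grade vectors.

    `a`, `b` hold membership grades over the same sub-attributes (a
    simplification of the paper's hypersoft structure).
    """
    num = np.abs(np.vdot(a, b))                 # |<a, b>| with conjugation
    den = np.linalg.norm(a) * np.linalg.norm(b)
    return float(num / den) if den else 0.0

# Toy pattern-recognition step: assign a sample to the most similar pattern.
# The mineral names and grade values below are made up for illustration.
patterns = {
    "mineral_A": np.array([0.8 * np.exp(1j * 0.2), 0.3 * np.exp(1j * 0.5)]),
    "mineral_B": np.array([0.2 * np.exp(1j * 1.1), 0.9 * np.exp(1j * 0.4)]),
}
sample = np.array([0.75 * np.exp(1j * 0.25), 0.35 * np.exp(1j * 0.45)])
best = max(patterns, key=lambda k: cosine_similarity(patterns[k], sample))
print(best)  # -> "mineral_A"
```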
{"title":"Similarity measure for complex non-linear Diophantine fuzzy hypersoft set and its application in pattern recognition","authors":"AN. Surya, J. Vimala","doi":"10.1016/j.ins.2024.121591","DOIUrl":"10.1016/j.ins.2024.121591","url":null,"abstract":"<div><div>As a hybrid fuzzy extension of the complex non-linear Diophantine fuzzy set, the complex non-linear Diophantine fuzzy hypersoft set was developed by fusing it with the hypersoft set. To address multi-sub-attributed real-world similarity problems within complex non-linear Diophantine fuzzy ambiance, this study proposes distance measures and five innovative similarity measures such as Jaccard similarity measure, exponential similarity measure, cosine similarity measure, similarity measure based on cos function, and similarity measure based on cot function for complex non-linear Diophantine fuzzy hypersoft set. Furthermore, based on proposed similarity measures, a highly effective algorithm is provided for handling decision-making issues exquisitely in the pattern recognition field, along with an illustrative example of mineral identification. Then, to demonstrate the validity, reliability, robustness, and superiority of the proposed notion and algorithm, a detailed comparative study with proper discussion has been presented in the study.</div></div>","PeriodicalId":51063,"journal":{"name":"Information Sciences","volume":"690 ","pages":"Article 121591"},"PeriodicalIF":8.1,"publicationDate":"2024-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142552638","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
EMD-based ultraviolet radiation prediction for sport events recommendation with environmental constraint
Pub Date: 2024-10-28 | DOI: 10.1016/j.ins.2024.121592
Ping Liu , Yazhou Song , Junjie Hou , Yanwei Xu
With rising awareness of health and wellness, accurate ultraviolet (UV) radiation forecasts have become crucial for planning and conducting outdoor activities safely, particularly for scheduling and recommending global sporting events under strict environmental constraints. The dynamic nature of UV exposure, influenced by factors such as solar zenith angle, cloud cover, and atmospheric conditions, makes accurate UV radiation forecasting challenging. To cope with these challenges, we present an approach for predicting the UV radiation level of a given region during a given time period using Empirical Mode Decomposition (EMD), a robust method for analyzing non-linear and non-stationary data. Our model is specifically designed for urban areas, where outdoor events are common, and integrates meteorological data with historical UV radiation records from specific geographic regions and time periods. The EMD-based model offers precise predictions of UV levels, essential for event organizers and city planners to make informed decisions about scheduling, relocating, and recommending events so as to minimize the health risks associated with UV exposure. Finally, the effectiveness of the model is validated through experiments across different spatial and temporal contexts on the Urban-Air dataset (2,891,393 Air Quality Index records covering four major Chinese cities), demonstrating its adaptability and accuracy under diverse environmental conditions.
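As a rough illustration of the decompose-then-forecast pattern the abstract describes, the sketch below splits a series into intrinsic mode functions (IMFs) using the third-party PyEMD package and extrapolates each IMF with a simple AR(1)-style persistence model. The per-IMF predictor, the horizon, and the omission of meteorological covariates are our simplifications; the paper's actual forecasting model is not specified here.

```python
import numpy as np
from PyEMD import EMD  # third-party package, installed as "EMD-signal"

def emd_forecast(series: np.ndarray, horizon: int = 24) -> np.ndarray:
    """Decompose a UV-radiation series into IMFs, extrapolate each IMF with
    a crude AR(1) fit, and sum the per-IMF forecasts.

    This is a generic EMD forecasting skeleton, not the paper's model; the
    AR(1) extrapolation will damp oscillations and flatten the trend residue.
    """
    imfs = EMD().emd(series)                 # rows: IMFs (last row ~ residue)
    forecast = np.zeros(horizon)
    for imf in imfs:
        # Least-squares AR(1) coefficient: x[t] ~ phi * x[t-1].
        phi = imf[1:] @ imf[:-1] / (imf[:-1] @ imf[:-1] + 1e-12)
        last, part = imf[-1], []
        for _ in range(horizon):
            last = phi * last
            part.append(last)
        forecast += np.array(part)
    return forecast

# Usage on a synthetic hourly UV-index-like signal.
t = np.arange(24 * 14, dtype=float)
uv = np.maximum(0.0, 5 * np.sin(2 * np.pi * t / 24) + 0.2 * np.random.randn(t.size))
print(emd_forecast(uv, horizon=24))
```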
{"title":"EMD-based ultraviolet radiation prediction for sport events recommendation with environmental constraint","authors":"Ping Liu , Yazhou Song , Junjie Hou , Yanwei Xu","doi":"10.1016/j.ins.2024.121592","DOIUrl":"10.1016/j.ins.2024.121592","url":null,"abstract":"<div><div>With the rising awareness of health and wellness, accurate ultraviolet (UV) radiation forecasts have become crucial for planning and conducting outdoor activities safely, particularly in the context of global sporting events arrangement and recommendation with definite constraint on environmental conditions. The dynamic nature of UV exposure, influenced by factors such as solar zenith angles, cloud cover, and atmospheric conditions, makes accurate UV radiation data forecasting difficult and challenging. To cope with these challenges, we present an innovative approach for predicting the UV radiation levels of a certain region during a certain time period using Empirical Mode Decomposition (EMD), a robust method for analyzing non-linear and non-stationary data. Our model is specifically designed for urban areas, where outdoor events are common, and integrates meteorological data with historical UV radiation records from specific geographic regions and time periods. The EMD-based model offers precise predictions of UV levels, essential for event organizers and city planners to make informed decisions regarding the scheduling, relocation and recommendation of events to minimize health risks associated with UV exposure. At last, the effectiveness of this model is validated through various experiments across different spatial and temporal contexts based on the Urban-Air dataset (recording 2,891,393 Air Quality Index data that cover four major Chinese cities), demonstrating its adaptability and accuracy under diverse environmental conditions.</div></div>","PeriodicalId":51063,"journal":{"name":"Information Sciences","volume":"690 ","pages":"Article 121592"},"PeriodicalIF":8.1,"publicationDate":"2024-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142561010","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Detecting fuzzy-rough conditional anomalies
Pub Date: 2024-10-28 | DOI: 10.1016/j.ins.2024.121560
Qian Hu , Zhong Yuan , Jusheng Mi , Jun Zhang
The purpose of conditional anomaly detection is to identify samples that deviate significantly from the majority of other samples under specific conditions within a dataset. It has been successfully applied to numerous practical scenarios such as forest fire prevention, gas well leakage detection, and remote sensing data analysis. To address conditional anomaly detection, this paper exploits fuzzy rough set theory to construct a method that can effectively handle numerical and mixed-attribute data. By defining the fuzzy inner boundary, the contextual data subset is first divided into two parts: the fuzzy lower approximation and the fuzzy inner boundary. The fuzzy inner boundary is then further divided into two distinct segments: the fuzzy abnormal boundary and the fuzzy main boundary. In this way, three-way regions are obtained: the fuzzy abnormal boundary, the fuzzy main boundary, and the fuzzy lower approximation. A fuzzy-rough conditional anomaly detection model is then constructed from these three-way regions. Finally, an algorithm is proposed for the detection model, and its effectiveness is verified through experiments on data.
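A minimal numpy sketch of the fuzzy-rough machinery involved: the standard fuzzy lower approximation, followed by an illustrative three-way split. The thresholds and the exact rule for subdividing the inner boundary are our assumptions for illustration, not the paper's construction.

```python
import numpy as np

def fuzzy_lower_approx(R: np.ndarray, A: np.ndarray) -> np.ndarray:
    """Standard fuzzy-rough lower approximation:
    (R ↓ A)(x) = min_y max(1 - R(x, y), A(y)).

    R: (n, n) fuzzy similarity relation between samples.
    A: (n,) fuzzy membership of the contextual data subset.
    """
    return np.min(np.maximum(1.0 - R, A[None, :]), axis=1)

def three_way_regions(R, A, alpha=0.7, beta=0.3):
    """Illustrative three-way split of the contextual subset.

    Thresholds alpha/beta are our assumptions: samples whose lower-
    approximation degree is high form the fuzzy lower approximation; the
    rest (the fuzzy inner boundary) are subdivided into the fuzzy main
    boundary and the fuzzy abnormal boundary.
    """
    low = fuzzy_lower_approx(R, A)
    lower_region = low >= alpha            # fuzzy lower approximation
    inner = ~lower_region                  # fuzzy inner boundary
    abnormal = inner & (low < beta)        # fuzzy abnormal boundary
    main = inner & (low >= beta)           # fuzzy main boundary
    return lower_region, main, abnormal
```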
{"title":"Detecting fuzzy-rough conditional anomalies","authors":"Qian Hu , Zhong Yuan , Jusheng Mi , Jun Zhang","doi":"10.1016/j.ins.2024.121560","DOIUrl":"10.1016/j.ins.2024.121560","url":null,"abstract":"<div><div>The purpose of conditional anomaly detection is to identify samples that significantly deviate from the majority of other samples under specific conditions within a dataset. It has been successfully applied to numerous practical scenarios such as forest fire prevention, gas well leakage detection, and remote sensing data analysis. Aiming at the issue of conditional anomaly detection, this paper utilizes the characteristics of fuzzy rough set theory to construct a conditional anomaly detection method that can effectively handle numerical or mixed attribute data. By defining the fuzzy inner boundary, the subset of contextual data is first divided into two parts, i.e. the fuzzy lower approximation and the fuzzy inner boundary. Subsequently, the fuzzy inner boundary is further divided into two distinct segments: the fuzzy abnormal boundary and the fuzzy main boundary. So far, three-way regions can be obtained, i.e., the fuzzy abnormal boundary, the fuzzy main boundary, and the fuzzy lower approximation. Then, a fuzzy-rough conditional anomaly detection model is constructed based on the above three-way regions. Finally, a related algorithm is proposed for the detection model and its effectiveness is verified by data experiments.</div></div>","PeriodicalId":51063,"journal":{"name":"Information Sciences","volume":"690 ","pages":"Article 121560"},"PeriodicalIF":8.1,"publicationDate":"2024-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142538568","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Developing Big Data anomaly dynamic and static detection algorithms: AnomalyDSD spark package
Pub Date: 2024-10-26 | DOI: 10.1016/j.ins.2024.121587
Diego García-Gil , David López , Daniel Argüelles-Martino , Jacinto Carrasco , Ignacio Aguilera-Martos , Julián Luengo , Francisco Herrera
Background
Anomaly detection is the process of identifying observations that differ greatly from the majority of the data. Unsupervised anomaly detection aims to find outliers in unlabeled data, where the anomalous instances are unknown. Exponential data generation has led to the era of Big Data. This scenario brings new challenges to classic anomaly detection problems due to the massive and unsupervised accumulation of data. Traditional methods cannot cope with the computing and time requirements of Big Data problems.
Methods
In this paper, we propose four distributed algorithm designs for Big Data anomaly detection problems: HBOS_BD, LODA_BD, LSCP_BD, and XGBOD_BD. They follow the MapReduce distributed methodology so that they can handle Big Data problems (a minimal single-node HBOS sketch is given after the Conclusions below).
Results
These algorithms have been integrated into a Spark package, AnomalyDSD, focused on static and dynamic Big Data anomaly detection tasks. Experiments on a real-world case study have shown the performance and validity of the proposals for Big Data problems.
Conclusions
With this proposal, practitioners can efficiently and effectively detect anomalies in Big Data datasets, where the early detection of an anomaly can lead to a proper and timely decision.
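As promised above, here is a minimal single-node sketch of HBOS, the detector underlying HBOS_BD: per-feature histograms whose negative log densities are summed into an anomaly score. The MapReduce/Spark wrapping that makes this a Big Data algorithm is omitted, so this shows only the local scoring logic.

```python
import numpy as np

def hbos_scores(X: np.ndarray, n_bins: int = 10) -> np.ndarray:
    """Histogram-Based Outlier Score (single node).

    For each feature, build a density histogram; a sample's score is the
    sum over features of the negative log density of its bin. Higher
    scores indicate more anomalous samples.
    """
    n, d = X.shape
    scores = np.zeros(n)
    for j in range(d):
        hist, edges = np.histogram(X[:, j], bins=n_bins, density=True)
        # Map each value to its bin index (inner edges give 0..n_bins-1).
        idx = np.clip(np.digitize(X[:, j], edges[1:-1]), 0, n_bins - 1)
        scores += -np.log(hist[idx] + 1e-12)   # guard empty bins
    return scores

# Usage: the point far from the Gaussian bulk gets the top score.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(500, 3)), [[8.0, 8.0, 8.0]]])
print(np.argmax(hbos_scores(X)))  # -> 500
```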
{"title":"Developing Big Data anomaly dynamic and static detection algorithms: AnomalyDSD spark package","authors":"Diego García-Gil , David López , Daniel Argüelles-Martino , Jacinto Carrasco , Ignacio Aguilera-Martos , Julián Luengo , Francisco Herrera","doi":"10.1016/j.ins.2024.121587","DOIUrl":"10.1016/j.ins.2024.121587","url":null,"abstract":"<div><h3>Background</h3><div>Anomaly detection is the process of identifying observations that differ greatly from the majority of data. Unsupervised anomaly detection aims to find outliers in data that is not labeled, therefore, the anomalous instances are unknown. The exponential data generation has led to the era of Big Data. This scenario brings new challenges to classic anomaly detection problems due to the massive and unsupervised accumulation of data. Traditional methods are not able to cop up with computing and time requirements of Big Data problems.</div></div><div><h3>Methods</h3><div>In this paper, we propose four distributed algorithm designs for Big Data anomaly detection problems: HBOS_BD, LODA_BD, LSCP_BD, and XGBOD_BD. They have been designed following the MapReduce distributed methodology in order to be capable of handling Big Data problems.</div></div><div><h3>Results</h3><div>These algorithms have been integrated into an Spark Package, focused on static and dynamic Big Data anomaly detection tasks, namely AnomalyDSD. Experiments using a real-world case of study have shown the performance and validity of the proposals for Big Data problems.</div></div><div><h3>Conclusions</h3><div>With this proposal, we have enabled the practitioner to efficiently and effectively detect anomalies in Big Data datasets, where the early detection of an anomaly can lead to a proper and timely decision.</div></div>","PeriodicalId":51063,"journal":{"name":"Information Sciences","volume":"690 ","pages":"Article 121587"},"PeriodicalIF":8.1,"publicationDate":"2024-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142552637","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Significance-based decision tree for interpretable categorical data clustering
Pub Date: 2024-10-26 | DOI: 10.1016/j.ins.2024.121588
Lianyu Hu, Mudi Jiang, Xinying Liu, Zengyou He
Numerous clustering algorithms prioritize accuracy, but in high-risk domains the interpretability of clustering methods is crucial as well. The inherent heterogeneity of categorical data makes it particularly challenging for users to comprehend clustering outcomes. Currently, the majority of interpretable clustering methods are tailored to numerical data and rely on decision tree models, leaving interpretable clustering of categorical data a less explored domain. Additionally, existing interpretable clustering algorithms often depend on external, potentially non-interpretable algorithms and lack transparency in the decision-making process during tree construction. In this paper, we tackle interpretable categorical data clustering by growing a decision tree in a statistically meaningful manner. We formulate the evaluation of candidate splits as a multivariate two-sample testing problem, in which a single p-value is derived by combining significance evidence from all individual categories. This approach provides a reliable and controllable way to select the optimal split while determining its statistical significance. Extensive experimental results on real-world data sets demonstrate that our algorithm achieves performance comparable to its counterparts in terms of cluster quality, running efficiency, and explainability.
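To make the split-evaluation idea concrete, here is a sketch that tests each individual category with a 2x2 chi-square test between the two children of a candidate split and combines the per-category p-values with Fisher's method via scipy. The choice of per-category test is our assumption; the paper specifies only that significance evidence from all categories is combined into a single p-value.

```python
import numpy as np
from scipy.stats import chi2_contingency, combine_pvalues

def split_p_value(left: np.ndarray, right: np.ndarray) -> float:
    """Score a candidate split of a categorical column.

    `left` and `right` hold the category values falling in each child.
    Each category contributes one 2x2 chi-square test (its frequency in
    left vs. right); Fisher's method fuses them into a single p-value.
    """
    cats = np.union1d(left, right)
    pvals = []
    for c in cats:
        table = np.array([
            [np.sum(left == c), np.sum(left != c)],
            [np.sum(right == c), np.sum(right != c)],
        ])
        if table.min() == 0:        # skip degenerate tables
            continue
        _, p, _, _ = chi2_contingency(table)
        pvals.append(p)
    if not pvals:
        return 1.0
    return float(combine_pvalues(pvals, method="fisher")[1])

# Usage: a split that separates "a" from "b"/"c" is highly significant.
left = np.array(["a"] * 40 + ["b"] * 5)
right = np.array(["b"] * 30 + ["c"] * 15)
print(split_p_value(left, right))   # small p-value -> accept the split
```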
{"title":"Significance-based decision tree for interpretable categorical data clustering","authors":"Lianyu Hu, Mudi Jiang, Xinying Liu, Zengyou He","doi":"10.1016/j.ins.2024.121588","DOIUrl":"10.1016/j.ins.2024.121588","url":null,"abstract":"<div><div>Numerous clustering algorithms prioritize accuracy, but in high-risk domains, the interpretability of clustering methods is crucial as well. The inherent heterogeneity of categorical data makes it particularly challenging for users to comprehend clustering outcomes. Currently, the majority of interpretable clustering methods are tailored for numerical data and utilize decision tree models, leaving interpretable clustering for categorical data as a less explored domain. Additionally, existing interpretable clustering algorithms often depend on external, potentially non-interpretable algorithms and lack transparency in the decision-making process during tree construction. In this paper, we tackle the problem of interpretable categorical data clustering by growing a decision tree in a statistically meaningful manner. We formulate the evaluation of candidate splits as a multivariate two-sample testing problem, where a single <em>p</em>-value is derived by combining significance evidence from all individual categories. This approach provides a reliable and controllable method for selecting the optimal split while determining its statistical significance. Extensive experimental results on real-world data sets demonstrate that our algorithm achieves comparable performance in terms of cluster quality, running efficiency, and explainability relative to its counterparts.</div></div>","PeriodicalId":51063,"journal":{"name":"Information Sciences","volume":"690 ","pages":"Article 121588"},"PeriodicalIF":8.1,"publicationDate":"2024-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142539058","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A novel self-training framework for semi-supervised soft sensor modeling based on indeterminate variational autoencoder
Pub Date: 2024-10-26 | DOI: 10.1016/j.ins.2024.121565
Hengqian Wang , Lei Chen , Kuangrong Hao , Xin Cai , Bing Wei
In modern industrial processes, the high cost of acquiring labeled data can leave a large number of samples unlabeled, which greatly impacts the accuracy of traditional soft sensor models. To this end, this paper proposes a novel semi-supervised soft sensor framework that fully utilizes the unlabeled data to expand the original labeled data and ultimately improve prediction accuracy. Specifically, an indeterminate variational autoencoder (IVAE) is first proposed to obtain pseudo-labels and their uncertainties for unlabeled data. On this basis, an IVAE-based self-training (ST-IVAE) framework is proposed to expand the original small labeled dataset through repeated self-training cycles. Within this framework, a variance-based oversampling (VOS) strategy is introduced to better exploit the pseudo-label uncertainty. By determining similar sample sets through comparison of the Kullback-Leibler (KL) divergences obtained from the proposed IVAE model, each sample can be modeled independently for prediction. The effectiveness of the proposed semi-supervised framework is verified on two real industrial processes. Comparative results show that the ST-IVAE framework can still predict well even in the presence of missing input data, relative to state-of-the-art methodologies for semi-supervised soft sensing.
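A minimal sketch of what a variance-based oversampling step could look like: pseudo-labeled samples are drawn with weights inversely proportional to their predicted variance, so confident pseudo-labels dominate the expanded training set. The inverse weighting is our assumption; the abstract does not give the exact VOS formula.

```python
import numpy as np

def variance_based_oversample(X_u, y_pseudo, var, n_draws, rng=None):
    """Draw pseudo-labeled samples, favoring low-uncertainty pseudo-labels.

    X_u: (n, d) unlabeled inputs; y_pseudo: (n,) pseudo-labels from the
    model; var: (n,) predicted variances (uncertainties). The inverse-
    variance sampling weight is an illustrative choice, not the paper's.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    w = 1.0 / (var + 1e-8)                      # low variance -> high weight
    idx = rng.choice(len(X_u), size=n_draws, replace=True, p=w / w.sum())
    return X_u[idx], y_pseudo[idx]

# Usage: the low-variance sample is drawn far more often.
X_u = np.arange(6, dtype=float).reshape(3, 2)
y_p = np.array([1.0, 2.0, 3.0])
var = np.array([0.01, 1.0, 1.0])
Xs, ys = variance_based_oversample(X_u, y_p, var, n_draws=100)
print(np.mean(ys == 1.0))   # close to 1
```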
{"title":"A novel self-training framework for semi-supervised soft sensor modeling based on indeterminate variational autoencoder","authors":"Hengqian Wang , Lei Chen , Kuangrong Hao , Xin Cai , Bing Wei","doi":"10.1016/j.ins.2024.121565","DOIUrl":"10.1016/j.ins.2024.121565","url":null,"abstract":"<div><div>In modern industrial processes, the high acquisition cost of labeled data can lead to a large number of unlabeled samples, which greatly impacts the accuracy of traditional soft sensor models. To this end, this paper proposes a novel semi-supervised soft sensor framework that can fully utilize the unlabeled data to expand the original labeled data, and ultimately improve the prediction accuracy. Specifically, an indeterminate variational autoencoder (IVAE) is first proposed to obtain pseudo-labels and their uncertainties for unlabeled data. On this basis, the IVAE-based self-training (ST-IVAE) framework is further naturally proposed to expand the original small labeled dataset through continuous circulation. Among them, a variance-based oversampling (VOS) strategy is introduced to better utilize the pseudo-label uncertainty. By determining similar sample sets through the comparison of Kullback-Leibler (KL) divergence obtained by the proposed IVAE model, each sample can be independently modeled for prediction. The effectiveness of the proposed semi-supervised framework is verified on two real industrial processes. Comparable results illustrate that the ST-IVAE framework can still predict well even in the presence of missing input data compared to state-of-the-art methodologies in addressing semi-supervised soft sensing challenges.</div></div>","PeriodicalId":51063,"journal":{"name":"Information Sciences","volume":"690 ","pages":"Article 121565"},"PeriodicalIF":8.1,"publicationDate":"2024-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142573324","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Some notes on the consequences of pretreatment of multivariate data
Pub Date: 2024-10-24 | DOI: 10.1016/j.ins.2024.121580
Ali S. Hadi , Rida Moustafa
With the advent of data technologies, data comes in various types, such as structured, unstructured, and semi-structured. Performing certain statistical or machine learning techniques may require careful preprocessing or pretreatment of the data to make it suitable for analysis. For example, given a data matrix X, which represents n multivariate observations or cases on p variables or features, the columns/rows of X may be pretreated before applying statistical or machine learning techniques. While centering and/or scaling the variables alters neither the correlation structure nor the graphical representation of the data, centering/scaling the observations does. We investigate various row pretreatment methods more closely and show, with theoretical proofs and numerical examples, that centering/scaling the rows of X changes both the graphical structure of the observations in multi-dimensional space and the correlation structure among the variables. There may be good reasons for performing row centering/scaling, and we are not against it, but analysts who use such row operations should be aware of the changes they induce in the geometrical and correlation structures of the data and should demonstrate that the process yields a new structure that is more appropriate for their questions.
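The paper's central observation is easy to verify numerically. In the numpy snippet below, column standardization leaves the variable correlations unchanged, while row centering does not:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))              # n = 100 cases, p = 3 variables

# Column (variable) centering/scaling: correlations are unchanged.
Xc = (X - X.mean(axis=0)) / X.std(axis=0)
print(np.allclose(np.corrcoef(X, rowvar=False),
                  np.corrcoef(Xc, rowvar=False)))    # True

# Row (observation) centering: correlations DO change.
Xr = X - X.mean(axis=1, keepdims=True)
print(np.allclose(np.corrcoef(X, rowvar=False),
                  np.corrcoef(Xr, rowvar=False)))    # False in general
```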
{"title":"Some notes on the consequences of pretreatment of multivariate data","authors":"Ali S. Hadi , Rida Moustafa","doi":"10.1016/j.ins.2024.121580","DOIUrl":"10.1016/j.ins.2024.121580","url":null,"abstract":"<div><div>With the advent of data technologies, we have various types of data, such as structured, unstructured and semi-structured. Performing certain statistical or machine learning techniques may require careful preprocessing or pretreatment of the data to make them suitable for analysis. For example, given a data matrix <strong>X</strong>, which represents <em>n</em> multivariate observations or cases on <em>p</em> variables or features, the columns/rows of <strong>X</strong> may be pretreated before applying statistical or machine learning techniques to the data. While centering and/or scaling the variables do not alter the correlation structure nor the graphical representation of the data, centering/scaling the observations do. We investigate various row pretreatment methods more closely and show with theoretical proofs and numerical examples that centering/scaling the rows of <strong>X</strong> changes both the graphical structure of the observations in the multi-dimensional space and the correlation structure among the variables. There may be good reasons for performing row centering/scaling on the data and we are not against it, but analysts who use such row operations should be aware of the geometrical and correlation structures one has performed on the data and should also demonstrate that the process results in a new, more appropriate structure for their questions.</div></div>","PeriodicalId":51063,"journal":{"name":"Information Sciences","volume":"690 ","pages":"Article 121580"},"PeriodicalIF":8.1,"publicationDate":"2024-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142538695","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A diversity and reliability-enhanced synthetic minority oversampling technique for multi-label learning
Pub Date: 2024-10-24 | DOI: 10.1016/j.ins.2024.121579
Yanlu Gong , Quanwang Wu , Mengchu Zhou , Chao Chen
The class imbalance issue is generally intrinsic to multi-label datasets because they have a large number of labels and each sample is associated with only a few of them. This biases the trained multi-label classifier towards the majority labels. Multi-label oversampling methods have been proposed to handle this issue, and they fall into clone-based and Synthetic Minority Oversampling TEchnique-based (SMOTE-based) ones. However, the former duplicate minority samples and may result in over-fitting, whereas the latter may generate unreliable synthetic samples. In this work, we propose a Diversity and Reliability-enhanced SMOTE for multi-label learning (DR-SMOTE). In it, the minority classes are determined according to their label imbalance ratios. A reliable minority sample is used as a seed to generate a synthetic one, while a reference sample is selected to confine the synthesis region. Features of the synthetic samples are determined probabilistically within this region, and their labels are set identically to those of the seeds. We conduct experiments on eleven multi-label datasets, comparing DR-SMOTE against seven existing resampling methods with four base multi-label classifiers. The experimental results demonstrate DR-SMOTE's superiority over its peers on several evaluation metrics.
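A minimal sketch of the synthesis step as described: one synthetic point is generated inside the axis-aligned region bounded by a seed and its reference sample, with each feature drawn independently. The uniform per-feature draw is our reading of "determined probabilistically in this region"; the paper's exact distribution is not given in the abstract.

```python
import numpy as np

def synthesize(seed: np.ndarray, reference: np.ndarray, rng=None) -> np.ndarray:
    """Generate one synthetic feature vector between a reliable seed and
    its reference sample.

    Each feature gets its own interpolation weight, so the point lies in
    the hyper-rectangle spanned by the two samples rather than on the
    segment joining them. The synthetic sample's labels are copied from
    the seed (not shown here).
    """
    if rng is None:
        rng = np.random.default_rng(0)
    t = rng.uniform(size=seed.shape)        # per-feature weights in [0, 1)
    return seed + t * (reference - seed)

# Usage with toy feature vectors.
seed = np.array([0.0, 1.0, 2.0])
reference = np.array([1.0, 1.0, 0.0])
print(synthesize(seed, reference))
```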
{"title":"A diversity and reliability-enhanced synthetic minority oversampling technique for multi-label learning","authors":"Yanlu Gong , Quanwang Wu , Mengchu Zhou , Chao Chen","doi":"10.1016/j.ins.2024.121579","DOIUrl":"10.1016/j.ins.2024.121579","url":null,"abstract":"<div><div>The class imbalance issue is generally intrinsic in multi-label datasets due to the fact that they have a large number of labels and each sample is associated with only a few of them. This causes the trained multi-label classifier to be biased towards the majority labels. Multi-label oversampling methods have been proposed to handle this issue, and they fall into clone-based and Synthetic Minority Oversampling TEchnique-based (SMOTE-based) ones. However, the former duplicates minority samples and may result in over-fitting whereas the latter may generate unreliable synthetic samples. In this work, we propose a Diversity and Reliability-enhanced SMOTE for multi-label learning (DR-SMOTE). In it, the minority classes are determined according to their label imbalance ratios. A reliable minority sample is used as a seed to generate a synthetic one while a reference sample is selected for it to confine the synthesis region. Features of the synthetic samples are determined probabilistically in this region and their labels are set identically to those of the seeds. We carry out experiments with eleven multi-label datasets to compare DR-SMOTE against seven existing resampling methods based on four base multi-label classifiers. The experimental results demonstrate DR-SMOTE’s superiority over its peers in terms of several evaluation metrics.</div></div>","PeriodicalId":51063,"journal":{"name":"Information Sciences","volume":"690 ","pages":"Article 121579"},"PeriodicalIF":8.1,"publicationDate":"2024-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142552627","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sample feature enhancement model based on heterogeneous graph representation learning for few-shot relation classification
Pub Date: 2024-10-24 | DOI: 10.1016/j.ins.2024.121583
Zhezhe Xing , Yuxin Ye , Rui Song , Yun Teng , Ziheng Li , Jiawen Liu
Few-Shot Relation Classification (FSRC) aims to predict novel relationships by learning from limited samples. Graph Neural Network (GNN) approaches for FSRC construct the data as graphs, effectively capturing sample features through graph representation learning. However, they often face several challenges: 1) they tend to neglect interactions between samples from different support sets and overlook the implicit noise in labels, leading to sub-optimal sample feature generation; 2) they struggle to deeply mine the diverse semantic information present in FSRC data; 3) over-smoothing and overfitting limit the model's depth and adversely affect overall performance. To address these issues, we propose a Sample Representation Enhancement model based on a Heterogeneous Graph Neural Network (SRE-HGNN) for FSRC. This method leverages inter-sample and inter-class associations (i.e., label mutual attention) to effectively fuse features and generate more expressive sample representations. Edge-heterogeneous GNNs are employed to enhance sample features by capturing heterogeneous information of varying depths through different edge attentions. Additionally, we introduce an attention-based neighbor-node culling method, enabling the model to stack more layers and extract deeper inter-sample associations, thereby improving performance. Finally, experiments on the FSRC task show that SRE-HGNN achieves average accuracy improvements of 1.84% and 1.02% on two public datasets.
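As a toy illustration of the attention-based feature fusion that SRE-HGNN builds on, the numpy sketch below lets support-set samples attend to one another and fuses the attended context back with a residual connection. This is a deliberate simplification: the paper's label mutual attention, edge-heterogeneous layers, and neighbor-culling mechanism are all richer than this sketch, and the residual weight is an assumption.

```python
import numpy as np

def scaled_dot_attention(Q, K, V):
    """Plain scaled dot-product attention with a row-wise softmax."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

# Each support sample attends over the full support set; the attended
# context is fused back into its representation via a residual connection.
rng = np.random.default_rng(0)
support = rng.normal(size=(10, 16))        # 10 support samples, dim 16
context = scaled_dot_attention(support, support, support)
enhanced = support + 0.5 * context         # residual fusion (assumed weight)
print(enhanced.shape)                      # (10, 16)
```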
{"title":"Sample feature enhancement model based on heterogeneous graph representation learning for few-shot relation classification","authors":"Zhezhe Xing , Yuxin Ye , Rui Song , Yun Teng , Ziheng Li , Jiawen Liu","doi":"10.1016/j.ins.2024.121583","DOIUrl":"10.1016/j.ins.2024.121583","url":null,"abstract":"<div><div>Few-Shot Relation Classification (FSRC) aims to predict novel relationships by learning from limited samples. Graph Neural Network (GNN) approaches for FSRC constructs data as graphs, effectively capturing sample features through graph representation learning. However, they often face several challenges: 1) They tend to neglect the interactions between samples from different support sets and overlook the implicit noise in labels, leading to sub-optimal sample feature generation. 2) They struggle to deeply mine the diverse semantic information present in FSRC data. 3) Over-smoothing and overfitting limit the model's depth and adversely affect overall performance. To address these issues, we propose a Sample Representation Enhancement model based on Heterogeneous Graph Neural Network (SRE-HGNN) for FSRC. This method leverages inter-sample and inter-class associations (i.e., label mutual attention) to effectively fuse features and generate more expressive sample representations. Edge-heterogeneous GNNs are employed to enhance sample features by capturing heterogeneous information of varying depths through different edge attentions. Additionally, we introduce an attention-based neighbor node culling method, enabling the model to stack higher levels and extract deeper inter-sample associations, thereby improving performance. Finally, experiments are conducted for the FSRC task, and SRE-HGNN achieves an average accuracy improvement of 1.84% and 1.02% across two public datasets.</div></div>","PeriodicalId":51063,"journal":{"name":"Information Sciences","volume":"690 ","pages":"Article 121583"},"PeriodicalIF":8.1,"publicationDate":"2024-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142539059","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SGO: An innovative oversampling approach for imbalanced datasets using SVM and genetic algorithms
Pub Date: 2024-10-24 | DOI: 10.1016/j.ins.2024.121584
Jianfeng Deng, Dongmei Wang, Jinan Gu, Chen Chen
Imbalanced datasets present a challenging problem in machine learning and artificial intelligence. Since most models assume balanced data distributions, imbalanced positive and negative examples can lead to significant bias in prediction or classification tasks. Current over-sampling methods frequently encounter issues such as overfitting and boundary bias. This paper proposes a novel imbalanced-data augmentation technique called SVM-GA over-sampling (SGO), which integrates Support Vector Machines (SVM) with Genetic Algorithms (GA). Our approach leverages the SVM to identify the decision boundary and uses the GA to generate new minority samples along this boundary, effectively addressing both over-fitting and boundary bias. Experiments validate that SGO outperforms traditional methods on most datasets, providing a novel and effective approach to imbalanced data problems with potential application prospects and generalization value.
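A compact sketch of the SVM-plus-GA idea using scikit-learn: the SVM supplies the decision function, and a toy genetic loop evolves perturbed minority samples toward the boundary (|decision_function| → 0). The GA operators used here (truncation selection, blend crossover, Gaussian mutation) and all hyperparameters are our assumptions; the abstract does not specify them.

```python
import numpy as np
from sklearn.svm import SVC

def sgo_oversample(X, y, n_new=50, generations=30, pop=60, rng=None):
    """Evolve synthetic minority samples toward the SVM decision boundary.

    Assumes a binary problem with label 1 as the minority class. Fitness
    is -|decision_function|, so points nearest the boundary survive.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    clf = SVC(kernel="rbf").fit(X, y)
    minority = X[y == 1]
    # Initial population: perturbed copies of random minority samples.
    P = minority[rng.integers(len(minority), size=pop)]
    P = P + 0.1 * rng.normal(size=P.shape)
    for _ in range(generations):
        fit = -np.abs(clf.decision_function(P))
        parents = P[np.argsort(fit)[-pop // 2:]]          # truncation selection
        i, j = rng.integers(len(parents), size=(2, pop // 2))
        t = rng.uniform(size=(pop // 2, 1))
        children = t * parents[i] + (1 - t) * parents[j]  # blend crossover
        children += 0.05 * rng.normal(size=children.shape)  # Gaussian mutation
        P = np.vstack([parents, children])
    order = np.argsort(np.abs(clf.decision_function(P)))
    return P[order[:n_new]]                               # closest to boundary

# Usage on a toy imbalanced dataset (20 minority vs. 200 majority points).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(2.5, 0.5, (20, 2))])
y = np.array([0] * 200 + [1] * 20)
print(sgo_oversample(X, y, n_new=20).shape)               # (20, 2)
```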
{"title":"SGO: An innovative oversampling approach for imbalanced datasets using SVM and genetic algorithms","authors":"Jianfeng Deng, Dongmei Wang, Jinan Gu, Chen Chen","doi":"10.1016/j.ins.2024.121584","DOIUrl":"10.1016/j.ins.2024.121584","url":null,"abstract":"<div><div>Imbalanced datasets present a challenging problem in machine learning and artificial intelligence. Since most models typically assume balanced data distributions, imbalanced positive and negative examples can lead to significant bias in prediction or classification tasks. Current over-sampling methods frequently encounter issues like overfitting and boundary bias. A novel imbalanced data augmentation technique called SVM-GA over-sampling (SGO) is proposed in this paper, which integrates Support Vector Machines (SVM) with Genetic Algorithms (GA). Our approach leverages SVM to identify the decision boundary and uses GA to generate new minority samples along this boundary, effectively addressing both over-fitting and boundary biases. It has been experimentally validated that SGO outperforms the traditional methods on most datasets, providing a novel and effective approach to address imbalanced data problems, with potential application prospects and generalization value.</div></div>","PeriodicalId":51063,"journal":{"name":"Information Sciences","volume":"690 ","pages":"Article 121584"},"PeriodicalIF":8.1,"publicationDate":"2024-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142538567","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}