
Information Sciences: Latest Articles

Similarity measure for complex non-linear Diophantine fuzzy hypersoft set and its application in pattern recognition
IF 8.1 Zone 1 Computer Science 0 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2024-10-28 DOI: 10.1016/j.ins.2024.121591
AN. Surya, J. Vimala
The complex non-linear Diophantine fuzzy hypersoft set is a hybrid fuzzy extension obtained by fusing the complex non-linear Diophantine fuzzy set with the hypersoft set. To address real-world similarity problems involving multiple sub-attributes in a complex non-linear Diophantine fuzzy environment, this study proposes distance measures and five similarity measures for this set: the Jaccard similarity measure, the exponential similarity measure, the cosine similarity measure, a similarity measure based on the cos function, and a similarity measure based on the cot function. Based on the proposed similarity measures, an algorithm is provided for decision-making problems in pattern recognition, along with an illustrative example of mineral identification. Finally, a detailed comparative study with discussion is presented to demonstrate the validity, reliability, robustness, and superiority of the proposed notion and algorithm.
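The abstract above names Jaccard and cosine similarity measures and applies them to pattern recognition. As a rough illustration only, the sketch below applies the ordinary vector forms of these two measures to plain membership-grade vectors in [0, 1]. It is a simplification and not the paper's definition: the complex phase terms, reference parameters, and hypersoft sub-attributes are ignored, and the names `recognize`, `mineral_A`, and `mineral_B` are hypothetical.

```python
import math

def jaccard_similarity(a, b):
    # Vector Jaccard: a.b / (|a|^2 + |b|^2 - a.b)
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    return 1.0 if denom == 0 else dot / denom  # two all-zero vectors are identical

def cosine_similarity(a, b):
    # Vector cosine: a.b / (|a| |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 0.0 if norm_a == 0 or norm_b == 0 else dot / (norm_a * norm_b)

def recognize(sample, patterns, measure):
    # Assign the sample to the known pattern with the highest similarity.
    return max(patterns, key=lambda name: measure(sample, patterns[name]))

# Hypothetical membership grades over three attributes.
patterns = {"mineral_A": [0.9, 0.2, 0.7], "mineral_B": [0.3, 0.8, 0.4]}
sample = [0.8, 0.3, 0.6]
best_jaccard = recognize(sample, patterns, jaccard_similarity)
best_cosine = recognize(sample, patterns, cosine_similarity)
```

Under these assumed grades both measures pick the same pattern; in general different similarity measures can rank patterns differently, which is one reason to compare several of them.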
Citations: 0
EMD-based ultraviolet radiation prediction for sport events recommendation with environmental constraint
IF 8.1 Zone 1 Computer Science 0 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2024-10-28 DOI: 10.1016/j.ins.2024.121592
Ping Liu, Yazhou Song, Junjie Hou, Yanwei Xu
With rising awareness of health and wellness, accurate ultraviolet (UV) radiation forecasts have become crucial for planning and conducting outdoor activities safely, particularly when arranging and recommending global sporting events under definite constraints on environmental conditions. The dynamic nature of UV exposure, influenced by factors such as solar zenith angle, cloud cover, and atmospheric conditions, makes accurate UV radiation forecasting challenging. To cope with these challenges, we present an approach for predicting the UV radiation levels of a given region during a given time period using Empirical Mode Decomposition (EMD), a robust method for analyzing non-linear and non-stationary data. Our model is specifically designed for urban areas, where outdoor events are common, and integrates meteorological data with historical UV radiation records from specific geographic regions and time periods. The EMD-based model offers precise predictions of UV levels, which event organizers and city planners need to make informed decisions about scheduling, relocating, and recommending events so as to minimize the health risks associated with UV exposure. Finally, the effectiveness of the model is validated through experiments across different spatial and temporal contexts on the Urban-Air dataset (2,891,393 Air Quality Index records covering four major Chinese cities), demonstrating its adaptability and accuracy under diverse environmental conditions.
Citations: 0
Detecting fuzzy-rough conditional anomalies
IF 8.1 Zone 1 Computer Science 0 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2024-10-28 DOI: 10.1016/j.ins.2024.121560
Qian Hu, Zhong Yuan, Jusheng Mi, Jun Zhang
The purpose of conditional anomaly detection is to identify samples that significantly deviate from the majority of other samples under specific conditions within a dataset. It has been successfully applied to numerous practical scenarios such as forest fire prevention, gas well leakage detection, and remote sensing data analysis. To address conditional anomaly detection, this paper uses fuzzy rough set theory to construct a conditional anomaly detection method that can effectively handle numerical or mixed attribute data. By defining the fuzzy inner boundary, the subset of contextual data is first divided into two parts: the fuzzy lower approximation and the fuzzy inner boundary. Subsequently, the fuzzy inner boundary is further divided into two distinct segments: the fuzzy abnormal boundary and the fuzzy main boundary. This yields three-way regions: the fuzzy abnormal boundary, the fuzzy main boundary, and the fuzzy lower approximation. Then, a fuzzy-rough conditional anomaly detection model is constructed based on these three-way regions. Finally, a related algorithm is proposed for the detection model and its effectiveness is verified by data experiments.
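The split into a fuzzy lower approximation and a fuzzy inner boundary can be illustrated with the textbook fuzzy rough lower approximation, (R↓A)(x) = min over y of max(1 − R(x, y), A(y)). The sketch below is not the paper's model: the similarity relation `similarity`, the sample values, and the use of A(x) − (R↓A)(x) as a stand-in for an inner-boundary degree are all assumptions made for illustration, since the abstract does not give the formulas.

```python
def similarity(u, v):
    # Assumed fuzzy similarity relation on one attribute scaled to [0, 1].
    return max(0.0, 1.0 - abs(u - v))

def lower_approximation(values, membership):
    # (R↓A)(x) = min_y max(1 - R(x, y), A(y))
    return [
        min(max(1.0 - similarity(x, y), a_y) for y, a_y in zip(values, membership))
        for x in values
    ]

# Hypothetical data: five samples, the first three belong to the context subset A.
values = [0.10, 0.12, 0.15, 0.14, 0.90]
membership = [1.0, 1.0, 1.0, 0.0, 0.0]

lower = lower_approximation(values, membership)
# Assumed inner-boundary degree: how far each sample falls short of certain membership.
inner_gap = [a - l for a, l in zip(membership, lower)]
```

Here the three members of A sit very close to a non-member (0.14), so their lower-approximation degrees are small and most of their membership lands in the boundary part.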
Citations: 0
Developing Big Data anomaly dynamic and static detection algorithms: AnomalyDSD spark package
IF 8.1 Zone 1 Computer Science 0 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2024-10-26 DOI: 10.1016/j.ins.2024.121587
Diego García-Gil, David López, Daniel Argüelles-Martino, Jacinto Carrasco, Ignacio Aguilera-Martos, Julián Luengo, Francisco Herrera

Background

Anomaly detection is the process of identifying observations that differ greatly from the majority of the data. Unsupervised anomaly detection aims to find outliers in unlabeled data, so the anomalous instances are unknown. Exponential data generation has led to the era of Big Data. This scenario brings new challenges to classic anomaly detection problems due to the massive and unsupervised accumulation of data. Traditional methods cannot cope with the computing and time requirements of Big Data problems.

Methods

In this paper, we propose four distributed algorithm designs for Big Data anomaly detection problems: HBOS_BD, LODA_BD, LSCP_BD, and XGBOD_BD. They have been designed following the MapReduce distributed methodology in order to be capable of handling Big Data problems.

Results

These algorithms have been integrated into a Spark package, AnomalyDSD, focused on static and dynamic Big Data anomaly detection tasks. Experiments on a real-world case study have shown the performance and validity of the proposals for Big Data problems.

Conclusions

With this proposal, we have enabled the practitioner to efficiently and effectively detect anomalies in Big Data datasets, where the early detection of an anomaly can lead to a proper and timely decision.
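To make the MapReduce idea in the Methods paragraph concrete, here is a minimal single-feature sketch of a histogram-based outlier score computed in a map/reduce style. It uses plain Python lists as stand-ins for data partitions and does not use Spark or the AnomalyDSD API. The function names are hypothetical, and the scoring (negative log of the relative bin frequency) is a simplified variant of HBOS, not necessarily what HBOS_BD does.

```python
import math
from functools import reduce

def bin_index(x, lo, hi, n_bins):
    # Map a value to a fixed-width bin; the maximum falls in the last bin.
    if hi == lo:
        return 0
    i = int((x - lo) / (hi - lo) * n_bins)
    return min(max(i, 0), n_bins - 1)

def map_counts(partition, lo, hi, n_bins):
    # "Map" step: each partition builds its own local histogram.
    counts = [0] * n_bins
    for x in partition:
        counts[bin_index(x, lo, hi, n_bins)] += 1
    return counts

def reduce_counts(a, b):
    # "Reduce" step: local histograms are merged by element-wise addition.
    return [x + y for x, y in zip(a, b)]

def hbos_scores(partitions, n_bins=5):
    # A first pass finds the global range so all partitions share bin edges.
    lo = min(min(p) for p in partitions)
    hi = max(max(p) for p in partitions)
    counts = reduce(reduce_counts, (map_counts(p, lo, hi, n_bins) for p in partitions))
    total = sum(counts)

    def score(x):
        density = counts[bin_index(x, lo, hi, n_bins)] / total
        return math.log(1.0 / density)  # rarer bins give higher scores

    return [[score(x) for x in p] for p in partitions]

# Hypothetical data split into two non-empty partitions.
partitions = [[1.0, 1.1, 0.9, 1.2], [1.05, 0.95, 9.0]]
scores = hbos_scores(partitions)
```

Because histograms merge by simple addition, each partition can be processed independently, which is what makes this kind of detector a natural fit for distributed execution.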
Citations: 0
Significance-based decision tree for interpretable categorical data clustering
IF 8.1 Zone 1 Computer Science 0 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2024-10-26 DOI: 10.1016/j.ins.2024.121588
Lianyu Hu, Mudi Jiang, Xinying Liu, Zengyou He
Numerous clustering algorithms prioritize accuracy, but in high-risk domains, the interpretability of clustering methods is crucial as well. The inherent heterogeneity of categorical data makes it particularly challenging for users to comprehend clustering outcomes. Currently, the majority of interpretable clustering methods are tailored for numerical data and utilize decision tree models, leaving interpretable clustering for categorical data as a less explored domain. Additionally, existing interpretable clustering algorithms often depend on external, potentially non-interpretable algorithms and lack transparency in the decision-making process during tree construction. In this paper, we tackle the problem of interpretable categorical data clustering by growing a decision tree in a statistically meaningful manner. We formulate the evaluation of candidate splits as a multivariate two-sample testing problem, where a single p-value is derived by combining significance evidence from all individual categories. This approach provides a reliable and controllable method for selecting the optimal split while determining its statistical significance. Extensive experimental results on real-world data sets demonstrate that our algorithm achieves comparable performance in terms of cluster quality, running efficiency, and explainability relative to its counterparts.
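The abstract above derives a single p-value per candidate split by combining evidence from individual categories, without naming the combination rule. One standard option is Fisher's method, sketched below as an assumption rather than as the authors' procedure. The statistic is X = −2 Σ ln p_i, which follows a chi-square distribution with 2k degrees of freedom under the null hypothesis when the k tests are independent. The candidate names `color` and `shape` and their p-values are hypothetical.

```python
import math

def fisher_combined_p(p_values):
    # Fisher's method: X = -2 * sum(ln p_i) ~ chi-square with 2k degrees of freedom.
    # For even degrees of freedom the survival function has a closed form:
    # P(X > x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # equals X / 2
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total

# Hypothetical per-category p-values (each in (0, 1]) for two candidate splits.
candidates = {
    "color": [0.01, 0.20, 0.03],
    "shape": [0.40, 0.55, 0.30],
}
combined = {name: fisher_combined_p(ps) for name, ps in candidates.items()}
best_split = min(combined, key=combined.get)
```

A tree-growing procedure of this kind could stop splitting when the best combined p-value exceeds a chosen significance level, which gives an explicit, controllable stopping rule.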
Citations: 0
A novel self-training framework for semi-supervised soft sensor modeling based on indeterminate variational autoencoder
IF 8.1 Zone 1 Computer Science 0 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2024-10-26 DOI: 10.1016/j.ins.2024.121565
Hengqian Wang, Lei Chen, Kuangrong Hao, Xin Cai, Bing Wei
In modern industrial processes, the high acquisition cost of labeled data can lead to a large number of unlabeled samples, which greatly reduces the accuracy of traditional soft sensor models. To this end, this paper proposes a novel semi-supervised soft sensor framework that fully utilizes the unlabeled data to expand the original labeled data and ultimately improve the prediction accuracy. Specifically, an indeterminate variational autoencoder (IVAE) is first proposed to obtain pseudo-labels and their uncertainties for unlabeled data. On this basis, an IVAE-based self-training (ST-IVAE) framework is proposed to expand the original small labeled dataset through repeated self-training cycles. Within this framework, a variance-based oversampling (VOS) strategy is introduced to better utilize the pseudo-label uncertainty. By determining similar sample sets through a comparison of the Kullback-Leibler (KL) divergences obtained by the proposed IVAE model, each sample can be independently modeled for prediction. The effectiveness of the proposed semi-supervised framework is verified on two real industrial processes. Comparative results illustrate that, relative to state-of-the-art methods for semi-supervised soft sensing, the ST-IVAE framework still predicts well even in the presence of missing input data.
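The abstract above selects similar sample sets by comparing KL divergences. As a hedged illustration, the sketch below assumes each sample is summarized by a one-dimensional Gaussian (mean, standard deviation), for which the KL divergence has the closed form KL(p‖q) = ln(σ_q/σ_p) + (σ_p² + (μ_p − μ_q)²) / (2σ_q²) − 1/2. The Gaussian summaries, the sample names, and the helper `most_similar` are assumptions, not details taken from the paper.

```python
import math

def kl_gaussian(mu_p, sigma_p, mu_q, sigma_q):
    # Closed-form KL(p || q) for two univariate Gaussians.
    return (
        math.log(sigma_q / sigma_p)
        + (sigma_p ** 2 + (mu_p - mu_q) ** 2) / (2.0 * sigma_q ** 2)
        - 0.5
    )

def most_similar(query, candidates, k):
    # Rank candidates by divergence from the query and keep the k closest.
    ranked = sorted(candidates, key=lambda name: kl_gaussian(*query, *candidates[name]))
    return ranked[:k]

# Hypothetical (mean, std) summaries of a query sample and three labeled samples.
query = (0.0, 1.0)
candidates = {"s1": (0.1, 1.0), "s2": (2.0, 1.0), "s3": (0.0, 1.2)}
neighbors = most_similar(query, candidates, k=2)
```

KL divergence is asymmetric, so the direction (query to candidate or the reverse) is a design choice; the sketch fixes one direction purely for illustration.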
Citations: 0
Some notes on the consequences of pretreatment of multivariate data
IF 8.1 Zone 1 Computer Science 0 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2024-10-24 DOI: 10.1016/j.ins.2024.121580
Ali S. Hadi, Rida Moustafa
With the advent of data technologies, we have various types of data, such as structured, unstructured, and semi-structured. Performing certain statistical or machine learning techniques may require careful preprocessing or pretreatment of the data to make them suitable for analysis. For example, given a data matrix X, which represents n multivariate observations or cases on p variables or features, the columns/rows of X may be pretreated before applying statistical or machine learning techniques to the data. While centering and/or scaling the variables does not alter the correlation structure or the graphical representation of the data, centering/scaling the observations does. We investigate various row pretreatment methods more closely and show with theoretical proofs and numerical examples that centering/scaling the rows of X changes both the graphical structure of the observations in the multi-dimensional space and the correlation structure among the variables. There may be good reasons for performing row centering/scaling on the data and we are not against it, but analysts who use such row operations should be aware of how they change the geometrical and correlation structures of the data, and should also demonstrate that the process results in a new, more appropriate structure for their questions.
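The claim above can be checked numerically on a small example. The sketch below uses a hypothetical 4 × 2 matrix and hand-written helpers; it shows that column centering leaves the correlation between the two variables unchanged, whereas row centering changes it. With only p = 2 variables, row centering maps each row (a, b) to (−(b − a)/2, (b − a)/2), so the two columns become exact negatives of each other and the correlation becomes −1.

```python
import math

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def column(matrix, j):
    return [row[j] for row in matrix]

def center_columns(matrix):
    # Subtract each variable's mean (column pretreatment).
    means = [sum(column(matrix, j)) / len(matrix) for j in range(len(matrix[0]))]
    return [[v - m for v, m in zip(row, means)] for row in matrix]

def center_rows(matrix):
    # Subtract each observation's mean (row pretreatment).
    return [[v - sum(row) / len(row) for v in row] for row in matrix]

# Hypothetical data: n = 4 observations on p = 2 variables.
X = [[1.0, 2.0], [2.0, 4.5], [3.0, 5.0], [4.0, 9.0]]

r_orig = pearson(column(X, 0), column(X, 1))
Xc = center_columns(X)
r_col = pearson(column(Xc, 0), column(Xc, 1))
Xr = center_rows(X)
r_row = pearson(column(Xr, 0), column(Xr, 1))
```

In this example a strong positive correlation in the raw data turns into a perfect negative one after row centering, while column centering has no effect on it.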
Citations: 0
A diversity and reliability-enhanced synthetic minority oversampling technique for multi-label learning
IF 8.1 Zone 1 Computer Science 0 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2024-10-24 DOI: 10.1016/j.ins.2024.121579
Yanlu Gong, Quanwang Wu, Mengchu Zhou, Chao Chen
The class imbalance issue is generally intrinsic to multi-label datasets because they have a large number of labels and each sample is associated with only a few of them. This causes the trained multi-label classifier to be biased towards the majority labels. Multi-label oversampling methods have been proposed to handle this issue, and they fall into clone-based and Synthetic Minority Oversampling TEchnique-based (SMOTE-based) ones. However, the former duplicate minority samples and may result in over-fitting, whereas the latter may generate unreliable synthetic samples. In this work, we propose a Diversity and Reliability-enhanced SMOTE for multi-label learning (DR-SMOTE). In DR-SMOTE, the minority classes are determined according to their label imbalance ratios. A reliable minority sample is used as a seed to generate a synthetic one, while a reference sample is selected for it to confine the synthesis region. Features of the synthetic samples are determined probabilistically in this region and their labels are set identically to those of the seeds. We carry out experiments with eleven multi-label datasets to compare DR-SMOTE against seven existing resampling methods based on four base multi-label classifiers. The experimental results demonstrate DR-SMOTE’s superiority over its peers in terms of several evaluation metrics.
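The two ingredients mentioned above, label imbalance ratios and seed-based synthesis, can be sketched as follows. This is not DR-SMOTE itself: the imbalance ratio follows the common convention (count of the most frequent label divided by the count of each label), a label is treated as minority when its ratio exceeds the mean ratio (an assumption), and the synthesis step is plain SMOTE interpolation between a seed and a reference sample rather than the paper's probabilistic rule. All names and data are hypothetical.

```python
import random

def imbalance_ratios(label_rows):
    # Ratio of the most frequent label's count to each label's count.
    n_labels = len(label_rows[0])
    counts = [sum(row[j] for row in label_rows) for j in range(n_labels)]
    top = max(counts)
    return [top / c if c else float("inf") for c in counts]

def synthesize(seed_x, ref_x, seed_labels, rng):
    # Plain SMOTE step: pick a random point on the segment from seed to reference.
    u = rng.random()
    features = [s + u * (r - s) for s, r in zip(seed_x, ref_x)]
    return features, list(seed_labels)  # labels are copied from the seed

# Hypothetical dataset: four samples, two labels.
X = [[0.1, 0.2], [0.2, 0.1], [0.8, 0.9], [0.3, 0.3]]
Y = [[1, 0], [1, 0], [1, 1], [1, 0]]

ratios = imbalance_ratios(Y)
mean_ratio = sum(ratios) / len(ratios)
minority_labels = [j for j, r in enumerate(ratios) if r > mean_ratio]

rng = random.Random(0)
seed_index, ref_index = 2, 3  # sample 2 carries the minority label
new_x, new_y = synthesize(X[seed_index], X[ref_index], Y[seed_index], rng)
```

Copying the seed's labels matches the abstract; how the seed and reference are chosen for reliability is where the actual method differs from this sketch.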
Citations: 0
Sample feature enhancement model based on heterogeneous graph representation learning for few-shot relation classification
IF 8.1 Zone 1 Computer Science 0 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2024-10-24 DOI: 10.1016/j.ins.2024.121583
Zhezhe Xing, Yuxin Ye, Rui Song, Yun Teng, Ziheng Li, Jiawen Liu
Few-Shot Relation Classification (FSRC) aims to predict novel relationships by learning from limited samples. Graph Neural Network (GNN) approaches for FSRC construct data as graphs, effectively capturing sample features through graph representation learning. However, they often face several challenges: 1) They tend to neglect the interactions between samples from different support sets and overlook the implicit noise in labels, leading to sub-optimal sample feature generation. 2) They struggle to deeply mine the diverse semantic information present in FSRC data. 3) Over-smoothing and overfitting limit the model's depth and adversely affect overall performance. To address these issues, we propose a Sample Representation Enhancement model based on Heterogeneous Graph Neural Network (SRE-HGNN) for FSRC. This method leverages inter-sample and inter-class associations (i.e., label mutual attention) to effectively fuse features and generate more expressive sample representations. Edge-heterogeneous GNNs are employed to enhance sample features by capturing heterogeneous information of varying depths through different edge attentions. Additionally, we introduce an attention-based neighbor node culling method, enabling the model to stack more layers and extract deeper inter-sample associations, thereby improving performance. Finally, experiments are conducted on the FSRC task, and SRE-HGNN achieves average accuracy improvements of 1.84% and 1.02% on two public datasets, respectively.
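The attention-based neighbor culling mentioned in the abstract can be illustrated with a small sketch. This is not the SRE-HGNN implementation: the abstract does not say how attention is scored, so the code assumes dot-product scores normalized with a softmax and keeps the `keep` highest-weighted neighbors. The names `attention_weights` and `cull_neighbors` are hypothetical.

```python
import math


def attention_weights(node, neighbors):
    """Softmax over dot-product scores between a node and its neighbors.

    Assumption: dot-product attention is used purely for illustration.
    """
    scores = [sum(a * b for a, b in zip(node, nb)) for nb in neighbors]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cull_neighbors(node, neighbors, keep):
    """Return the indices of the `keep` neighbors with the highest attention."""
    weights = attention_weights(node, neighbors)
    ranked = sorted(range(len(neighbors)), key=lambda i: weights[i], reverse=True)
    return sorted(ranked[:keep])


if __name__ == "__main__":
    node = [1.0, 0.0]
    neighbors = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 0.0]]
    print(cull_neighbors(node, neighbors, keep=2))  # prints [0, 2]
```

Dropping low-attention neighbors before aggregation reduces how much unrelated information is mixed into each node per layer, which is one plausible reason culling could ease over-smoothing in deeper stacks.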
Citations: 0
SGO: An innovative oversampling approach for imbalanced datasets using SVM and genetic algorithms
IF 8.1 Zone 1 Computer Science 0 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2024-10-24 DOI: 10.1016/j.ins.2024.121584
Jianfeng Deng, Dongmei Wang, Jinan Gu, Chen Chen
Imbalanced datasets present a challenging problem in machine learning and artificial intelligence. Since most models assume balanced data distributions, imbalanced positive and negative examples can lead to significant bias in prediction or classification tasks. Current oversampling methods frequently encounter issues such as overfitting and boundary bias. This paper proposes a novel imbalanced data augmentation technique called SVM-GA oversampling (SGO), which integrates Support Vector Machines (SVM) with Genetic Algorithms (GA). Our approach leverages SVM to identify the decision boundary and uses GA to generate new minority samples along this boundary, effectively addressing both overfitting and boundary bias. Experiments show that SGO outperforms traditional methods on most datasets, offering a novel and effective approach to imbalanced data problems with potential for application and generalization.
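As a rough illustration of generating minority samples near a decision boundary with a genetic search, the sketch below evolves candidates toward a given linear boundary. It is a toy under stated assumptions, not the SGO method: the boundary `w`, `b` is supplied by hand rather than learned by an SVM, the minority side is assumed to be where `w·x + b > 0`, and the blend crossover, Gaussian mutation, and elitist selection are generic GA choices. The names `decision` and `evolve` are hypothetical.

```python
import random


def decision(w, b, x):
    """Signed score of x under a linear boundary w.x + b (assumed given)."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def evolve(minority, w, b, n_new, generations=30, sigma=0.1, seed=0):
    """Evolve candidate minority samples toward the boundary.

    Assumption: points with decision > 0 lie on the minority side, and a
    smaller positive score means closer to the boundary (fitter).
    """
    rng = random.Random(seed)
    pop = [list(x) for x in minority if decision(w, b, x) > 0]
    if len(pop) < 2:
        raise ValueError("need at least two minority points on the positive side")
    size = len(pop)
    for _ in range(generations):
        children = []
        for _ in range(size):
            p, q = rng.sample(pop, 2)
            t = rng.random()
            # Blend crossover between two parents plus Gaussian mutation.
            child = [pi + t * (qi - pi) + rng.gauss(0.0, sigma)
                     for pi, qi in zip(p, q)]
            children.append(child)
        # Elitist selection: parents compete with children, and candidates
        # that crossed to the other side of the boundary are discarded.
        pool = [x for x in pop + children if decision(w, b, x) > 0]
        pool.sort(key=lambda x: decision(w, b, x))
        pop = pool[:size]
    return pop[:n_new]


if __name__ == "__main__":
    minority = [[2.0, 2.0], [3.0, 1.0], [2.5, 3.0]]
    for point in evolve(minority, w=[1.0, 1.0], b=-1.0, n_new=2):
        print(point, decision([1.0, 1.0], -1.0, point))
```

Because fitness here rewards only closeness to the boundary, the population tends to lose diversity over generations; a practical variant would need an additional diversity term or an early stop, which this sketch omits.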
Citations: 0