Process mining (PM) comprises a variety of methods for discovering information about processes from their execution logs. Some of them, such as trace clustering, trace classification, and anomalous trace detection, require a preliminary preprocessing step in which the raw data is encoded into a numerical feature space. To this end, encoding techniques are used to generate vectorial representations of process traces. Most of the PM literature provides trace encoding techniques that look only at the control flow, that is, they encode only the sequence of activities that characterizes a process trace, disregarding other process data that is fundamental for effectively describing process behavior. To fill this gap, in this article we present 19 trace encoding methods that work in a multi‐perspective manner, that is, by embedding event and trace attributes in addition to activity names into the vectorial representations of process traces. We also provide an extensive experimental study in which these techniques are applied to real‐life datasets and compared to each other.
{"title":"Trace Encoding Techniques for Multi‐Perspective Process Mining: A Comparative Study","authors":"Antonino Rullo, Farhana Alam, Edoardo Serra","doi":"10.1002/widm.1573","DOIUrl":"https://doi.org/10.1002/widm.1573","url":null,"abstract":"Process mining (PM) comprises a variety of methods for discovering information about processes from their execution logs. Some of them, such as trace clustering, trace classification, and anomalous trace detection require a preliminary preprocessing step in which the raw data is encoded into a numerical feature space. To this end, encoding techniques are used to generate vectorial representations of process traces. Most of the PM literature provides trace encoding techniques that look at the control flow, that is, only encode the sequence of activities that characterize a process trace disregarding other process data that is fundamental for effectively describing the process behavior. To fill this gap, in this article we show 19 trace encoding methods that work in a multi‐perspective manner, that is, by embedding events and trace attributes in addition to activity names into the vectorial representations of process traces. We also provide an extensive experimental study where these techniques are applied to real‐life datasets and compared to each other.","PeriodicalId":501013,"journal":{"name":"WIREs Data Mining and Knowledge Discovery","volume":"21 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-12-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142804570","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In recent years, machine learning (ML) has witnessed a paradigm shift in kernel function selection, which is pivotal in optimizing various ML models. Despite multiple studies on its significance, a comprehensive understanding of kernel function selection, particularly with respect to model performance, has yet to be established. Challenges remain in selecting and optimizing kernel functions to improve model performance and efficiency. The study investigates how the gamma and cost parameters influence performance metrics in multi‐class classification tasks using various kernel‐based algorithms. Through sensitivity analysis, the impact of these parameters on classification performance and computational efficiency is assessed. The experimental setup involves deploying ML models using four kernel‐based algorithms: Support Vector Machine, Radial Basis Function, Polynomial Kernel, and Sigmoid Kernel. Data preparation includes text processing, categorization, and feature extraction using TfidfVectorizer, followed by model training and validation. Results indicate that a Support Vector Machine with default settings and a Radial Basis Function kernel consistently outperforms the polynomial and sigmoid kernels. Adjusting gamma improves model accuracy and precision, highlighting its role in capturing complex relationships. Regularization cost parameters, however, show minimal impact on performance. The study also reveals that configurations with moderate gamma values achieve a better balance between performance and computational time compared to higher gamma values or no gamma adjustment. The findings underscore the delicate balance between model performance and computational efficiency by highlighting the trade‐offs between model complexity and efficiency.
{"title":"Hyper‐Parameter Optimization of Kernel Functions on Multi‐Class Text Categorization: A Comparative Evaluation","authors":"Michael Loki, Agnes Mindila, Wilson Cheruiyot","doi":"10.1002/widm.1572","DOIUrl":"https://doi.org/10.1002/widm.1572","url":null,"abstract":"In recent years, machine learning (ML) has witnessed a paradigm shift in kernel function selection, which is pivotal in optimizing various ML models. Despite multiple studies about its significance, a comprehensive understanding of kernel function selection, particularly about model performance, still needs to be explored. Challenges remain in selecting and optimizing kernel functions to improve model performance and efficiency. The study investigates how gamma parameter and cost parameter influence performance metrics in multi‐class classification tasks using various kernel‐based algorithms. Through sensitivity analysis, the impact of these parameters on classification performance and computational efficiency is assessed. The experimental setup involves deploying ML models using four kernel‐based algorithms: Support Vector Machine, Radial Basis Function, Polynomial Kernel, and Sigmoid Kernel. Data preparation includes text processing, categorization, and feature extraction using TfidfVectorizer, followed by model training and validation. Results indicate that Support Vector Machine with default settings and Radial Basis Function kernel consistently outperforms polynomial and sigmoid kernels. Adjusting gamma improves model accuracy and precision, highlighting its role in capturing complex relationships. Regularization cost parameters, however, show minimal impact on performance. The study also reveals that configurations with moderate gamma values achieve better balance between performance and computational time compared to higher gamma values or no gamma adjustment. The findings underscore the delicate balance between model performance and computational efficiency by highlighting the trade‐offs between model complexity and efficiency.","PeriodicalId":501013,"journal":{"name":"WIREs Data Mining and Knowledge Discovery","volume":"84 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-11-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142753720","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
To improve data analysis and feature learning, this study compares the effectiveness of quantum dimensionality reduction (qDR) techniques with that of classical ones. We investigate several qDR techniques, namely quantum Gaussian distribution adaptation (qGDA), quantum principal component analysis (qPCA), quantum linear discriminant analysis (qLDA), and quantum t‐SNE (qt‐SNE), on a variety of datasets. The Olivetti Faces, Wine, Breast Cancer, Digits, and Iris datasets are used in this investigation. The effectiveness of these techniques is assessed through comparative evaluations against well‐established classical approaches, such as classical PCA (cPCA), classical LDA (cLDA), and classical GDA (cGDA), using well‐established metrics like loss, fidelity, and processing time. The findings show that cPCA produced positive results with the lowest loss and highest fidelity when used on the Iris dataset. On the other hand, quantum uniform manifold approximation and projection (qUMAP) performs well and shows strong fidelity when tested on the Wine dataset, while ct‐SNE shows mediocre performance on the Digits dataset. Isomap and locally linear embedding (LLE) perform differently depending on the dataset. Notably, LLE showed the largest loss and lowest fidelity on the Olivetti Faces dataset. The hypothesis testing findings showed that the qDR strategies did not significantly outperform the classical techniques in terms of maintaining pertinent information from quantum datasets. More specifically, the outcomes of paired t‐tests show that, in terms of the ability to capture complex patterns, there are no statistically significant differences between cPCA and qPCA, cLDA and qLDA, or cGDA and qGDA. According to the assessments of mutual information (MI) and clustering accuracy, qPCA may be able to recognize patterns more clearly than standardized cPCA. Nevertheless, there is no discernible improvement of the qLDA and qGDA approaches over their classical counterparts.
{"title":"Dimensionality Reduction for Data Analysis With Quantum Feature Learning","authors":"Shyam R. Sihare","doi":"10.1002/widm.1568","DOIUrl":"https://doi.org/10.1002/widm.1568","url":null,"abstract":"To improve data analysis and feature learning, this study compares the effectiveness of quantum dimensionality reduction (qDR) techniques to classical ones. In this study, we investigate several qDR techniques on a variety of datasets such as quantum Gaussian distribution adaptation (qGDA), quantum principal component analysis (qPCA), quantum linear discriminant analysis (qLDA), and quantum t‐SNE (qt‐SNE). The Olivetti Faces, Wine, Breast Cancer, Digits, and Iris are among the datasets used in this investigation. Through comparison evaluations against well‐established classical approaches, such as classical PCA (cPCA), classical LDA (cLDA), and classical GDA (cGDA), and using well‐established metrics like loss, fidelity, and processing time, the effectiveness of these techniques is assessed. The findings show that cPCA produced positive results with the lowest loss and highest fidelity when used on the Iris dataset. On the other hand, quantum uniform manifold approximation and projection (qUMAP) performs well and shows strong fidelity when tested against the Wine dataset, but ct‐SNE shows mediocre performance against the Digits dataset. Isomap and locally linear embedding (LLE) function differently depending on the dataset. Notably, LLE showed the largest loss and lowest fidelity on the Olivetti Faces dataset. The hypothesis testing findings showed that the qDR strategies did not significantly outperform the classical techniques in terms of maintaining pertinent information from quantum datasets. More specifically, the outcomes of paired <jats:italic>t</jats:italic>‐tests show that when it comes to the ability to capture complex patterns, there are no statistically significant differences between the cPCA and qPCA, the cLDA and qLDA, and the cGDA and qGDA. According to the findings of the assessments of mutual information (MI) and clustering accuracy, qPCA may be able to recognize patterns more clearly than standardized cPCA. Nevertheless, there is no discernible improvement between the qLDA and qGDA approaches and their classical counterparts.","PeriodicalId":501013,"journal":{"name":"WIREs Data Mining and Knowledge Discovery","volume":"71 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-11-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142678437","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In customer‐oriented systems, customer lifetime value (CLV) has been of significant importance for academia and marketing practitioners, especially within the scope of analytical modeling. CLV is a critical approach to managing and organizing a company's profitability. With the vast availability of consumer data, business analytics (BA) tools and approaches, alongside CLV models, have been applied to gain deeper insights into customer behaviors and decision‐making processes. Despite the recognized importance of CLV, there is a noticeable gap in comprehensive analyses and reviews of BA techniques applied to CLV. This study aims to fill this gap by conducting a thorough survey of the state‐of‐the‐art investigations on CLV models integrated with BA approaches, thereby contributing to a research agenda in this field. The review methodology consists of three main steps: identification of relevant studies, creating a coding plan, and ensuring coding reliability. First, relevant studies were identified using predefined keywords. Next, a coding plan—one of the study's significant contributions—was developed to evaluate these studies comprehensively. Finally, the coding plan's reliability was tested by three experts before being applied to the selected studies. Additionally, specific evaluation criteria in the coding plan were implemented to introduce new insights. This study presents exciting and valuable results from various perspectives, providing a crucial reference for academic researchers and marketing practitioners interested in the intersection of BA and CLV.
{"title":"Business Analytics in Customer Lifetime Value: An Overview Analysis","authors":"Onur Dogan, Abdulkadir Hiziroglu, Ali Pisirgen, Omer Faruk Seymen","doi":"10.1002/widm.1571","DOIUrl":"https://doi.org/10.1002/widm.1571","url":null,"abstract":"In customer‐oriented systems, customer lifetime value (CLV) has been of significant importance for academia and marketing practitioners, especially within the scope of analytical modeling. CLV is a critical approach to managing and organizing a company's profitability. With the vast availability of consumer data, business analytics (BA) tools and approaches, alongside CLV models, have been applied to gain deeper insights into customer behaviors and decision‐making processes. Despite the recognized importance of CLV, there is a noticeable gap in comprehensive analyses and reviews of BA techniques applied to CLV. This study aims to fill this gap by conducting a thorough survey of the state‐of‐the‐art investigations on CLV models integrated with BA approaches, thereby contributing to a research agenda in this field. The review methodology consists of three main steps: identification of relevant studies, creating a coding plan, and ensuring coding reliability. First, relevant studies were identified using predefined keywords. Next, a coding plan—one of the study's significant contributions—was developed to evaluate these studies comprehensively. Finally, the coding plan's reliability was tested by three experts before being applied to the selected studies. Additionally, specific evaluation criteria in the coding plan were implemented to introduce new insights. This study presents exciting and valuable results from various perspectives, providing a crucial reference for academic researchers and marketing practitioners interested in the intersection of BA and CLV.","PeriodicalId":501013,"journal":{"name":"WIREs Data Mining and Knowledge Discovery","volume":"37 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142594489","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dissolution refers to the process in which solvent molecules and solute molecules attract and combine with each other. The extensive solubility data generated from the dissolution of various compounds under different conditions is distributed across structured or semi‐structured formats in various media, such as text, web pages, tables, images, and databases. These data exhibit multi‐source and unstructured features, aligning with the typical 5 V characteristics of big data. A solubility big data technology system has emerged from the fusion of solubility data and big data technologies. However, the acquisition, fusion, storage, representation, and utilization of solubility big data are encountering new challenges. Knowledge graphs, known as extensive systems for representing and applying knowledge, can effectively describe entities, concepts, and relations across diverse domains. The construction of a solubility big data knowledge graph holds substantial value for the retrieval, analysis, utilization, and visualization of solubility knowledge. As an initial contribution intended to stimulate further work, this paper focuses on the solubility big data knowledge graph and, firstly, summarizes the architecture of solubility knowledge graph construction. Secondly, key technologies such as knowledge extraction, knowledge fusion, and knowledge reasoning for solubility big data are emphasized, along with a summary of the common machine learning methods used in knowledge graph construction. Furthermore, this paper explores application scenarios, such as knowledge question answering and recommender systems for solubility big data. Finally, it presents a prospective view of the shortcomings, challenges, and future directions related to the construction of the solubility big data knowledge graph. This article proposes the research direction of the solubility big data knowledge graph, which can provide technical references for constructing a solubility knowledge graph. At the same time, it serves as a comprehensive medium for describing data, resources, and their applications across diverse fields such as chemistry, materials, biology, energy, and medicine. It further aids in knowledge retrieval and mining, analysis and utilization, and visualization across various disciplines.
{"title":"Knowledge Graph for Solubility Big Data: Construction and Applications","authors":"Xiao Haiyang, Yan Ruomei, Wu Yan, Guan Lixin, Li Mengshan","doi":"10.1002/widm.1570","DOIUrl":"https://doi.org/10.1002/widm.1570","url":null,"abstract":"Dissolution refers to the process in which solvent molecules and solute molecules attract and combine with each other. The extensive solubility data generated from the dissolution of various compounds under different conditions, is distributed across structured or semi‐structured formats in various media, such as text, web pages, tables, images, and databases. These data exhibit multi‐source and unstructured features, aligning with the typical 5 V characteristics of big data. A solubility big data technology system has emerged under the fusion of solubility data and big data technologies. However, the acquisition, fusion, storage, representation, and utilization of solubility big data are encountering new challenges. Knowledge Graphs, known as extensive systems for representing and applying knowledge, can effectively describe entities, concepts, and relations across diverse domains. The construction of solubility big data knowledge graph holds substantial value in the retrieval, analysis, utilization, and visualization of solubility knowledge. Throwing out a brick to attract a jade, this paper focuses on the solubility big data knowledge graph and, firstly, summarizes the architecture of solubility knowledge graph construction. Secondly, the key technologies such as knowledge extraction, knowledge fusion, and knowledge reasoning of solubility big data are emphasized, along with summarizing the common machine learning methods in knowledge graph construction. Furthermore, this paper explores application scenarios, such as knowledge question answering and recommender systems for solubility big data. Finally, it presents a prospective view of the shortcomings, challenges, and future directions related to the construction of solubility big data knowledge graph. This article proposes the research direction of solubility big data knowledge graph, which can provide technical references for constructing a solubility knowledge graph. At the same time, it serves as a comprehensive medium for describing data, resources, and their applications across diverse fields such as chemistry, materials, biology, energy, medicine, and so on. It further aids in knowledge retrieval and mining, analysis and utilization, and visualization across various disciplines.","PeriodicalId":501013,"journal":{"name":"WIREs Data Mining and Knowledge Discovery","volume":"61 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142563113","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Soft computing (SC) is a collective methodology that touches all engineering and technology fields owing to the ease with which it solves various problems compared with conventional methods. Many analytical problems once handled by conventional methods are now resolved accurately by soft computing techniques, marking a paradigm shift. The flexibility of soft computing enables swift knowledge acquisition and processing, and the resulting information supply renders technological systems versatile and affordable. Besides, the accuracy with which soft computing techniques predict parameters has transformed industrial productivity to a whole new level. This article focuses on versatile applications of SC methods to forecast technological changes that are set to reorient the progress of various industries, as ascertained by a patent landscape analysis. The patent landscape reveals the players who are consistently active in the segment, indicates how the field may evolve in the future, and identifies which countries could dominate a specific technology. Alongside, the accuracy of the soft computing method for each particular practice is also reported, indicating the feasibility of the technique. The novelty of this article lies in the patent landscape analysis compared with the other data, together with the discussion of applying computational techniques to various industrial practices. The progress of various engineering applications, integrated with patent landscape analysis, must be envisaged for a better understanding of the future of these applications and for improved productivity.
{"title":"Application‐Based Review of Soft Computational Methods to Enhance Industrial Practices Abetted by the Patent Landscape Analysis","authors":"S. Tamilselvan, G. Dhanalakshmi, D. Balaji, L. Rajeshkumar","doi":"10.1002/widm.1564","DOIUrl":"https://doi.org/10.1002/widm.1564","url":null,"abstract":"Soft computing is a collective methodology that touches all engineering and technology fields owing to its easiness in solving various problems while comparing the conventional methods. Many analytical methods are taken over by this soft computing technique and resolve it accurately and the soft computing has given a paradigm shift. The flexibility in soft computing results in swift knowledge acquisition processing and the information supply renders versatile and affordable technological system. Besides, the accuracy with which the soft computing technique predicts the parameters has transformed the industrial productivity to a whole new level. The interest of this article focuses on versatile applications of SC methods to forecast the technological changes which intend to reorient the progress of various industries, and this is ascertained by a patent landscape analysis. The patent landscape revealed the players who are in the segment consistently and this also provides how this field moves on in the future and who could be a dominant country for a specific technology. Alongside, the accuracy of the soft computing method for a particular practice has also been mentioned indicating the feasibility of the technique. The novel part of this article lies in patent landscape analysis compared with the other data while the other part is the discussion of application of computational techniques to various industrial practices. The progress of various engineering applications integrating them with the patent landscape analysis must be envisaged for a better understanding of the future of all these applications resulting in an improved productivity.","PeriodicalId":501013,"journal":{"name":"WIREs Data Mining and Knowledge Discovery","volume":"26 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142561881","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Systematic literature reviews (SLRs) are essential for researchers to keep up with past and recent research in their domains. However, the rapid growth in knowledge creation and the rising number of publications have made this task increasingly complex and challenging. Moreover, most systematic literature reviews are performed manually, which requires significant effort and creates potential bias. The risk of bias is particularly relevant in the data synthesis task, where researchers interpret each study's evidence and summarize the results. This study uses an experimental approach to explore using machine learning (ML) techniques in the SLR process. Specifically, this study replicates a study that manually performed sentiment analysis for the data synthesis step to determine the polarity (negative or positive) of evidence extracted from studies in the field of agile methodology. This study employs a lexicon‐based approach to sentiment analysis and achieves an accuracy rate of approximately 86.5% in identifying study evidence polarity.
{"title":"Using Machine Learning for Systematic Literature Review Case in Point: Agile Software Development","authors":"Itzik David, Roy Gelbard","doi":"10.1002/widm.1569","DOIUrl":"https://doi.org/10.1002/widm.1569","url":null,"abstract":"Systematic literature reviews (SLRs) are essential for researchers to keep up with past and recent research in their domains. However, the rapid growth in knowledge creation and the rising number of publications have made this task increasingly complex and challenging. Moreover, most systematic literature reviews are performed manually, which requires significant effort and creates potential bias. The risk of bias is particularly relevant in the data synthesis task, where researchers interpret each study's evidence and summarize the results. This study uses an experimental approach to explore using machine learning (ML) techniques in the SLR process. Specifically, this study replicates a study that manually performed sentiment analysis for the <jats:italic>data synthesis</jats:italic> step to determine the polarity (negative or positive) of evidence extracted from studies in the field of agile methodology. This study employs a lexicon‐based approach to sentiment analysis and achieves an accuracy rate of approximately 86.5% in identifying study evidence polarity.","PeriodicalId":501013,"journal":{"name":"WIREs Data Mining and Knowledge Discovery","volume":"237 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142536805","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Reliable deployment of machine learning models such as neural networks continues to be challenging due to several limitations. Some of the main shortcomings are the lack of interpretability and the lack of robustness against adversarial examples or out‐of‐distribution inputs. In this paper, we comprehensively review the possibilities and limits of adversarial attacks for explainable machine learning models. First, we extend the notion of adversarial examples to fit in explainable machine learning scenarios where a human assesses not only the input and the output classification, but also the explanation of the model's decision. Next, we propose a comprehensive framework to study whether (and how) adversarial examples can be generated for explainable models under human assessment. Based on this framework, we provide a structured review of the diverse attack paradigms existing in this domain, identify current gaps and future research directions, and illustrate the main attack paradigms discussed. Furthermore, our framework considers a wide range of relevant yet often ignored factors such as the type of problem, the user expertise or the objective of the explanations, in order to identify the attack strategies that should be adopted in each scenario to successfully deceive the model (and the human). The intention of these contributions is to serve as a basis for a more rigorous and realistic study of adversarial examples in the field of explainable machine learning.
{"title":"Adversarial Attacks in Explainable Machine Learning: A Survey of Threats Against Models and Humans","authors":"Jon Vadillo, Roberto Santana, Jose A. Lozano","doi":"10.1002/widm.1567","DOIUrl":"https://doi.org/10.1002/widm.1567","url":null,"abstract":"Reliable deployment of machine learning models such as neural networks continues to be challenging due to several limitations. Some of the main shortcomings are the lack of interpretability and the lack of robustness against adversarial examples or out‐of‐distribution inputs. In this paper, we comprehensively review the possibilities and limits of adversarial attacks for explainable machine learning models. First, we extend the notion of adversarial examples to fit in explainable machine learning scenarios where a human assesses not only the input and the output classification, but also the explanation of the model's decision. Next, we propose a comprehensive framework to study whether (and how) adversarial examples can be generated for explainable models under human assessment. Based on this framework, we provide a structured review of the diverse attack paradigms existing in this domain, identify current gaps and future research directions, and illustrate the main attack paradigms discussed. Furthermore, our framework considers a wide range of relevant yet often ignored factors such as the type of problem, the user expertise or the objective of the explanations, in order to identify the attack strategies that should be adopted in each scenario to successfully deceive the model (and the human). The intention of these contributions is to serve as a basis for a more rigorous and realistic study of adversarial examples in the field of explainable machine learning.","PeriodicalId":501013,"journal":{"name":"WIREs Data Mining and Knowledge Discovery","volume":"54 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142536806","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Traditional clustering algorithms are not appropriate for large real‐world datasets or big data, owing to computational expense and scalability issues. As a solution, research over the last decade has headed towards distributed clustering using the MapReduce framework. This study conducts a bibliometric review to assess, establish, and measure the patterns and trends of MapReduce‐based partitioning, hierarchical, and density clustering algorithms over the past decade (2013–2023). A digital text‐mining‐based comprehensive search technique with multiple field‐specific keywords, inclusion measures, and exclusion criteria is employed to obtain the research landscape from the Scopus database. The Scopus data is analyzed using the VOSviewer software tool and coded using the R statistical analysis tool. The analysis identifies the numbers of scholarly articles, the diversity of article sources, their impact and growth patterns, details of the most influential authors and co‐authors, the most cited articles, the most contributing affiliations and countries and their collaborations, the use of different keywords and their impact, and so forth. The study further explores the articles and reports the methodologies employed for designing MapReduce‐based counterparts of traditional partitioning, hierarchical, and density clustering algorithms, along with their optimizations and hybridizations. Finally, the study lists the main research challenges encountered in the past decade for MapReduce‐based partitioning, hierarchical, and density clustering, and it suggests possible areas for future research to contribute further in this field.
{"title":"Reflecting on a Decade of Evolution: MapReduce‐Based Advances in Partitioning‐Based, Hierarchical‐Based, and Density‐Based Clustering (2013–2023)","authors":"Tanvir Habib Sardar","doi":"10.1002/widm.1566","DOIUrl":"https://doi.org/10.1002/widm.1566","url":null,"abstract":"The traditional clustering algorithms are not appropriate for large real‐world datasets or big data, which is attributable to computational expensiveness and scalability issues. As a solution, the last decade's research headed towards distributed clustering using the MapReduce framework. This study conducts a bibliometric review to assess, establish, and measure the patterns and trends of the MapReduce‐based partitioning, hierarchical, and density clustering algorithms over the past decade (2013–2023). A digital text‐mining‐based comprehensive search technique with multiple field‐specific keywords, inclusion measures, and exclusion criteria is employed to obtain the research landscape from the Scopus database. The Scopus‐obtained data is analyzed using the VOSViewer software tool and coded using the R statistical analysis tool. The analysis identifies the numbers of scholarly articles, diversities of article sources, their impact and growth patterns, details of most influential authors and co‐authors, most cited articles, most contributing affiliations and countries, and their collaborations, use of different keywords and their impact, and so forth. The study further explores the articles and reports the methodologies employed for designing MapReduce‐based counterparts of traditional partitioning, hierarchical, and density clustering algorithms and their optimizations and hybridizations. Finally, the study lists the main research challenges encountered in the past decade for MapReduce‐based partitioning, hierarchical, and density clustering. It suggests possible areas for future research to contribute further in this field.","PeriodicalId":501013,"journal":{"name":"WIREs Data Mining and Knowledge Discovery","volume":"14 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142486813","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Explainability in the field of event detection is a new, emerging research area. For practitioners and users alike, explainability is essential to ensuring that models are widely adopted and trusted. Several research efforts have focused on the efficacy and efficiency of event detection. However, a human‐centric explanation approach to existing event detection solutions is still lacking. This paper presents an overview of a conceptual framework for human‐centric, semantics‐based explainable event detection, with the acronym HUSEED. The framework considers the affordances of XAI and semantic technologies for human‐comprehensible explanations of events, facilitating 5W1H explanations (who did what, when, where, why, and how). Providing this kind of explanation will lead to trustworthy, unambiguous, and transparent event detection models with a higher possibility of uptake by users in various domains of application. We illustrate the applicability of the proposed framework with two use cases involving first story detection and fake news detection.
{"title":"A Conceptual Framework for Human‐Centric and Semantics‐Based Explainable Event Detection","authors":"Taiwo Kolajo, Olawande Daramola","doi":"10.1002/widm.1565","DOIUrl":"https://doi.org/10.1002/widm.1565","url":null,"abstract":"Explainability in the field of event detection is a new emerging research area. For practitioners and users alike, explainability is essential to ensuring that models are widely adopted and trusted. Several research efforts have focused on the efficacy and efficiency of event detection. However, a human‐centric explanation approach to existing event detection solutions is still lacking. This paper presents an overview of a conceptual framework for human‐centric semantic‐based explainable event detection with the acronym HUSEED. The framework considered the affordances of XAI and semantics technologies for human‐comprehensible explanations of events to facilitate 5W1H explanations (Who did what, when, where, why, and how). Providing this kind of explanation will lead to trustworthy, unambiguous, and transparent event detection models with a higher possibility of uptake by users in various domains of application. We illustrated the applicability of the proposed framework by using two use cases involving first story detection and fake news detection.","PeriodicalId":501013,"journal":{"name":"WIREs Data Mining and Knowledge Discovery","volume":"11 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142448763","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}