Process mining (PM) comprises a variety of methods for discovering information about processes from their execution logs. Some of them, such as trace clustering, trace classification, and anomalous trace detection, require a preliminary preprocessing step in which the raw data is encoded into a numerical feature space. To this end, encoding techniques are used to generate vectorial representations of process traces. Most of the PM literature provides trace encoding techniques that look only at the control flow, that is, they encode only the sequence of activities that characterizes a process trace, disregarding other process data that is fundamental for effectively describing process behavior. To fill this gap, in this article we present 19 trace encoding methods that work in a multi‐perspective manner, that is, by embedding event and trace attributes in addition to activity names into the vectorial representations of process traces. We also provide an extensive experimental study in which these techniques are applied to real‐life datasets and compared to each other.
{"title":"Trace Encoding Techniques for Multi‐Perspective Process Mining: A Comparative Study","authors":"Antonino Rullo, Farhana Alam, Edoardo Serra","doi":"10.1002/widm.1573","DOIUrl":"https://doi.org/10.1002/widm.1573","url":null,"abstract":"Process mining (PM) comprises a variety of methods for discovering information about processes from their execution logs. Some of them, such as trace clustering, trace classification, and anomalous trace detection require a preliminary preprocessing step in which the raw data is encoded into a numerical feature space. To this end, encoding techniques are used to generate vectorial representations of process traces. Most of the PM literature provides trace encoding techniques that look at the control flow, that is, only encode the sequence of activities that characterize a process trace disregarding other process data that is fundamental for effectively describing the process behavior. To fill this gap, in this article we show 19 trace encoding methods that work in a multi‐perspective manner, that is, by embedding events and trace attributes in addition to activity names into the vectorial representations of process traces. We also provide an extensive experimental study where these techniques are applied to real‐life datasets and compared to each other.","PeriodicalId":501013,"journal":{"name":"WIREs Data Mining and Knowledge Discovery","volume":"21 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-12-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142804570","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In recent years, machine learning (ML) has witnessed a paradigm shift in kernel function selection, which is pivotal in optimizing various ML models. Despite multiple studies on its significance, a comprehensive understanding of kernel function selection, particularly with respect to model performance, has yet to be established. Challenges remain in selecting and optimizing kernel functions to improve model performance and efficiency. The study investigates how the gamma and cost parameters influence performance metrics in multi‐class classification tasks using various kernel‐based algorithms. Through sensitivity analysis, the impact of these parameters on classification performance and computational efficiency is assessed. The experimental setup involves deploying ML models using four kernel‐based algorithms: Support Vector Machine, Radial Basis Function, Polynomial Kernel, and Sigmoid Kernel. Data preparation includes text processing, categorization, and feature extraction using TfidfVectorizer, followed by model training and validation. Results indicate that a Support Vector Machine with default settings and a Radial Basis Function kernel consistently outperforms the polynomial and sigmoid kernels. Adjusting gamma improves model accuracy and precision, highlighting its role in capturing complex relationships. Regularization cost parameters, however, show minimal impact on performance. The study also reveals that configurations with moderate gamma values achieve a better balance between performance and computational time compared to higher gamma values or no gamma adjustment. The findings underscore the delicate balance between model performance and computational efficiency by highlighting the trade‐offs between model complexity and efficiency.
{"title":"Hyper‐Parameter Optimization of Kernel Functions on Multi‐Class Text Categorization: A Comparative Evaluation","authors":"Michael Loki, Agnes Mindila, Wilson Cheruiyot","doi":"10.1002/widm.1572","DOIUrl":"https://doi.org/10.1002/widm.1572","url":null,"abstract":"In recent years, machine learning (ML) has witnessed a paradigm shift in kernel function selection, which is pivotal in optimizing various ML models. Despite multiple studies about its significance, a comprehensive understanding of kernel function selection, particularly about model performance, still needs to be explored. Challenges remain in selecting and optimizing kernel functions to improve model performance and efficiency. The study investigates how gamma parameter and cost parameter influence performance metrics in multi‐class classification tasks using various kernel‐based algorithms. Through sensitivity analysis, the impact of these parameters on classification performance and computational efficiency is assessed. The experimental setup involves deploying ML models using four kernel‐based algorithms: Support Vector Machine, Radial Basis Function, Polynomial Kernel, and Sigmoid Kernel. Data preparation includes text processing, categorization, and feature extraction using TfidfVectorizer, followed by model training and validation. Results indicate that Support Vector Machine with default settings and Radial Basis Function kernel consistently outperforms polynomial and sigmoid kernels. Adjusting gamma improves model accuracy and precision, highlighting its role in capturing complex relationships. Regularization cost parameters, however, show minimal impact on performance. The study also reveals that configurations with moderate gamma values achieve better balance between performance and computational time compared to higher gamma values or no gamma adjustment. The findings underscore the delicate balance between model performance and computational efficiency by highlighting the trade‐offs between model complexity and efficiency.","PeriodicalId":501013,"journal":{"name":"WIREs Data Mining and Knowledge Discovery","volume":"84 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-11-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142753720","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
To improve data analysis and feature learning, this study compares the effectiveness of quantum dimensionality reduction (qDR) techniques with that of classical ones. We investigate several qDR techniques, namely quantum Gaussian distribution adaptation (qGDA), quantum principal component analysis (qPCA), quantum linear discriminant analysis (qLDA), and quantum t‐SNE (qt‐SNE), on a variety of datasets. The Olivetti Faces, Wine, Breast Cancer, Digits, and Iris datasets are used in this investigation. The effectiveness of these techniques is assessed through comparative evaluations against well‐established classical approaches, such as classical PCA (cPCA), classical LDA (cLDA), and classical GDA (cGDA), using well‐established metrics like loss, fidelity, and processing time. The findings show that cPCA produced positive results with the lowest loss and highest fidelity when used on the Iris dataset. On the other hand, quantum uniform manifold approximation and projection (qUMAP) performs well and shows strong fidelity when tested on the Wine dataset, while ct‐SNE shows mediocre performance on the Digits dataset. Isomap and locally linear embedding (LLE) perform differently depending on the dataset. Notably, LLE showed the largest loss and lowest fidelity on the Olivetti Faces dataset. The hypothesis testing findings showed that the qDR strategies did not significantly outperform the classical techniques in terms of maintaining pertinent information from quantum datasets. More specifically, the outcomes of paired t‐tests show that, in terms of the ability to capture complex patterns, there are no statistically significant differences between cPCA and qPCA, cLDA and qLDA, or cGDA and qGDA. According to the assessments of mutual information (MI) and clustering accuracy, qPCA may be able to recognize patterns more clearly than standardized cPCA. Nevertheless, there is no discernible improvement of the qLDA and qGDA approaches over their classical counterparts.
{"title":"Dimensionality Reduction for Data Analysis With Quantum Feature Learning","authors":"Shyam R. Sihare","doi":"10.1002/widm.1568","DOIUrl":"https://doi.org/10.1002/widm.1568","url":null,"abstract":"To improve data analysis and feature learning, this study compares the effectiveness of quantum dimensionality reduction (qDR) techniques to classical ones. In this study, we investigate several qDR techniques on a variety of datasets such as quantum Gaussian distribution adaptation (qGDA), quantum principal component analysis (qPCA), quantum linear discriminant analysis (qLDA), and quantum t‐SNE (qt‐SNE). The Olivetti Faces, Wine, Breast Cancer, Digits, and Iris are among the datasets used in this investigation. Through comparison evaluations against well‐established classical approaches, such as classical PCA (cPCA), classical LDA (cLDA), and classical GDA (cGDA), and using well‐established metrics like loss, fidelity, and processing time, the effectiveness of these techniques is assessed. The findings show that cPCA produced positive results with the lowest loss and highest fidelity when used on the Iris dataset. On the other hand, quantum uniform manifold approximation and projection (qUMAP) performs well and shows strong fidelity when tested against the Wine dataset, but ct‐SNE shows mediocre performance against the Digits dataset. Isomap and locally linear embedding (LLE) function differently depending on the dataset. Notably, LLE showed the largest loss and lowest fidelity on the Olivetti Faces dataset. The hypothesis testing findings showed that the qDR strategies did not significantly outperform the classical techniques in terms of maintaining pertinent information from quantum datasets. More specifically, the outcomes of paired <jats:italic>t</jats:italic>‐tests show that when it comes to the ability to capture complex patterns, there are no statistically significant differences between the cPCA and qPCA, the cLDA and qLDA, and the cGDA and qGDA. According to the findings of the assessments of mutual information (MI) and clustering accuracy, qPCA may be able to recognize patterns more clearly than standardized cPCA. Nevertheless, there is no discernible improvement between the qLDA and qGDA approaches and their classical counterparts.","PeriodicalId":501013,"journal":{"name":"WIREs Data Mining and Knowledge Discovery","volume":"71 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-11-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142678437","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In customer‐oriented systems, customer lifetime value (CLV) has been of significant importance for academia and marketing practitioners, especially within the scope of analytical modeling. CLV is a critical approach to managing and organizing a company's profitability. With the vast availability of consumer data, business analytics (BA) tools and approaches, alongside CLV models, have been applied to gain deeper insights into customer behaviors and decision‐making processes. Despite the recognized importance of CLV, there is a noticeable gap in comprehensive analyses and reviews of BA techniques applied to CLV. This study aims to fill this gap by conducting a thorough survey of the state‐of‐the‐art investigations on CLV models integrated with BA approaches, thereby contributing to a research agenda in this field. The review methodology consists of three main steps: identification of relevant studies, creating a coding plan, and ensuring coding reliability. First, relevant studies were identified using predefined keywords. Next, a coding plan—one of the study's significant contributions—was developed to evaluate these studies comprehensively. Finally, the coding plan's reliability was tested by three experts before being applied to the selected studies. Additionally, specific evaluation criteria in the coding plan were implemented to introduce new insights. This study presents exciting and valuable results from various perspectives, providing a crucial reference for academic researchers and marketing practitioners interested in the intersection of BA and CLV.
{"title":"Business Analytics in Customer Lifetime Value: An Overview Analysis","authors":"Onur Dogan, Abdulkadir Hiziroglu, Ali Pisirgen, Omer Faruk Seymen","doi":"10.1002/widm.1571","DOIUrl":"https://doi.org/10.1002/widm.1571","url":null,"abstract":"In customer‐oriented systems, customer lifetime value (CLV) has been of significant importance for academia and marketing practitioners, especially within the scope of analytical modeling. CLV is a critical approach to managing and organizing a company's profitability. With the vast availability of consumer data, business analytics (BA) tools and approaches, alongside CLV models, have been applied to gain deeper insights into customer behaviors and decision‐making processes. Despite the recognized importance of CLV, there is a noticeable gap in comprehensive analyses and reviews of BA techniques applied to CLV. This study aims to fill this gap by conducting a thorough survey of the state‐of‐the‐art investigations on CLV models integrated with BA approaches, thereby contributing to a research agenda in this field. The review methodology consists of three main steps: identification of relevant studies, creating a coding plan, and ensuring coding reliability. First, relevant studies were identified using predefined keywords. Next, a coding plan—one of the study's significant contributions—was developed to evaluate these studies comprehensively. Finally, the coding plan's reliability was tested by three experts before being applied to the selected studies. Additionally, specific evaluation criteria in the coding plan were implemented to introduce new insights. This study presents exciting and valuable results from various perspectives, providing a crucial reference for academic researchers and marketing practitioners interested in the intersection of BA and CLV.","PeriodicalId":501013,"journal":{"name":"WIREs Data Mining and Knowledge Discovery","volume":"37 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142594489","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dissolution refers to the process in which solvent molecules and solute molecules attract and combine with each other. The extensive solubility data generated from the dissolution of various compounds under different conditions is distributed across structured or semi‐structured formats in various media, such as text, web pages, tables, images, and databases. These data exhibit multi‐source and unstructured features, aligning with the typical 5 V characteristics of big data. A solubility big data technology system has emerged from the fusion of solubility data and big data technologies. However, the acquisition, fusion, storage, representation, and utilization of solubility big data are encountering new challenges. Knowledge graphs, known as extensive systems for representing and applying knowledge, can effectively describe entities, concepts, and relations across diverse domains. The construction of a solubility big data knowledge graph holds substantial value for the retrieval, analysis, utilization, and visualization of solubility knowledge. As an initial contribution intended to stimulate further work, this paper focuses on the solubility big data knowledge graph and, firstly, summarizes the architecture of solubility knowledge graph construction. Secondly, key technologies such as knowledge extraction, knowledge fusion, and knowledge reasoning for solubility big data are emphasized, along with a summary of the common machine learning methods used in knowledge graph construction. Furthermore, this paper explores application scenarios, such as knowledge question answering and recommender systems for solubility big data. Finally, it presents a prospective view of the shortcomings, challenges, and future directions related to the construction of the solubility big data knowledge graph. This article proposes the research direction of the solubility big data knowledge graph, which can provide technical references for constructing a solubility knowledge graph. At the same time, it serves as a comprehensive medium for describing data, resources, and their applications across diverse fields such as chemistry, materials, biology, energy, and medicine. It further aids in knowledge retrieval and mining, analysis and utilization, and visualization across various disciplines.
{"title":"Knowledge Graph for Solubility Big Data: Construction and Applications","authors":"Xiao Haiyang, Yan Ruomei, Wu Yan, Guan Lixin, Li Mengshan","doi":"10.1002/widm.1570","DOIUrl":"https://doi.org/10.1002/widm.1570","url":null,"abstract":"Dissolution refers to the process in which solvent molecules and solute molecules attract and combine with each other. The extensive solubility data generated from the dissolution of various compounds under different conditions, is distributed across structured or semi‐structured formats in various media, such as text, web pages, tables, images, and databases. These data exhibit multi‐source and unstructured features, aligning with the typical 5 V characteristics of big data. A solubility big data technology system has emerged under the fusion of solubility data and big data technologies. However, the acquisition, fusion, storage, representation, and utilization of solubility big data are encountering new challenges. Knowledge Graphs, known as extensive systems for representing and applying knowledge, can effectively describe entities, concepts, and relations across diverse domains. The construction of solubility big data knowledge graph holds substantial value in the retrieval, analysis, utilization, and visualization of solubility knowledge. Throwing out a brick to attract a jade, this paper focuses on the solubility big data knowledge graph and, firstly, summarizes the architecture of solubility knowledge graph construction. Secondly, the key technologies such as knowledge extraction, knowledge fusion, and knowledge reasoning of solubility big data are emphasized, along with summarizing the common machine learning methods in knowledge graph construction. Furthermore, this paper explores application scenarios, such as knowledge question answering and recommender systems for solubility big data. Finally, it presents a prospective view of the shortcomings, challenges, and future directions related to the construction of solubility big data knowledge graph. This article proposes the research direction of solubility big data knowledge graph, which can provide technical references for constructing a solubility knowledge graph. At the same time, it serves as a comprehensive medium for describing data, resources, and their applications across diverse fields such as chemistry, materials, biology, energy, medicine, and so on. It further aids in knowledge retrieval and mining, analysis and utilization, and visualization across various disciplines.","PeriodicalId":501013,"journal":{"name":"WIREs Data Mining and Knowledge Discovery","volume":"61 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142563113","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Soft computing (SC) is a collective methodology that touches all engineering and technology fields owing to the ease with which it solves various problems compared with conventional methods. Many analytical problems once handled by conventional methods are now resolved accurately by soft computing techniques, marking a paradigm shift. The flexibility of soft computing enables swift knowledge acquisition and processing, and the resulting information supply renders technological systems versatile and affordable. Besides, the accuracy with which soft computing techniques predict parameters has transformed industrial productivity to a whole new level. This article focuses on versatile applications of SC methods to forecast technological changes that are set to reorient the progress of various industries, as ascertained by a patent landscape analysis. The patent landscape reveals the players who are consistently active in the segment, indicates how the field may evolve in the future, and identifies which countries could dominate a specific technology. Alongside, the accuracy of the soft computing method for each particular practice is also reported, indicating the feasibility of the technique. The novelty of this article lies in the patent landscape analysis compared with the other data, together with the discussion of applying computational techniques to various industrial practices. The progress of various engineering applications, integrated with patent landscape analysis, must be envisaged for a better understanding of the future of these applications and for improved productivity.
{"title":"Application‐Based Review of Soft Computational Methods to Enhance Industrial Practices Abetted by the Patent Landscape Analysis","authors":"S. Tamilselvan, G. Dhanalakshmi, D. Balaji, L. Rajeshkumar","doi":"10.1002/widm.1564","DOIUrl":"https://doi.org/10.1002/widm.1564","url":null,"abstract":"Soft computing is a collective methodology that touches all engineering and technology fields owing to its easiness in solving various problems while comparing the conventional methods. Many analytical methods are taken over by this soft computing technique and resolve it accurately and the soft computing has given a paradigm shift. The flexibility in soft computing results in swift knowledge acquisition processing and the information supply renders versatile and affordable technological system. Besides, the accuracy with which the soft computing technique predicts the parameters has transformed the industrial productivity to a whole new level. The interest of this article focuses on versatile applications of SC methods to forecast the technological changes which intend to reorient the progress of various industries, and this is ascertained by a patent landscape analysis. The patent landscape revealed the players who are in the segment consistently and this also provides how this field moves on in the future and who could be a dominant country for a specific technology. Alongside, the accuracy of the soft computing method for a particular practice has also been mentioned indicating the feasibility of the technique. The novel part of this article lies in patent landscape analysis compared with the other data while the other part is the discussion of application of computational techniques to various industrial practices. The progress of various engineering applications integrating them with the patent landscape analysis must be envisaged for a better understanding of the future of all these applications resulting in an improved productivity.","PeriodicalId":501013,"journal":{"name":"WIREs Data Mining and Knowledge Discovery","volume":"26 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142561881","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Systematic literature reviews (SLRs) are essential for researchers to keep up with past and recent research in their domains. However, the rapid growth in knowledge creation and the rising number of publications have made this task increasingly complex and challenging. Moreover, most systematic literature reviews are performed manually, which requires significant effort and creates potential bias. The risk of bias is particularly relevant in the data synthesis task, where researchers interpret each study's evidence and summarize the results. This study uses an experimental approach to explore using machine learning (ML) techniques in the SLR process. Specifically, this study replicates a study that manually performed sentiment analysis for the data synthesis step to determine the polarity (negative or positive) of evidence extracted from studies in the field of agile methodology. This study employs a lexicon‐based approach to sentiment analysis and achieves an accuracy rate of approximately 86.5% in identifying study evidence polarity.
{"title":"Using Machine Learning for Systematic Literature Review Case in Point: Agile Software Development","authors":"Itzik David, Roy Gelbard","doi":"10.1002/widm.1569","DOIUrl":"https://doi.org/10.1002/widm.1569","url":null,"abstract":"Systematic literature reviews (SLRs) are essential for researchers to keep up with past and recent research in their domains. However, the rapid growth in knowledge creation and the rising number of publications have made this task increasingly complex and challenging. Moreover, most systematic literature reviews are performed manually, which requires significant effort and creates potential bias. The risk of bias is particularly relevant in the data synthesis task, where researchers interpret each study's evidence and summarize the results. This study uses an experimental approach to explore using machine learning (ML) techniques in the SLR process. Specifically, this study replicates a study that manually performed sentiment analysis for the <jats:italic>data synthesis</jats:italic> step to determine the polarity (negative or positive) of evidence extracted from studies in the field of agile methodology. This study employs a lexicon‐based approach to sentiment analysis and achieves an accuracy rate of approximately 86.5% in identifying study evidence polarity.","PeriodicalId":501013,"journal":{"name":"WIREs Data Mining and Knowledge Discovery","volume":"237 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142536805","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Reliable deployment of machine learning models such as neural networks continues to be challenging due to several limitations. Some of the main shortcomings are the lack of interpretability and the lack of robustness against adversarial examples or out‐of‐distribution inputs. In this paper, we comprehensively review the possibilities and limits of adversarial attacks for explainable machine learning models. First, we extend the notion of adversarial examples to fit in explainable machine learning scenarios where a human assesses not only the input and the output classification, but also the explanation of the model's decision. Next, we propose a comprehensive framework to study whether (and how) adversarial examples can be generated for explainable models under human assessment. Based on this framework, we provide a structured review of the diverse attack paradigms existing in this domain, identify current gaps and future research directions, and illustrate the main attack paradigms discussed. Furthermore, our framework considers a wide range of relevant yet often ignored factors such as the type of problem, the user expertise or the objective of the explanations, in order to identify the attack strategies that should be adopted in each scenario to successfully deceive the model (and the human). The intention of these contributions is to serve as a basis for a more rigorous and realistic study of adversarial examples in the field of explainable machine learning.
{"title":"Adversarial Attacks in Explainable Machine Learning: A Survey of Threats Against Models and Humans","authors":"Jon Vadillo, Roberto Santana, Jose A. Lozano","doi":"10.1002/widm.1567","DOIUrl":"https://doi.org/10.1002/widm.1567","url":null,"abstract":"Reliable deployment of machine learning models such as neural networks continues to be challenging due to several limitations. Some of the main shortcomings are the lack of interpretability and the lack of robustness against adversarial examples or out‐of‐distribution inputs. In this paper, we comprehensively review the possibilities and limits of adversarial attacks for explainable machine learning models. First, we extend the notion of adversarial examples to fit in explainable machine learning scenarios where a human assesses not only the input and the output classification, but also the explanation of the model's decision. Next, we propose a comprehensive framework to study whether (and how) adversarial examples can be generated for explainable models under human assessment. Based on this framework, we provide a structured review of the diverse attack paradigms existing in this domain, identify current gaps and future research directions, and illustrate the main attack paradigms discussed. Furthermore, our framework considers a wide range of relevant yet often ignored factors such as the type of problem, the user expertise or the objective of the explanations, in order to identify the attack strategies that should be adopted in each scenario to successfully deceive the model (and the human). The intention of these contributions is to serve as a basis for a more rigorous and realistic study of adversarial examples in the field of explainable machine learning.","PeriodicalId":501013,"journal":{"name":"WIREs Data Mining and Knowledge Discovery","volume":"54 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142536806","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Traditional clustering algorithms are not appropriate for large real‐world datasets or big data, owing to computational expense and scalability issues. As a solution, research over the last decade has headed towards distributed clustering using the MapReduce framework. This study conducts a bibliometric review to assess, establish, and measure the patterns and trends of MapReduce‐based partitioning, hierarchical, and density clustering algorithms over the past decade (2013–2023). A digital text‐mining‐based comprehensive search technique with multiple field‐specific keywords, inclusion measures, and exclusion criteria is employed to obtain the research landscape from the Scopus database. The Scopus data is analyzed using the VOSviewer software tool and coded using the R statistical analysis tool. The analysis identifies the numbers of scholarly articles, the diversity of article sources, their impact and growth patterns, details of the most influential authors and co‐authors, the most cited articles, the most contributing affiliations and countries and their collaborations, the use of different keywords and their impact, and so forth. The study further explores the articles and reports the methodologies employed for designing MapReduce‐based counterparts of traditional partitioning, hierarchical, and density clustering algorithms, along with their optimizations and hybridizations. Finally, the study lists the main research challenges encountered in the past decade for MapReduce‐based partitioning, hierarchical, and density clustering, and it suggests possible areas for future research to contribute further in this field.
{"title":"Reflecting on a Decade of Evolution: MapReduce‐Based Advances in Partitioning‐Based, Hierarchical‐Based, and Density‐Based Clustering (2013–2023)","authors":"Tanvir Habib Sardar","doi":"10.1002/widm.1566","DOIUrl":"https://doi.org/10.1002/widm.1566","url":null,"abstract":"The traditional clustering algorithms are not appropriate for large real‐world datasets or big data, which is attributable to computational expensiveness and scalability issues. As a solution, the last decade's research headed towards distributed clustering using the MapReduce framework. This study conducts a bibliometric review to assess, establish, and measure the patterns and trends of the MapReduce‐based partitioning, hierarchical, and density clustering algorithms over the past decade (2013–2023). A digital text‐mining‐based comprehensive search technique with multiple field‐specific keywords, inclusion measures, and exclusion criteria is employed to obtain the research landscape from the Scopus database. The Scopus‐obtained data is analyzed using the VOSViewer software tool and coded using the R statistical analysis tool. The analysis identifies the numbers of scholarly articles, diversities of article sources, their impact and growth patterns, details of most influential authors and co‐authors, most cited articles, most contributing affiliations and countries, and their collaborations, use of different keywords and their impact, and so forth. The study further explores the articles and reports the methodologies employed for designing MapReduce‐based counterparts of traditional partitioning, hierarchical, and density clustering algorithms and their optimizations and hybridizations. Finally, the study lists the main research challenges encountered in the past decade for MapReduce‐based partitioning, hierarchical, and density clustering. It suggests possible areas for future research to contribute further in this field.","PeriodicalId":501013,"journal":{"name":"WIREs Data Mining and Knowledge Discovery","volume":"14 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142486813","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Explainability in the field of event detection is a new, emerging research area. For practitioners and users alike, explainability is essential to ensuring that models are widely adopted and trusted. Several research efforts have focused on the efficacy and efficiency of event detection. However, a human‐centric explanation approach to existing event detection solutions is still lacking. This paper presents an overview of a conceptual framework for human‐centric, semantics‐based explainable event detection, with the acronym HUSEED. The framework considers the affordances of XAI and semantic technologies for human‐comprehensible explanations of events, facilitating 5W1H explanations (who did what, when, where, why, and how). Providing this kind of explanation will lead to trustworthy, unambiguous, and transparent event detection models with a higher possibility of uptake by users in various domains of application. We illustrate the applicability of the proposed framework with two use cases involving first story detection and fake news detection.
{"title":"A Conceptual Framework for Human‐Centric and Semantics‐Based Explainable Event Detection","authors":"Taiwo Kolajo, Olawande Daramola","doi":"10.1002/widm.1565","DOIUrl":"https://doi.org/10.1002/widm.1565","url":null,"abstract":"Explainability in the field of event detection is a new emerging research area. For practitioners and users alike, explainability is essential to ensuring that models are widely adopted and trusted. Several research efforts have focused on the efficacy and efficiency of event detection. However, a human‐centric explanation approach to existing event detection solutions is still lacking. This paper presents an overview of a conceptual framework for human‐centric semantic‐based explainable event detection with the acronym HUSEED. The framework considered the affordances of XAI and semantics technologies for human‐comprehensible explanations of events to facilitate 5W1H explanations (Who did what, when, where, why, and how). Providing this kind of explanation will lead to trustworthy, unambiguous, and transparent event detection models with a higher possibility of uptake by users in various domains of application. We illustrated the applicability of the proposed framework by using two use cases involving first story detection and fake news detection.","PeriodicalId":501013,"journal":{"name":"WIREs Data Mining and Knowledge Discovery","volume":"11 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142448763","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}