This paper reports on a thorough analysis of the scientific literature that uses data and text mining to uncover knowledge from online reviews, given their importance as user-generated content. In this context, more than 12,000 papers indexed in the Scopus database within the last 15 years were extracted. Regarding the type of data, most previous studies focused on qualitative textual data, with fewer examining quantitative scores and/or characterizing reviewer profiles. In terms of application domains, information management and technology, e-commerce, and tourism stand out. It is also clear that other areas of potentially valuable application should be addressed in future research, such as arts and education, as well as more interdisciplinary approaches, namely in the spectrum of the social sciences.
{"title":"Data and text mining from online reviews: An automatic literature analysis","authors":"Sérgio Moro, P. Rita","doi":"10.1002/widm.1448","DOIUrl":"https://doi.org/10.1002/widm.1448","url":null,"abstract":"This paper reports on a thorough analysis of the scientific literature using data and text mining to uncover knowledge from online reviews due to their importance as user‐generated content. In this context, more than 12,000 papers were extracted from publications indexed in the Scopus database within the last 15 years. Regarding the type of data, most previous studies focused on qualitative textual data to perform their analysis, with fewer looking for quantitative scores and/or characterizing reviewer profiles. In terms of application domains, information management and technology, e‐commerce, and tourism stand out. It is also clear that other areas of potentially valuable applications should be addressed in future research, such as arts and education, as well as more interdisciplinary approaches, namely in the spectrum of the social sciences.","PeriodicalId":48970,"journal":{"name":"Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery","volume":"5 1","pages":""},"PeriodicalIF":7.8,"publicationDate":"2022-01-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87830719","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ana Rita Nogueira, Andrea Pugnana, S. Ruggieri, D. Pedreschi, João Gama
Causality is a complex concept whose development is rooted in several fields, such as statistics, economics, epidemiology, computer science, and philosophy. In recent years, the study of causal relationships has become a crucial part of the Artificial Intelligence community, as causality can be a key tool for overcoming some limitations of correlation-based Machine Learning systems. Causality research can generally be divided into two main branches: causal discovery and causal inference. The former focuses on obtaining causal knowledge directly from observational data. The latter aims to estimate the impact of a change in a certain variable on an outcome of interest. This article covers several methodologies that have been developed for both tasks. The survey not only addresses theoretical aspects but also provides a practical toolkit for interested researchers and practitioners, including software, datasets, and running examples.
{"title":"Methods and tools for causal discovery and causal inference","authors":"Ana Rita Nogueira, Andrea Pugnana, S. Ruggieri, D. Pedreschi, João Gama","doi":"10.1002/widm.1449","DOIUrl":"https://doi.org/10.1002/widm.1449","url":null,"abstract":"Causality is a complex concept, which roots its developments across several fields, such as statistics, economics, epidemiology, computer science, and philosophy. In recent years, the study of causal relationships has become a crucial part of the Artificial Intelligence community, as causality can be a key tool for overcoming some limitations of correlation‐based Machine Learning systems. Causality research can generally be divided into two main branches, that is, causal discovery and causal inference. The former focuses on obtaining causal knowledge directly from observational data. The latter aims to estimate the impact deriving from a change of a certain variable over an outcome of interest. This article aims at covering several methodologies that have been developed for both tasks. This survey does not only focus on theoretical aspects. But also provides a practical toolkit for interested researchers and practitioners, including software, datasets, and running examples.","PeriodicalId":48970,"journal":{"name":"Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery","volume":"90 1","pages":""},"PeriodicalIF":7.8,"publicationDate":"2022-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86968307","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
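To make the causal inference branch described above concrete, here is a minimal sketch of one classic estimator, backdoor adjustment: the effect of a binary treatment T on an outcome Y is estimated by stratifying on an observed confounder Z. This is an illustrative toy, not the article's toolkit; the function name and sample data are invented, and Z is assumed to satisfy the backdoor criterion.

```python
# Hypothetical sketch of backdoor adjustment for causal inference:
# estimate E[Y|do(T=1)] - E[Y|do(T=0)] by stratifying on a single
# observed confounder Z (names and data are purely illustrative).
from collections import defaultdict

def backdoor_effect(samples):
    """samples: list of (z, t, y) tuples with binary treatment t.
    Assumes Z blocks all backdoor paths from T to Y."""
    by_zt = defaultdict(list)   # outcomes grouped by (z, t)
    z_counts = defaultdict(int) # marginal counts of z
    for z, t, y in samples:
        by_zt[(z, t)].append(y)
        z_counts[z] += 1
    n = len(samples)
    effect = 0.0
    for z, cnt in z_counts.items():
        p_z = cnt / n
        mean = lambda t: (sum(by_zt[(z, t)]) / len(by_zt[(z, t)])
                          if by_zt[(z, t)] else 0.0)
        # weight each stratum's treated-vs-control contrast by P(Z=z)
        effect += p_z * (mean(1) - mean(0))
    return effect

data = [(0, 0, 0), (0, 0, 0), (0, 1, 1), (0, 1, 1),
        (1, 0, 1), (1, 0, 1), (1, 1, 1), (1, 1, 1)]
print(backdoor_effect(data))  # → 0.5
```

In this toy dataset the treatment raises the outcome only in the z=0 stratum, so the adjusted effect is 0.5 rather than the naive difference of group means.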
In communication, textual data are a vital attribute. In all languages, the meaning of ambiguous or polysemous words changes depending on the context in which they are used. Determining the correct meaning of an ambiguous word is a challenging task in natural language processing (NLP). Word sense disambiguation (WSD) is an NLP process to analyze and determine the correct meaning of polysemous words in a text. WSD is a computational linguistics task that automatically identifies the set of senses of a polysemous word. Based on the context in which a word appears, WSD recognizes and tags the word with its correct, a priori known meaning. Semitic languages like Arabic pose even greater challenges than other languages, since Arabic texts often lack diacritics and standardization, and available resources are in massive shortage. Recently, many approaches and techniques have been suggested to solve word ambiguity dilemmas in many different ways and for several languages. In this review paper, an extensive survey of research works is presented, seeking to solve Arabic word sense disambiguation (AWSD) with the existing AWSD datasets.
{"title":"A comprehensive review on Arabic word sense disambiguation for natural language processing applications","authors":"S. Kaddoura, R. D. Ahmed, D. JudeHemanth","doi":"10.1002/widm.1447","DOIUrl":"https://doi.org/10.1002/widm.1447","url":null,"abstract":"In communication, textual data are a vital attribute. In all languages, ambiguous or polysemous words' meaning changes depending on the context in which they are used. The ability to determine the ambiguous word's correct meaning is a Know‐distill challenging task in natural language processing (NLP). Word sense disambiguation (WSD) is an NLP process to analyze and determine the correct meaning of polysemous words in a text. WSD is a computational linguistics task that automatically identifies the polysemous word's set of senses. Based on the context some word comes into view, WSD recognizes and tags the word to its correct priori known meaning. Semitic languages like Arabic have even more significant challenges than other languages since Arabic lacks diacritics, standardization, and a massive shortage of available resources. Recently, many approaches and techniques have been suggested to solve word ambiguity dilemmas in many different ways and several languages. 
In this review paper, an extensive survey of research works is presented, seeking to solve Arabic word sense disambiguation with the existing AWSD datasets.","PeriodicalId":48970,"journal":{"name":"Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery","volume":"15 1","pages":""},"PeriodicalIF":7.8,"publicationDate":"2022-01-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82501286","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
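One of the oldest WSD baselines covered by surveys of this kind is the simplified Lesk algorithm: choose the sense whose dictionary gloss shares the most words with the word's context. The tiny English sense inventory below is invented for illustration only (real AWSD systems work over Arabic resources).

```python
# Simplified Lesk word sense disambiguation: score each candidate sense
# by the word overlap between its gloss and the ambiguous word's context.
def lesk(context_words, senses):
    """senses: dict mapping sense id -> gloss string.
    Returns the sense id with maximal overlap with the context."""
    ctx = set(w.lower() for w in context_words)
    def overlap(gloss):
        return len(ctx & set(gloss.lower().split()))
    return max(senses, key=lambda s: overlap(senses[s]))

# Invented two-sense inventory for the ambiguous word "bank".
senses = {
    "bank#finance": "a financial institution that accepts deposits of money",
    "bank#river": "sloping land beside a body of water such as a river",
}
context = "he deposited money at the bank on friday".split()
print(lesk(context, senses))  # → bank#finance
```

The "money" overlap with the finance gloss decides the tie; richer variants extend glosses with related senses' definitions to reduce such sparsity.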
G. Manco, E. Ritacco, Antonino Rullo, D. Saccá, Edoardo Serra
The development of platforms and techniques for emerging Big Data and Machine Learning applications requires the availability of real-life datasets. A possible solution is to synthesize datasets that reflect patterns of real ones using a two-step approach: first, a real dataset X is analyzed to derive relevant patterns Z; then, such patterns are used for reconstructing a new dataset X′ that preserves the main characteristics of X. This survey explores two possible approaches: (1) constraint-based generation and (2) probabilistic generative modeling. The former is devised using inverse mining (IFM) techniques and consists of generating a dataset satisfying given support constraints on the itemsets of an input set, which are typically the frequent ones. By contrast, for the latter approach, recent developments in probabilistic generative modeling (PGM) are explored that model the generation as a sampling process from a parametric distribution, typically encoded as a neural network. The two approaches are compared by providing an overview of their instantiations for the case of discrete data and discussing their pros and cons.
{"title":"Machine learning methods for generating high dimensional discrete datasets","authors":"G. Manco, E. Ritacco, Antonino Rullo, D. Saccá, Edoardo Serra","doi":"10.1002/widm.1450","DOIUrl":"https://doi.org/10.1002/widm.1450","url":null,"abstract":"The development of platforms and techniques for emerging Big Data and Machine Learning applications requires the availability of real‐life datasets. A possible solution is to synthesize datasets that reflect patterns of real ones using a two‐step approach: first, a real dataset X is analyzed to derive relevant patterns Z and, then, to use such patterns for reconstructing a new dataset X′ that preserves the main characteristics of X . This survey explores two possible approaches: (1) Constraint‐based generation and (2) probabilistic generative modeling. The former is devised using inverse mining ( IFM ) techniques, and consists of generating a dataset satisfying given support constraints on the itemsets of an input set, that are typically the frequent ones. By contrast, for the latter approach, recent developments in probabilistic generative modeling ( PGM ) are explored that model the generation as a sampling process from a parametric distribution, typically encoded as neural network. 
The two approaches are compared by providing an overview of their instantiations for the case of discrete data and discussing their pros and cons.","PeriodicalId":48970,"journal":{"name":"Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery","volume":"93 1","pages":""},"PeriodicalIF":7.8,"publicationDate":"2022-01-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82095149","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
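The constraint-based direction can be illustrated with a deliberately naive sketch: build a transaction dataset in which each given itemset reaches at least its target support count. Real IFM formulations must also bound the support of all other itemsets, which this toy version does not attempt; all names and data are illustrative.

```python
# Naive sketch in the spirit of constraint-based generation: satisfy
# lower-bound support constraints by emitting each itemset as that many
# transactions. (Real inverse mining also constrains other itemsets.)
def generate(support_constraints):
    """support_constraints: dict mapping frozenset(items) -> min count."""
    dataset = []
    for itemset, count in support_constraints.items():
        dataset.extend([set(itemset)] * count)
    return dataset

def support(dataset, itemset):
    """Number of transactions containing all items of itemset."""
    return sum(1 for t in dataset if itemset <= t)

constraints = {frozenset({"a", "b"}): 3, frozenset({"c"}): 2}
ds = generate(constraints)
assert all(support(ds, set(i)) >= c for i, c in constraints.items())
print(len(ds))  # → 5
```

A PGM-style generator would instead fit a parametric distribution (e.g., a neural network) to X and sample X′ from it, trading hard guarantees on supports for flexibility.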
{"title":"The use of machine learning in sport outcome prediction: A review","authors":"Ines Horvat","doi":"10.1002/widm.1445","DOIUrl":"https://doi.org/10.1002/widm.1445","url":null,"abstract":"","PeriodicalId":48970,"journal":{"name":"Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery","volume":"117 1","pages":""},"PeriodicalIF":7.8,"publicationDate":"2022-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88239977","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-01-01; Epub Date: 2021-11-17; DOI: 10.1002/widm.1436
Cuneyt Gurcan Akcora, Yulia R Gel, Murat Kantarcioglu
Blockchain is an emerging technology that has enabled many applications, from cryptocurrencies to digital asset management and supply chains. Due to this surge of popularity, analyzing the data stored on blockchains poses a new critical challenge in data science. To assist data scientists in various analytic tasks for a blockchain, in this tutorial, we provide a systematic and comprehensive overview of the fundamental elements of blockchain network models. We discuss how we can abstract blockchain data as various types of networks and further use such associated network abstractions to reap important insights on blockchains' structure, organization, and functionality. This article is categorized under: Technologies > Data Preprocessing; Application Areas > Business and Industry; Fundamental Concepts of Data and Knowledge > Data Concepts; Fundamental Concepts of Data and Knowledge > Knowledge Representation.
{"title":"Blockchain networks: Data structures of Bitcoin, Monero, Zcash, Ethereum, Ripple, and Iota.","authors":"Cuneyt Gurcan Akcora, Yulia R Gel, Murat Kantarcioglu","doi":"10.1002/widm.1436","DOIUrl":"https://doi.org/10.1002/widm.1436","url":null,"abstract":"<p><p>Blockchain is an emerging technology that has enabled many applications, from cryptocurrencies to digital asset management and supply chains. Due to this surge of popularity, analyzing the data stored on blockchains poses a new critical challenge in data science. To assist data scientists in various analytic tasks for a blockchain, in this tutorial, we provide a systematic and comprehensive overview of the fundamental elements of blockchain network models. We discuss how we can abstract blockchain data as various types of networks and further use such associated network abstractions to reap important insights on blockchains' structure, organization, and functionality. This article is categorized under:Technologies > Data PreprocessingApplication Areas > Business and IndustryFundamental Concepts of Data and Knowledge > Data ConceptsFundamental Concepts of Data and Knowledge > Knowledge Representation.</p>","PeriodicalId":48970,"journal":{"name":"Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery","volume":"12 1","pages":"e1436"},"PeriodicalIF":7.8,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_pdf/9f/f6/WIDM-12-0.PMC9286592.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"40613886","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
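One of the simplest network abstractions of the kind the tutorial surveys is a directed address graph, where an edge from sender to receiver accumulates the total transferred amount. The sketch below is a generic illustration, not the paper's formalism; the sample transactions are invented.

```python
# Minimal sketch of a blockchain network abstraction: a weighted,
# directed address graph built from (sender, receiver, amount) transfers.
from collections import defaultdict

def address_graph(transactions):
    """transactions: iterable of (sender, receiver, amount) tuples.
    Returns nested dict {sender: {receiver: total_amount}}."""
    graph = defaultdict(lambda: defaultdict(float))
    for sender, receiver, amount in transactions:
        graph[sender][receiver] += amount  # aggregate repeated edges
    return graph

txs = [("addr1", "addr2", 0.5),
       ("addr1", "addr2", 0.2),
       ("addr2", "addr3", 0.6)]
g = address_graph(txs)
print(g["addr1"]["addr2"])  # → 0.7
```

On account-based chains such as Ethereum this mapping is direct; UTXO-based chains such as Bitcoin first require resolving transaction inputs and outputs to addresses.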
Process mining (PM) is a well-known research area comprising techniques, methodologies, and tools for analyzing processes in a variety of application domains. In the case of healthcare, processes are characterized by high variability in terms of activities, duration, and involved resources (e.g., physicians, nurses, administrators, machinery, etc.). Besides, the multitude of diseases suffered by patients housed in healthcare facilities makes medical contexts highly heterogeneous. As a result, understanding and analyzing healthcare processes are certainly not trivial tasks, and administrators and doctors look for tools and methods that can concretely support them in improving the healthcare services in which they are involved. In this context, PM has been increasingly used for a wide range of applications, as reported in some recent reviews. However, these reviews mainly focus on applications related to clinical pathways, while a systematic review of all possible applications is absent. In this article, we selected 172 papers published in the last 10 years that present applications of PM in the healthcare domain. The objective of this study is to help and guide researchers interested in the medical field to understand the main PM applications in healthcare, but also to suggest new ways to develop promising and not yet fully investigated applications. Moreover, our study could be of interest to practitioners who are considering applications of PM, as they can identify and choose PM algorithms, techniques, tools, methodologies, and approaches in light of previous successful experiences.
{"title":"Process mining applications in the healthcare domain: A comprehensive review","authors":"A. Guzzo, Antonino Rullo, E. Vocaturo","doi":"10.1002/widm.1442","DOIUrl":"https://doi.org/10.1002/widm.1442","url":null,"abstract":"Process mining (PM) is a well‐known research area that includes techniques, methodologies, and tools for analyzing processes in a variety of application domains. In the case of healthcare, processes are characterized by high variability in terms of activities, duration, and involved resources (e.g., physicians, nurses, administrators, machineries, etc.). Besides, the multitude of diseases that the patients housed in healthcare facilities suffer from makes medical contexts highly heterogeneous. As a result, understanding and analyzing healthcare processes are certainly not trivial tasks, and administrators and doctors look for tools and methods that can concretely support them in improving the healthcare services they are involved in. In this context, PM has been increasingly used for a wide range of applications as reported in some recent reviews. However, these reviews mainly focus on discussion on applications related to the clinical pathways, while a systematic review of all possible applications is absent. In this article, we selected 172 papers published in the last 10 years, that present applications of PM in the healthcare domain. The objective of this study is to help and guide researchers interested in the medical field to understand the main PM applications in the healthcare, but also to suggest new ways to develop promising and not yet fully investigated applications. 
Moreover, our study could be of interest for practitioners who are considering applications of PM, who can identify and choose PM algorithms, techniques, tools, methodologies, and approaches, toward what have been the experiences of success.","PeriodicalId":48970,"journal":{"name":"Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery","volume":"2 1","pages":""},"PeriodicalIF":7.8,"publicationDate":"2021-12-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"72951562","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
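A basic building block behind many PM techniques the review covers is the directly-follows graph, which counts how often one activity immediately succeeds another across the traces of an event log. The toy hospital log below is invented for illustration.

```python
# Minimal process-mining sketch: derive the directly-follows relation
# from an event log, where each trace is one case's ordered activities.
from collections import Counter

def directly_follows(log):
    """log: list of traces (each a list of activity names).
    Returns a Counter of (activity_a, activity_b) direct successions."""
    df = Counter()
    for trace in log:
        for a, b in zip(trace, trace[1:]):  # consecutive activity pairs
            df[(a, b)] += 1
    return df

log = [["register", "triage", "treat", "discharge"],
       ["register", "treat", "discharge"]]
df = directly_follows(log)
print(df[("treat", "discharge")])  # → 2
```

Discovery algorithms such as the alpha miner or heuristics miner start from exactly this relation to reconstruct a process model.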
Data mining is a process to extract unknown, hidden, and potentially useful information from data. However, the problem of data islands makes it arduous to collect and analyze scattered data, and mining data also raises privacy and security issues. A collaboratively decentralized approach called federated learning unites multiple participants to generate a shareable globally optimal model while keeping privacy-sensitive data on local devices, offering a promising path toward solving the problems of decentralized data and privacy protection. Though federated learning has been widely used, few systematic studies have been conducted on federated learning in data mining. Hence, different from prior reviews in this field, we make a comprehensive summary and provide a novel taxonomy of the applications of federated learning in data mining. This article starts with a thorough description of the relevant definitions and concepts, followed by an in-depth investigation of the challenges faced by federated learning. In this context, we elaborate on four major application areas of federated learning in data mining, namely education, healthcare, IoT, and intelligent transportation, and discuss them comprehensively. Finally, we discuss four promising directions for further research: privacy enhancement, improvement of communication efficiency, heterogeneous system processing, and reduction of economic costs.
{"title":"A survey on federated learning in data mining","authors":"Bin Yu, Wenjie Mao, Yihan Lv, Chen Zhang, Yu Xie","doi":"10.1002/widm.1443","DOIUrl":"https://doi.org/10.1002/widm.1443","url":null,"abstract":"Data mining is a process to extract unknown, hidden, and potentially useful information from data. But the problem of data island makes it arduous for people to collect and analyze scattered data, and there is also a privacy security issue when mining data. A collaboratively decentralized approach called federated learning unites multiple participants to generate a shareable global optimal model and keeps privacy‐sensitive data on local devices, which may bring great hope to us for solving the problems of decentralized data and privacy protection. Though federated learning has been widely used, few systematic studies have been conducted on the subject of federated learning in data mining. Hence, different from prior reviews in this field, we make a comprehensive summary and provide a novel taxonomy of the application of federated learning in data mining. This article starts by providing a thorough description of the relevant definitions and concepts, followed by an in‐depth investigation on the challenges faced by federated learning. In this context, we elaborate four taxonomies of major applications of federated learning in data mining, including education, healthcare, IoT, and intelligent transportation, and discuss them comprehensively. 
Finally, we discuss four promising research directions for further research, that is, privacy enhancement, improvement of communication efficiency, heterogeneous system processing, and reducing economic costs.","PeriodicalId":48970,"journal":{"name":"Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery","volume":"14 1","pages":""},"PeriodicalIF":7.8,"publicationDate":"2021-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79460335","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
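The aggregation step at the heart of many federated learning systems is federated averaging (FedAvg): the server combines client model parameters weighted by local dataset size, so raw data never leaves the clients. This is a generic sketch of that step with plain lists standing in for model weights; the example updates are invented.

```python
# Minimal FedAvg sketch: the server computes the sample-weighted
# average of client parameter vectors (raw data stays on the clients).
def fed_avg(client_updates):
    """client_updates: list of (params, n_samples) pairs, where params
    is a flat list of model parameters. Returns the weighted average."""
    total = sum(n for _, n in client_updates)
    dim = len(client_updates[0][0])
    return [sum(p[i] * n for p, n in client_updates) / total
            for i in range(dim)]

# Two hypothetical clients: 10 and 30 local samples respectively.
updates = [([1.0, 2.0], 10), ([3.0, 4.0], 30)]
print(fed_avg(updates))  # → [2.5, 3.5]
```

Weighting by sample count keeps the global model unbiased toward small clients; the communication-efficiency work mentioned above largely targets how often and how compactly such updates are exchanged.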
Automated news classification concerns the assignment of news to one or more predefined categories. Automatically classified news helps search engines mine and categorize the type of news that the user asks for. Most researchers have focused on the classification of English news and ignored Arabic news due to the complexity of Arabic morphology. This article presents a novel methodology to classify Arabic news. It relies on feature extraction and the application of machine learning classifiers, namely Naive Bayes (NB), Logistic Regression (LR), Random Forest (RF), Xtreme Gradient Boosting (XGB), K-Nearest Neighbors (KNN), Stochastic Gradient Descent (SGD), Decision Tree (DT), and Multi-Layer Perceptron (MLP). The methodology is applied to the Arabic news dataset provided by Mendeley, and the classification accuracy exceeds 95%.
{"title":"A novel methodology for Arabic news classification","authors":"Marco Alfonse, M. Gawich","doi":"10.1002/widm.1440","DOIUrl":"https://doi.org/10.1002/widm.1440","url":null,"abstract":"The automated news classification concerns the assignment of news to one or more predefined categories. The automated classified news helps the search engines to mine and categorize the type of news that the user asks for. Most of the researchers focused on the classification of English news and ignore the Arabic news due to the complexity of the Arabic morphology. This article presents a novel methodology to classify the Arabic news. It relies on the use of features extraction and the application of machine learning classifiers which are the Naive Bayes (NB), the Logistic Regression (LR), the Random Forest (RF), the Xtreme Gradient Boosting (XGB), the K‐Nearest Neighbors (KNN), the Stochastic Gradient Descent (SGD), the Decision Tree (DT), and the Multi‐Layer Perceptron (MLP). The methodology is applied to the Arabic news dataset provided by Mendeley. The accuracy of the classification is more than 95%.","PeriodicalId":48970,"journal":{"name":"Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery","volume":"81 1","pages":""},"PeriodicalIF":7.8,"publicationDate":"2021-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76038631","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
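Of the classifiers listed, Naive Bayes is simple enough to sketch from scratch. Below is a generic multinomial NB with Laplace smoothing on an invented English toy corpus (the paper itself works on an Arabic dataset with richer feature extraction); it illustrates the bag-of-words classification step only.

```python
# Toy multinomial Naive Bayes text classifier with Laplace smoothing.
# Corpus and categories are invented; this is a generic NB sketch, not
# the article's pipeline.
import math
from collections import Counter, defaultdict

def train(docs):
    """docs: list of (token_list, label). Returns the model tuple."""
    class_docs = defaultdict(int)        # documents per class
    word_counts = defaultdict(Counter)   # word frequencies per class
    vocab = set()
    for tokens, label in docs:
        class_docs[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, word_counts, vocab, len(docs)

def predict(model, tokens):
    class_docs, word_counts, vocab, n = model
    best, best_score = None, -math.inf
    for label, ndocs in class_docs.items():
        total = sum(word_counts[label].values())
        score = math.log(ndocs / n)  # log prior
        for w in tokens:             # log likelihood, Laplace-smoothed
            score += math.log((word_counts[label][w] + 1)
                              / (total + len(vocab)))
        if score > best_score:
            best, best_score = label, score
    return best

docs = [("match goal team win".split(), "sports"),
        ("election vote parliament".split(), "politics"),
        ("league player goal".split(), "sports")]
model = train(docs)
print(predict(model, "goal match".split()))  # → sports
```

For Arabic, the same step would sit after language-specific preprocessing (normalization, stemming or light stemming), which is where much of the morphology-induced difficulty lies.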