Formal concept analysis (FCA) visualizes formal concepts in terms of a concept lattice. Updating the lattice after a change is usually an NP‐problem and consumes considerable time and storage space. Thus, finding an efficient way to update and maintain such lattices is a significant area of interest within FCA and its applications. One vital FCA application is association rule mining (ARM), which aims to generate a lossless, nonredundant, compact association rule basis (AR‐basis). Real‐world data now grow rapidly, which calls for continually updating the existing concept lattice and AR‐basis as the data change. Intuitively, updating and maintaining an existing concept lattice or AR‐basis is much more efficient and consistent than reconstructing it from scratch, particularly for massive data. So far, updating both the concept lattice and the AR‐basis has received little attention, and the few noncomprehensive studies that exist focus only on updating the concept lattice. This article therefore comprehensively introduces the basic knowledge needed to update both concept lattices and AR‐bases, with new illustrations, formalization, and examples. It also reviews and compares notable recent works and explores emerging research trends.
{"title":"A comprehensive review on updating concept lattices and its application in updating association rules","authors":"Ebtesam E. Shemis, Ammar Mohammed","doi":"10.1002/widm.1401","DOIUrl":"https://doi.org/10.1002/widm.1401","PeriodicalId":48970,"journal":{"name":"Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery","volume":"211 1"},"PeriodicalIF":7.8,"publicationDate":"2021-01-05","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"76052101","RegionNum":2,"RegionCategory":"Computer Science"}
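To make the notion of a formal concept from the abstract above concrete, the following minimal Python sketch enumerates every concept of a tiny context by brute force. The context, the object and attribute names, and the helpers `extent`, `intent`, and `formal_concepts` are illustrative assumptions and are not taken from the reviewed article. The exhaustive loop over attribute subsets is exponential and is shown only to suggest why rebuilding a lattice from scratch after each data change becomes costly.

```python
from itertools import combinations

# Hypothetical toy context: each object maps to the set of attributes it has.
CONTEXT = {
    "o1": {"a", "b"},
    "o2": {"a", "c"},
    "o3": {"a", "b", "c"},
}
ATTRIBUTES = {"a", "b", "c"}


def extent(attrs):
    """Return the objects that have every attribute in attrs."""
    return frozenset(o for o, has in CONTEXT.items() if attrs <= has)


def intent(objs):
    """Return the attributes shared by every object in objs."""
    shared = set(ATTRIBUTES)
    for o in objs:
        shared &= CONTEXT[o]
    return frozenset(shared)


def formal_concepts():
    """Enumerate (extent, intent) pairs by closing every attribute subset."""
    concepts = set()
    for size in range(len(ATTRIBUTES) + 1):
        for attrs in combinations(sorted(ATTRIBUTES), size):
            objs = extent(set(attrs))
            # (objs, intent(objs)) is closed, so it is a formal concept.
            concepts.add((objs, intent(objs)))
    return concepts


if __name__ == "__main__":
    for objs, attrs in sorted(formal_concepts(), key=lambda c: -len(c[0])):
        print(sorted(objs), sorted(attrs))
```

For this toy context the sketch finds four concepts, for example the pair ({o1, o3}, {a, b}). An incremental algorithm of the kind the review surveys would instead revise only the concepts affected by a new object or attribute.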
Sayantan Adak, Souvic Chakraborty, Paramtia Das, Mithun Das, A. Dash, Rima Hazra, Binny Mathew, Punyajoy Saha, Soumya Sarkar, Animesh Mukherjee
Artificial Intelligence (AI)‐based systems and applications have pervaded everyday life and make decisions that have a momentous impact on individuals and society. With the staggering growth of online data, often termed the online infosphere, it has become paramount to monitor the infosphere to ensure social good, because AI‐based decisions depend heavily on it. This survey aims to provide a comprehensive review of some of the most important research areas related to the infosphere, focusing on the technical challenges and potential solutions. The survey also outlines some important future directions. We begin by focusing on the collaborative systems that have emerged within the infosphere, with special emphasis on Wikipedia. Next, we demonstrate how the infosphere has been instrumental in the growth of scientific citations and collaborations, thus fuelling interdisciplinary research. Finally, we illustrate issues related to the governance of the infosphere, such as tackling (a) the rise of hateful and abusive behavior and (b) bias and discrimination on online platforms and in news reporting.
{"title":"Mining the online infosphere: A survey","authors":"Sayantan Adak, Souvic Chakraborty, Paramtia Das, Mithun Das, A. Dash, Rima Hazra, Binny Mathew, Punyajoy Saha, Soumya Sarkar, Animesh Mukherjee","doi":"10.1002/widm.1453","DOIUrl":"https://doi.org/10.1002/widm.1453","PeriodicalId":48970,"journal":{"name":"Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery","volume":"9 1"},"PeriodicalIF":7.8,"publicationDate":"2021-01-02","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"88462129","RegionNum":2,"RegionCategory":"Computer Science"}
Krittakom Srijiranon, Narissara Eiamkanitchat, Sakgasit Ramingwong, K. Cosh, L. Ramingwong
Coarse particulate matter (PM10), the inhalable particles with an aerodynamic diameter smaller than 10 micrometers, is one of the major air pollutants that affect human health. Over the previous decade, a number of researchers have applied various data mining techniques to create temporal prediction models. This study reviews and discusses 100 research articles in computer science and environmental science drawn from the Scopus database. The three processes of data mining for predicting PM10 (data preparation, model creation, and model evaluation) are highlighted. The overall process directions of data mining and their outputs are summarized. Additionally, recommendations for future research are identified.
{"title":"Investigation of PM10 prediction utilizing data mining techniques: Analyze by topic","authors":"Krittakom Srijiranon, Narissara Eiamkanitchat, Sakgasit Ramingwong, K. Cosh, L. Ramingwong","doi":"10.1002/widm.1423","DOIUrl":"https://doi.org/10.1002/widm.1423","PeriodicalId":48970,"journal":{"name":"Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery","volume":"64 1"},"PeriodicalIF":7.8,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"90141887","RegionNum":2,"RegionCategory":"Computer Science"}
Question answering has emerged as an intuitive way of querying structured data sources and has seen significant advances over the years. A large body of recent work on question answering over knowledge graphs (KGQA) employs neural network‐based systems. In this article, we provide an overview of these neural network‐based methods for KGQA. We introduce readers to the formalism and the challenges of the task, present the different paradigms and approaches, discuss notable advances, and outline the emerging trends in the field. Through this article, we aim to provide newcomers to the field with a suitable entry point to semantic parsing for KGQA and to help them make informed decisions while creating their own QA systems.
{"title":"Introduction to neural network‐based question answering over knowledge graphs","authors":"Nilesh Chakraborty, Denis Lukovnikov, Gaurav Maheshwari, Priyansh Trivedi, Jens Lehmann, Asja Fischer","doi":"10.1002/widm.1389","DOIUrl":"https://doi.org/10.1002/widm.1389","PeriodicalId":48970,"journal":{"name":"Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery","volume":"26 1"},"PeriodicalIF":7.8,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"82248840","RegionNum":2,"RegionCategory":"Computer Science"}
The availability of a dataset is a critical component of educational data mining (EDM) pipelines. Once the dataset is at hand, the next steps in the research methodology are proper formulation of the research issue, design and implementation of the data analysis pipeline and, finally, presentation of validation results. As the EDM research area keeps growing thanks to the increasing number of available tools and technologies, one critical bottleneck is the lack of a properly documented review of publicly available datasets. This paper aims to present a succinct, yet informative, description of the most used publicly available data sources along with their associated EDM tasks, the algorithms used, experimental results and main findings. We have found that there are three types of data sources: well‐known data sources, datasets used in EDM competitions and standalone EDM datasets. We conclude that the future success of EDM data sources will rely on their ability to manage proposed approaches and their experimental results as a dashboard of benchmarked runs. Under these circumstances, reproducing data analysis pipelines and benchmarking proposed algorithms come within reach of the research community, so that progress in the EDM domain may be achieved much more easily. The most crucial outcome is the possibility of continuously improving existing data analysis pipelines by tackling EDM tasks that rely on publicly available datasets and by benchmarking data analysis pipelines that use open‐source implementations.
{"title":"Review on publicly available datasets for educational data mining","authors":"M. Mihăescu, Paul-Stefan Popescu","doi":"10.1002/widm.1403","DOIUrl":"https://doi.org/10.1002/widm.1403","PeriodicalId":48970,"journal":{"name":"Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery","volume":"10 1"},"PeriodicalIF":7.8,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"75059923","RegionNum":2,"RegionCategory":"Computer Science"}
In many modern applications, the generated data form a dynamic network. These networks are graphs that change over time through a sequence of update operations (node addition, node deletion, edge addition, edge deletion, and edge weight change). In such networks, it is inefficient to recompute the solution of a data mining/machine learning task from scratch after every update operation. Therefore, in recent years, several so‐called dynamical algorithms have been proposed that update the solution instead of computing it from scratch. In this paper, we first formulate this emerging setting and discuss its high‐level algorithmic aspects. Then, we review state‐of‐the‐art dynamical algorithms proposed for several data mining and machine learning tasks, including frequent pattern discovery, betweenness/closeness/PageRank centralities, clustering, classification, and regression.
{"title":"Dynamical algorithms for data mining and machine learning over dynamic graphs","authors":"Mostafa Haghir Chehreghani","doi":"10.1002/widm.1393","DOIUrl":"https://doi.org/10.1002/widm.1393","PeriodicalId":48970,"journal":{"name":"Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery","volume":"20 1"},"PeriodicalIF":7.8,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"79771442","RegionNum":2,"RegionCategory":"Computer Science"}
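The update-instead-of-recompute idea in the abstract above can be illustrated with a deliberately simple task. The sketch below maintains the number of triangles in an undirected graph under edge additions and deletions. Triangle counting is chosen here only as an assumption for illustration; it is not one of the tasks the paper lists, and the names `DynamicTriangleCounter` and `count_from_scratch` are hypothetical.

```python
from collections import defaultdict


class DynamicTriangleCounter:
    """Keep a triangle count current as edges are added or removed."""

    def __init__(self):
        self.adj = defaultdict(set)  # node -> set of neighbours
        self.triangles = 0

    def add_edge(self, u, v):
        if u == v or v in self.adj[u]:
            return  # ignore self-loops and duplicate edges
        # Every common neighbour of u and v closes one new triangle.
        self.triangles += len(self.adj[u] & self.adj[v])
        self.adj[u].add(v)
        self.adj[v].add(u)

    def remove_edge(self, u, v):
        if v not in self.adj[u]:
            return  # edge is not present
        self.adj[u].discard(v)
        self.adj[v].discard(u)
        # Every common neighbour of u and v loses one triangle.
        self.triangles -= len(self.adj[u] & self.adj[v])


def count_from_scratch(adj):
    """Baseline: recount all triangles u < v < w (nodes must be orderable)."""
    total = 0
    for u in adj:
        for v in adj[u]:
            if u < v:
                total += sum(1 for w in adj[u] & adj[v] if w > v)
    return total
```

Each update touches only the neighbourhoods of the two endpoints, whereas the baseline revisits every edge. Comparing the two after a sequence of updates is a convenient sanity check for the incremental version.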
M. Bahri, A. Bifet, J. Gama, Heitor Murilo Gomes, S. Maniu
The significant growth of interconnected Internet‐of‐Things (IoT) devices and the use of social networks, along with the evolution of technology in different domains, have led to a rise in the volume of data generated continuously from multiple systems. Valuable information can be derived from these evolving data streams by applying machine learning. In practice, several critical issues emerge when extracting useful knowledge from these potentially infinite data, mainly because their evolving nature and high arrival rate make it impossible to store them entirely. In this work, we provide a comprehensive survey that discusses the research constraints and the current state of the art in this vibrant area. Moreover, we present an updated overview of the latest contributions proposed for different stream mining tasks, particularly classification, regression, clustering, and frequent patterns.
{"title":"Data stream analysis: Foundations, major tasks and tools","authors":"M. Bahri, A. Bifet, J. Gama, Heitor Murilo Gomes, S. Maniu","doi":"10.1002/widm.1405","DOIUrl":"https://doi.org/10.1002/widm.1405","PeriodicalId":48970,"journal":{"name":"Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery","volume":"22 1"},"PeriodicalIF":7.8,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"72985607","RegionNum":2,"RegionCategory":"Computer Science"}
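The storage constraint mentioned in the abstract above (a stream cannot be kept in full) is commonly met with one-pass summaries. As a minimal sketch, and as an assumption about what such a summary could look like rather than a method taken from the survey, the class below tracks the mean and sample variance of a stream with Welford's online update, using constant memory. The name `RunningStats` is hypothetical.

```python
class RunningStats:
    """One-pass mean and sample variance (Welford's online algorithm)."""

    def __init__(self):
        self.n = 0        # number of items seen so far
        self.mean = 0.0   # running mean
        self._m2 = 0.0    # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new observation into the summary; x is not stored."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance of everything seen so far (0.0 for n < 2)."""
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0


if __name__ == "__main__":
    stats = RunningStats()
    for reading in [2, 4, 4, 4, 5, 5, 7, 9]:  # stand-in for an unbounded stream
        stats.update(reading)
    print(stats.n, stats.mean, stats.variance)
```

Because each item is discarded after `update`, memory use does not grow with the length of the stream. Handling concept drift, which the abstract also raises, would need an additional mechanism such as windowing or decay.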
Wanting Ji, Y. Pang, Xiaoyun Jia, Zhongwei Wang, Feng Hou, Baoyan Song, Mingzhe Liu, Ruili Wang
Feature selection aims to select a feature subset from an original feature set based on a certain evaluation criterion. Since feature selection can achieve efficient feature reduction, it has become a key method for data preprocessing in many data mining tasks. Many feature selection strategies have been developed recently because, in most cases, it is infeasible to obtain an optimal/reduced feature subset by exhaustive search. Among these strategies, fuzzy rough set theory has proved to be an ideal candidate for dealing with uncertain information. This article provides a comprehensive review of fuzzy rough set theory and of two families of feature selection methods built on it, that is, fuzzy rough set‐based feature selection methods and fuzzy rough neural network‐based feature selection methods. We review the publications related to fuzzy rough theory and its applications in feature selection. In addition, the challenges in the two types of feature selection methods are discussed.
{"title":"Fuzzy rough sets and fuzzy rough neural networks for feature selection: A review","authors":"Wanting Ji, Y. Pang, Xiaoyun Jia, Zhongwei Wang, Feng Hou, Baoyan Song, Mingzhe Liu, Ruili Wang","doi":"10.1002/widm.1402","DOIUrl":"https://doi.org/10.1002/widm.1402","PeriodicalId":48970,"journal":{"name":"Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery","volume":"92 1"},"PeriodicalIF":7.8,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"84260105","RegionNum":2,"RegionCategory":"Computer Science"}
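The abstract above notes that exhaustive search over feature subsets is usually infeasible. The sketch below shows one common workaround, a greedy forward search driven by a rough set dependency degree. It is a simplification under stated assumptions: it uses crisp (classical) rough sets on discrete data rather than the fuzzy rough sets the review covers, and the toy table `ROWS` and the functions `dependency` and `greedy_reduct` are hypothetical names introduced for illustration.

```python
from collections import defaultdict

# Hypothetical toy decision table: features x, y, z and decision label d.
ROWS = [
    {"x": 0, "y": 0, "z": 1, "d": "no"},
    {"x": 0, "y": 1, "z": 0, "d": "no"},
    {"x": 1, "y": 0, "z": 1, "d": "yes"},
    {"x": 1, "y": 1, "z": 0, "d": "yes"},
]


def dependency(rows, features, label):
    """Fraction of rows whose equivalence class under `features` has one label."""
    classes = defaultdict(list)
    for row in rows:
        classes[tuple(row[f] for f in features)].append(row[label])
    consistent = sum(len(lbls) for lbls in classes.values() if len(set(lbls)) == 1)
    return consistent / len(rows)


def greedy_reduct(rows, candidates, label):
    """Add the most helpful feature until the full-set dependency is reached."""
    selected = []
    target = dependency(rows, candidates, label)
    while dependency(rows, selected, label) < target:
        best = max(
            (f for f in candidates if f not in selected),
            key=lambda f: dependency(rows, selected + [f], label),
        )
        selected.append(best)
    return selected


if __name__ == "__main__":
    print(greedy_reduct(ROWS, ["x", "y", "z"], "d"))
```

On this table the search stops after picking `x`, which alone determines the label. A fuzzy rough variant would replace the exact equivalence classes with similarity-based memberships so that real-valued features can be handled without discretization.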
Privacy preserving data classification is an important research area in the data mining field. The goal of a privacy preserving classification algorithm is to protect sensitive information as much as possible while providing satisfactory classification accuracy. Differential privacy is a strong privacy guarantee that protects sensitive data stored in a database by controlling the ratio of sensitive information leakage through an ɛ parameter. In this study, we investigate the classification performance of state‐of‐the‐art classification algorithms such as C4.5, Naïve Bayes, One Rule, Bayesian Networks, PART, Ripper, K*, IBk, and Random tree for privacy preserving classification. To preserve the privacy of the data to be classified, we applied the input perturbation technique from differential privacy and observed the relationship between the ɛ parameter values and the accuracy of the classifiers. To the best of our knowledge, this article is the first study that analyzes the performance of well‐known classification algorithms over differentially private data and discovers which datasets are more suitable for privacy preserving classification when input perturbation is applied to provide data privacy. The classification algorithms are compared using differentially private versions of well‐known datasets from the UCI repository. According to the experimental results, as the ɛ parameter value increases, better classification accuracies are achieved at lower privacy levels. When the classifiers are compared, the Naïve Bayes classifier is the most successful method. The ɛ parameter should be greater than or equal to 2 (i.e., ɛ ≥ 2) to achieve satisfactory classification accuracies.
{"title":"Privacy preserving classification over differentially private data","authors":"Ezgi Zorarpacı, S. A. Özel","doi":"10.1002/widm.1399","DOIUrl":"https://doi.org/10.1002/widm.1399","PeriodicalId":48970,"journal":{"name":"Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery","volume":"60 1"},"PeriodicalIF":7.8,"publicationDate":"2020-12-13","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"82716824","RegionNum":2,"RegionCategory":"Computer Science"}
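The trade-off reported in the abstract above, where a larger ɛ gives better accuracy at a lower privacy level, can be illustrated with a small sketch of input perturbation. The abstract does not say which noise mechanism the authors used, so the code assumes the standard Laplace mechanism with scale `sensitivity / epsilon`. The function names and the default sensitivity of 1.0 are hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def perturb(values, epsilon, sensitivity, rng):
    """Add Laplace noise with scale sensitivity / epsilon to every value."""
    scale = sensitivity / epsilon
    return [v + laplace_noise(scale, rng) for v in values]


def mean_abs_error(values, epsilon, sensitivity=1.0, seed=0):
    """Average distortion that perturbation introduces at a given epsilon."""
    rng = random.Random(seed)  # fixed seed keeps the comparison repeatable
    noisy = perturb(values, epsilon, sensitivity, rng)
    return sum(abs(a - b) for a, b in zip(values, noisy)) / len(values)


if __name__ == "__main__":
    data = [0.0] * 2000  # stand-in for one normalized numeric attribute
    for eps in (0.5, 1.0, 2.0, 4.0):
        print(eps, round(mean_abs_error(data, eps), 3))
```

The printed distortion shrinks as ɛ grows, which is consistent with classifiers trained on the perturbed data recovering accuracy at larger ɛ. Choosing the sensitivity correctly for each attribute is essential for the privacy guarantee and is outside the scope of this sketch.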
This paper applies several machine learning (ML) models to a massive e‐commerce dataset from Amazon to analyze and predict ratings and to recommend products. For this purpose, we have used both traditional and Big Data algorithms. Because the Amazon product review dataset is large, we present a Big Data architecture suited to storing and processing the massive dataset, which is not possible with the traditional architecture. The dataset contains 15 attributes and about 7 million records. With this dataset, we develop several models in Oracle Big Data and Azure Cloud Computing services to predict review ratings and recommend items at Amazon. We present a comparative conclusion on accuracy and efficiency between Spark ML (the Big Data architecture) and Azure ML (the traditional architecture).
{"title":"Predicting the ratings of Amazon products using Big Data","authors":"Jongwook Woo, Monika Mishra","doi":"10.1002/widm.1400","DOIUrl":"https://doi.org/10.1002/widm.1400","PeriodicalId":48970,"journal":{"name":"Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery","volume":"1 1"},"PeriodicalIF":7.8,"publicationDate":"2020-12-12","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"82931124","RegionNum":2,"RegionCategory":"Computer Science"}