Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery: Latest Articles

A comprehensive review on updating concept lattices and its application in updating association rules

IF 7.8 · Zone 2 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery

Pub Date : 2021-01-05 DOI: 10.1002/widm.1401

Ebtesam E. Shemis, Ammar Mohammed

Formal concept analysis (FCA) visualizes formal concepts in terms of a concept lattice. Updating the lattice to reflect changes is usually an NP-problem that consumes considerable time and storage space. Thus, introducing an efficient way to update and maintain such lattices is a significant area of interest within the field of FCA and its applications. One of those vital FCA applications is association rule mining (ARM), which aims at generating a lossless, nonredundant, compact association rule basis (AR-basis). Real-world data now grow so rapidly that the existing concept lattice and AR-basis must be updated continually as the data change. Intuitively, updating and maintaining an existing concept lattice or AR-basis is much more efficient and consistent than reconstructing it from scratch, particularly in the case of massive data. So far, the area of updating both the concept lattice and the AR-basis has not received much attention, and the few existing studies are not comprehensive and focus only on updating the concept lattice. This article therefore comprehensively introduces basic knowledge regarding updating both concept lattices and AR-bases with new illustrations, formalization, and examples. The article also reviews and compares recent notable works and explores emerging future research trends.
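
As a rough illustration of the objects this abstract refers to, the sketch below enumerates the formal concepts (extent, intent pairs) of a tiny formal context by brute force. It is not the update procedure surveyed in the article, which the abstract does not detail. The helper names `extent`, `intent`, and `concepts` and the dictionary layout of the context are assumptions made for this example.

```python
from itertools import combinations


def extent(context, attrs):
    """Objects that have every attribute in attrs."""
    return frozenset(o for o, has in context.items() if attrs <= has)


def intent(context, objs, all_attrs):
    """Attributes shared by every object in objs (all attributes if objs is empty)."""
    shared = set(all_attrs)
    for o in objs:
        shared &= context[o]
    return frozenset(shared)


def concepts(context):
    """Enumerate all formal concepts by closing every attribute subset.

    context maps each object to the set of attributes it has.
    The loop visits 2**|attributes| subsets, so this only suits toy inputs.
    """
    all_attrs = frozenset().union(*context.values())
    found = set()
    for r in range(len(all_attrs) + 1):
        for subset in combinations(sorted(all_attrs), r):
            e = extent(context, frozenset(subset))
            found.add((e, intent(context, e, all_attrs)))
    return found
```

Because the enumeration is exponential in the number of attributes, rerunning it after every change to the context is expensive, which is the motivation the abstract gives for updating an existing lattice instead of rebuilding it.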

Citations: 10

Mining the online infosphere: A survey

IF 7.8 · Zone 2 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery

Pub Date : 2021-01-02 DOI: 10.1002/widm.1453

Sayantan Adak, Souvic Chakraborty, Paramtia Das, Mithun Das, A. Dash, Rima Hazra, Binny Mathew, Punyajoy Saha, Soumya Sarkar, Animesh Mukherjee

Artificial Intelligence (AI)-based systems and applications have pervaded everyday life, making decisions that have a momentous impact on individuals and society. With the staggering growth of online data, often termed the online infosphere, it has become paramount to monitor the infosphere to ensure social good, as AI-based decisions depend heavily on it. This survey aims to provide a comprehensive review of some of the most important research areas related to the infosphere, focusing on the technical challenges and potential solutions. The survey also outlines some of the important future directions. We begin by focussing on the collaborative systems that have emerged within the infosphere, with a special thrust on Wikipedia. In the follow-up, we demonstrate how the infosphere has been instrumental in the growth of scientific citations and collaborations, thus fuelling interdisciplinary research. Finally, we illustrate the issues related to the governance of the infosphere, such as tackling (a) rising hateful and abusive behavior and (b) bias and discrimination on different online platforms and in news reporting.

Citations: 1

Investigation of PM10 prediction utilizing data mining techniques: Analyze by topic

IF 7.8 · Zone 2 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery

Pub Date : 2021-01-01 DOI: 10.1002/widm.1423

Krittakom Srijiranon, Narissara Eiamkanitchat, Sakgasit Ramingwong, K. Cosh, L. Ramingwong

Coarse particulate matter (PM10), inhalable particles with an aerodynamic diameter smaller than 10 micrometers, is one of the major air pollutants that affect human health. Over the previous decade, a number of researchers have applied various data mining techniques to create temporal prediction models. This study reviews and discusses 100 research articles in computer science and environmental science from the Scopus database. The three processes of data mining, namely data preparation, model creation, and model evaluation, are highlighted for PM10 prediction. A summary of the overall process directions of data mining as well as their outputs is presented. Additionally, recommendations for future research are identified.

Citations: 1

Introduction to neural network-based question answering over knowledge graphs

IF 7.8 · Zone 2 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery

Pub Date : 2021-01-01 DOI: 10.1002/widm.1389

Nilesh Chakraborty, Denis Lukovnikov, Gaurav Maheshwari, Priyansh Trivedi, Jens Lehmann, Asja Fischer

Question answering has emerged as an intuitive way of querying structured data sources and has attracted significant advancements over the years. A large body of recent work on question answering over knowledge graphs (KGQA) employs neural network‐based systems. In this article, we provide an overview of these neural network‐based methods for KGQA. We introduce readers to the formalism and the challenges of the task, different paradigms and approaches, discuss notable advancements, and outline the emerging trends in the field. Through this article, we aim to provide newcomers to the field with a suitable entry point to semantic parsing for KGQA, and ease their process of making informed decisions while creating their own QA systems.

Citations: 24

Review on publicly available datasets for educational data mining

IF 7.8 · Zone 2 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery

Pub Date : 2021-01-01 DOI: 10.1002/widm.1403

M. Mihăescu, Paul-Stefan Popescu

The availability of a dataset is a critical component of educational data mining (EDM) pipelines. Once the dataset is at hand, the next steps in the research methodology are proper formulation of the research issue, design and implementation of the data analysis pipeline and, finally, presentation of validation results. As the EDM research area grows continuously with the increasing number of available tools and technologies, one critical bottleneck is the lack of a properly documented review of publicly available datasets. This paper aims to present a succinct, yet informative, description of the most used publicly available data sources along with their associated EDM tasks, algorithms used, experimental results and main findings. We have found that there are three types of data sources: well-known data sources, datasets used in EDM competitions and standalone EDM datasets. We conclude that the future success of EDM data sources will rely on their ability to manage proposed approaches and their experimental results as a dashboard of benchmarked runs. Under these circumstances, reproducibility of data analysis pipelines and benchmarking of proposed algorithms come within reach of the research community, so that progress in the EDM domain may be achieved much more easily. The most crucial outcome is the possibility of continuously improving existing data analysis pipelines by tackling EDM tasks that rely on publicly available datasets and by benchmarking data analysis pipelines that use open-source implementations.

Citations: 20

Dynamical algorithms for data mining and machine learning over dynamic graphs

IF 7.8 · Zone 2 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery

Pub Date : 2021-01-01 DOI: 10.1002/widm.1393

Mostafa Haghir Chehreghani

In many modern applications, the generated data is a dynamic network. These networks are graphs that change over time by a sequence of update operations (node addition, node deletion, edge addition, edge deletion, and edge weight change). In such networks, it is inefficient to compute the solution of a data mining/machine learning task from scratch after every update operation. Therefore, in recent years, several so-called dynamical algorithms have been proposed that update the solution instead of computing it from scratch. In this paper, we first formulate this emerging setting and discuss its high-level algorithmic aspects. Then, we review state-of-the-art dynamical algorithms proposed for several data mining and machine learning tasks, including frequent pattern discovery, betweenness/closeness/PageRank centralities, clustering, classification, and regression.
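
To make the "update instead of recompute" idea concrete, here is a minimal sketch that maintains the triangle count of an undirected graph under edge additions and deletions. Triangle counting is chosen only because it is short; it is not one of the tasks listed in the abstract, and the class name `DynamicTriangleCounter` is an assumption made for this example.

```python
from collections import defaultdict


class DynamicTriangleCounter:
    """Maintains the number of triangles in an undirected simple graph."""

    def __init__(self):
        self.adj = defaultdict(set)  # node -> set of neighbours
        self.triangles = 0

    def add_edge(self, u, v):
        # Ignore self-loops and edges that already exist.
        if u == v or v in self.adj[u]:
            return
        # Every common neighbour of u and v closes one new triangle.
        self.triangles += len(self.adj[u] & self.adj[v])
        self.adj[u].add(v)
        self.adj[v].add(u)

    def remove_edge(self, u, v):
        if v not in self.adj[u]:
            return
        self.adj[u].remove(v)
        self.adj[v].remove(u)
        # Every common neighbour of u and v loses the triangle through (u, v).
        self.triangles -= len(self.adj[u] & self.adj[v])
```

Each update costs one set intersection over the two endpoints' neighbourhoods, whereas recounting from scratch would revisit the whole graph after every change.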

Citations: 0

Data stream analysis: Foundations, major tasks and tools

IF 7.8 · Zone 2 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery

Pub Date : 2021-01-01 DOI: 10.1002/widm.1405

M. Bahri, A. Bifet, J. Gama, Heitor Murilo Gomes, S. Maniu

The significant growth of interconnected Internet‐of‐Things (IoT) devices, the use of social networks, along with the evolution of technology in different domains, lead to a rise in the volume of data generated continuously from multiple systems. Valuable information can be derived from these evolving data streams by applying machine learning. In practice, several critical issues emerge when extracting useful knowledge from these potentially infinite data, mainly because of their evolving nature and high arrival rate which implies an inability to store them entirely. In this work, we provide a comprehensive survey that discusses the research constraints and the current state‐of‐the‐art in this vibrant framework. Moreover, we present an updated overview of the latest contributions proposed in different stream mining tasks, particularly classification, regression, clustering, and frequent patterns.
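
The constraint that a stream cannot be stored in full can be illustrated with a one-pass summary. The sketch below uses Welford's online algorithm to track the mean and sample variance of a stream in constant memory. It is a generic example rather than a method taken from the survey, and the class name `RunningStats` is an assumption.

```python
class RunningStats:
    """One-pass mean and sample variance (Welford's algorithm), O(1) memory."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new stream element into the summary."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance; 0.0 until at least two elements have been seen."""
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0
```

Each arriving element is processed once and then discarded, so memory use does not grow with the length of the stream.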

Citations: 56

Fuzzy rough sets and fuzzy rough neural networks for feature selection: A review

IF 7.8 · Zone 2 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery

Pub Date : 2021-01-01 DOI: 10.1002/widm.1402

Wanting Ji, Y. Pang, Xiaoyun Jia, Zhongwei Wang, Feng Hou, Baoyan Song, Mingzhe Liu, Ruili Wang

Feature selection aims to select a feature subset from an original feature set based on a certain evaluation criterion. Since feature selection can achieve efficient feature reduction, it has become a key method for data preprocessing in many data mining tasks. Recently, many feature selection strategies have been developed since in most cases it is infeasible to obtain an optimal/reduced feature subset by using exhaustive search. Among these strategies, fuzzy rough set theory has proved to be an ideal candidate for dealing with uncertain information. This article provides a comprehensive review on the fuzzy rough set theory and two fuzzy rough set theory based feature selection methods, that is, fuzzy rough set based feature selection methods and fuzzy rough neural network based feature selection methods. We review the publications related to the fuzzy rough theory and its applications in feature selection. In addition, the challenges in the two types of feature selection methods are also discussed.

Citations: 29

Privacy preserving classification over differentially private data

IF 7.8 · Zone 2 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery

Pub Date : 2020-12-13 DOI: 10.1002/widm.1399

Ezgi Zorarpacı, S. A. Özel

Privacy preserving data classification is an important research area in the data mining field. The goal of a privacy preserving classification algorithm is to protect the sensitive information as much as possible, while providing satisfactory classification accuracy. Differential privacy is a strong privacy guarantee that enables privacy of sensitive data stored in a database by determining the ratio of sensitive information leakage with respect to an ɛ parameter. In this study, our aim is to investigate the classification performance of state-of-the-art classification algorithms such as C4.5, Naïve Bayes, One Rule, Bayesian Networks, PART, Ripper, K*, IBk, and Random tree for performing privacy preserving classification. To preserve the privacy of the data to be classified, we applied an input perturbation technique from differential privacy and observed the relationship between the ɛ parameter values and the accuracy of the classifiers. To the best of our knowledge, this article is the first study that analyzes the performance of well-known classification algorithms over differentially private data, and discovers which datasets are more suitable for privacy preserving classification when input perturbation is applied to provide data privacy. The classification algorithms are compared by using differentially private versions of well-known datasets from the UCI repository. According to the experimental results, we observed that, as the ɛ parameter value increases, better classification accuracies are achieved at the cost of lower privacy levels. When the classifiers are compared, the Naïve Bayes classifier is the most successful method. The ɛ parameter should be greater than or equal to 2 (i.e., ɛ ≥ 2) to achieve satisfactory classification accuracies.
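
The abstract does not spell out the perturbation procedure, so the following is only a sketch of one common form of input perturbation: adding Laplace noise with scale (range / ɛ) to each numeric value after clipping it to a known range. The function names `laplace_noise` and `perturb`, and the choice to clip to `[lo, hi]`, are assumptions made for this example and not the authors' implementation.

```python
import random


def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) sample as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def perturb(values, epsilon, lo, hi, rng=None):
    """Return noisy copies of numeric values, each clipped to [lo, hi] first.

    Assumption: every value is treated as a single record whose sensitivity
    is the attribute range (hi - lo), so the noise scale is (hi - lo) / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    rng = rng or random.Random()
    scale = (hi - lo) / epsilon
    return [min(max(v, lo), hi) + laplace_noise(scale, rng) for v in values]
```

A smaller ɛ yields a larger noise scale, which is consistent with the trade-off reported in the abstract: classifiers trained on the perturbed data become more accurate as ɛ grows, while the privacy level drops.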

Citations: 11

Predicting the ratings of Amazon products using Big Data

IF 7.8 · Zone 2 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery

Pub Date : 2020-12-12 DOI: 10.1002/widm.1400

Jongwook Woo, Monika Mishra

This paper applies several machine learning (ML) models to a massive e-commerce dataset from Amazon to analyze and predict ratings and to recommend products. For this purpose, we use both traditional and Big Data algorithms. Because the Amazon product review dataset is large, we present a Big Data architecture suited to storing and processing massive datasets, which is not possible with the traditional architecture. The dataset contains 15 attributes and about 7 million records. With the dataset, we develop several models in Oracle Big Data and Azure Cloud Computing services to predict the review rating and recommend items at Amazon. We present a comparative conclusion in terms of accuracy and efficiency between Spark ML (the Big Data architecture) and Azure ML (the traditional architecture).

Citations: 7
