
Latest publications from the 2016 IEEE 32nd International Conference on Data Engineering (ICDE)

Mining temporal patterns in interval-based data
Pub Date: 2016-05-16 | DOI: 10.1109/ICDE.2016.7498397
Yi-Cheng Chen, Wen-Chih Peng, Suh-Yin Lee
Sequential pattern mining is an important subfield of data mining. Recently, discovering patterns from interval events has attracted considerable attention due to its widespread applications. However, due to the complex relations between two intervals, mining interval-based sequences efficiently is a challenging issue. In this paper, we develop a novel algorithm, P-TPMiner, to efficiently discover two types of interval-based sequential patterns. Several pruning techniques are proposed to further reduce the search space of the mining process. Experimental studies show that the proposed algorithm is efficient and scalable. Furthermore, we apply the proposed method to real datasets to demonstrate the practicability of the discussed patterns.
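The "complex relations between two intervals" are the temporal relations of Allen's interval algebra, which any interval-based miner must distinguish. A minimal sketch that classifies the relation between two intervals (half-open, assuming start < end); this illustrates the relation taxonomy only, not the paper's pattern representation:

```python
def allen_relation(a, b):
    """Classify the temporal relation between intervals a=(s1,e1), b=(s2,e2).

    Returns one of Allen's seven base relations, or the inverse name when
    the relation holds with a and b swapped. Assumes s < e for each interval.
    """
    s1, e1 = a
    s2, e2 = b
    if e1 < s2:
        return "before"
    if e1 == s2:
        return "meets"
    if s1 == s2 and e1 == e2:
        return "equals"
    if s1 == s2 and e1 < e2:
        return "starts"
    if e1 == e2 and s2 < s1:
        return "finishes"
    if s2 < s1 and e1 < e2:
        return "during"
    if s1 < s2 < e1 < e2:
        return "overlaps"
    # Otherwise the relation holds in the opposite direction.
    return allen_relation(b, a) + "-inverse"
```

With the inverses this yields the thirteen distinct cases an interval-based sequence representation has to encode.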
Citations: 20
Event regularity and irregularity in a time unit
Pub Date: 2016-05-16 | DOI: 10.1109/ICDE.2016.7498302
Lijian Wan, Tingjian Ge
In this paper, we study the problem of learning a regular model from a number of sequences, each of which contains the events of one time unit. Assuming some regularity in such sequences, we determine which events should be deemed irregular in their contexts. We perform an in-depth analysis of the model we build and propose two optimization techniques, one of which is of independent interest because it solves a new problem we name the Group Counting problem. Our comprehensive experiments on real and hybrid datasets show that the model we build is very effective in characterizing regularities and identifying irregular events. One of our optimizations improves model-building speed by more than an order of magnitude, and the other significantly reduces space consumption.
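For intuition only (the paper learns a far richer probabilistic model), the task can be reduced to a frequency baseline: an event occurrence is irregular in its time unit when the event fails a support threshold across all units. The function name and threshold below are invented for the sketch:

```python
from collections import Counter

def irregular_events(sequences, min_support=0.5):
    """Flag event occurrences as irregular when the event appears in
    fewer than `min_support` of the time-unit sequences.

    `sequences` is a list of event sets, one per time unit. This is a
    deliberately simple baseline, not the paper's regular model.
    """
    n = len(sequences)
    counts = Counter(e for seq in sequences for e in set(seq))
    regular = {e for e, c in counts.items() if c / n >= min_support}
    # Whatever is not regular overall is flagged in each unit it occurs in.
    return [{e for e in seq if e not in regular} for seq in sequences]
```

A context-aware model improves on this by conditioning on which other events co-occur in the unit rather than on global frequency alone.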
Citations: 2
Differentially private multi-party high-dimensional data publishing
Pub Date: 2016-05-16 | DOI: 10.1109/ICDE.2016.7498241
Sen Su, Peng Tang, Xiang Cheng, R. Chen, Zequn Wu
In this paper, we study the novel problem of publishing high-dimensional data in a distributed multi-party environment under differential privacy. In particular, with the assistance of a semi-trusted curator, the involved parties (i.e., local data owners) collectively generate a synthetic integrated dataset while satisfying ε-differential privacy for every local dataset. To solve this problem, we present a differentially private sequential update of Bayesian networks (DP-SUBN) solution. In DP-SUBN, the parties and the curator collaboratively identify, in a sequential manner, the Bayesian network ℕ that best fits the integrated dataset D, from which a synthetic dataset can then be generated. The fundamental advantage of this sequential update manner is that the parties can treat the statistical results provided by previous parties as prior knowledge that directs how to learn ℕ. The core of DP-SUBN is the construction of the search frontier, which can be seen as a priori knowledge guiding the parties in updating ℕ. To improve the fitness of ℕ and reduce the communication cost, we introduce a correlation-aware search frontier construction (CSFC) approach, where attribute pairs with strong correlations are used to construct the search frontier. In particular, to privately quantify the correlations of attribute pairs without introducing too much noise, we first propose a non-overlapping covering design (NOCD) method, and then introduce a dynamic programming method for finding the optimal parameters of NOCD that minimize the injected noise. Through formal privacy analysis, we show that DP-SUBN satisfies ε-differential privacy for any local dataset. Extensive experiments on a real dataset demonstrate that DP-SUBN offers desirable data utility with low communication cost.
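The noise injection that such protocols minimize is usually the standard Laplace mechanism: a count query has L1-sensitivity 1, so adding Laplace(1/ε) noise makes its release ε-differentially private. A generic sketch of that building block (not the DP-SUBN protocol itself):

```python
import random

def laplace_count(true_count, epsilon):
    """Release a count under ε-differential privacy via the Laplace
    mechanism. A count query has L1-sensitivity 1, so Laplace(1/ε)
    noise suffices; a Laplace(0, b) draw equals the difference of two
    independent Exponential draws with mean b, i.e. rate 1/b = ε.
    """
    noise = random.expovariate(epsilon) - random.expovariate(epsilon)
    return true_count + noise
```

Smaller ε means stronger privacy and larger noise (variance 2/ε²), which is why minimizing the number of noisy queries, as CSFC and NOCD aim to do, directly improves utility.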
Citations: 34
An embedding approach to anomaly detection
Pub Date: 2016-05-16 | DOI: 10.1109/ICDE.2016.7498256
Renjun Hu, C. Aggarwal, Shuai Ma, J. Huai
Network anomaly detection has become very popular in recent years because of the importance of discovering key regions of structural inconsistency in the network. In addition to application-specific information carried by anomalies, the presence of such structural inconsistency is often an impediment to the effective application of data mining algorithms such as community detection and classification. In this paper, we study the problem of detecting structurally inconsistent nodes that connect to a number of diverse influential communities in large social networks. We show that the use of a network embedding approach, together with a novel dimension reduction technique, is an effective tool to discover such structural inconsistencies. We also experimentally show that the detection of such anomalous nodes has significant applications: one is the specific use of detected anomalies, and the other is the improvement of the effectiveness of community detection.
Citations: 59
Being prepared in a sparse world: The case of KNN graph construction
Pub Date: 2016-05-16 | DOI: 10.1109/ICDE.2016.7498244
A. Boutet, Anne-Marie Kermarrec, Nupur Mittal, François Taïani
K-Nearest-Neighbor (KNN) graphs have emerged as a fundamental building block of many online services providing recommendation, similarity search, and classification. Constructing a KNN graph rapidly and accurately is, however, a computationally intensive task. As data volumes keep growing, speed and the ability to scale out are becoming critical factors when deploying a KNN algorithm. In this work, we present KIFF, a generic, fast, and scalable KNN graph construction algorithm. KIFF directly exploits the bipartite nature of most datasets to which KNN algorithms are applied. This simple but powerful strategy drastically limits the computational cost required to converge rapidly to an accurate KNN solution, especially for sparse datasets. Our evaluation on a representative range of datasets shows that KIFF provides, on average, a speed-up factor of 14 over recent state-of-the-art solutions while improving the quality of the KNN approximation by 18%.
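For contrast with KIFF's approach, the naive construction compares every pair of nodes. A brute-force sketch over item-set profiles with Jaccard similarity, quadratic in the number of nodes, which is exactly the cost scalable constructions try to avoid (names are illustrative):

```python
from heapq import nlargest

def knn_graph(profiles, k=2):
    """Brute-force KNN graph: for each node, keep the k most similar
    other nodes under Jaccard similarity. `profiles` maps node -> set
    of items (e.g. users -> rated items in a bipartite dataset).
    O(n^2) pairwise comparisons; the baseline, not KIFF.
    """
    def jaccard(a, b):
        union = a | b
        return len(a & b) / len(union) if union else 0.0

    return {
        u: [v for _, v in nlargest(
            k, [(jaccard(pu, pv), v) for v, pv in profiles.items() if v != u])]
        for u, pu in profiles.items()
    }
```

Exploiting the bipartite structure means candidates can be restricted to nodes sharing at least one item, which prunes most of the zero-similarity comparisons this baseline wastes time on.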
Citations: 26
OptImatch: Semantic web system for query problem determination
Pub Date: 2016-05-16 | DOI: 10.1109/ICDE.2016.7498338
Guilherme Damasio, Piotr Mierzejewski, Jaroslaw Szlichta, C. Zuzarte
Query performance problem determination is usually performed by analyzing query execution plans (QEPs). Analyzing complex QEPs is excessively time consuming, and existing automatic problem determination tools do not provide the ability to perform analysis with flexible user-defined problem patterns. We present the novel OptImatch system, which allows a relatively naive user to search for patterns in QEPs and get recommendations from an expert- and user-customizable knowledge base. Our system transforms a QEP into an RDF graph. We provide a web graphical interface for the user to describe a pattern, which is transformed with handlers into a SPARQL query. The SPARQL query is matched against the abstracted RDF graph, and any matched parts of the graph are relayed back to the user. With the knowledge base, the system automatically matches stored patterns to QEPs by adapting dynamic context through a purpose-built tagging language, and it ranks recommendations using statistical correlation analysis.
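The core matching step, evaluating a SPARQL basic graph pattern against the QEP-as-RDF graph, can be mimicked without an RDF stack. A toy matcher over (subject, predicate, object) triples with '?'-prefixed variables; the operator names in the example are invented, not OptImatch's actual vocabulary:

```python
def match_pattern(triples, pattern):
    """Evaluate a SPARQL-style basic graph pattern against a list of
    (subject, predicate, object) triples. Terms starting with '?' are
    variables; the result is a list of consistent variable bindings.
    """
    def unify(tp, triple, env):
        out = dict(env)
        for term, value in zip(tp, triple):
            if term.startswith("?"):
                if out.get(term, value) != value:
                    return None  # conflicting binding for this variable
                out[term] = value
            elif term != value:
                return None      # constant term does not match
        return out

    bindings = [{}]
    for tp in pattern:
        bindings = [env2 for env in bindings for t in triples
                    if (env2 := unify(tp, t, env)) is not None]
    return bindings
```

A pattern like "a hash join fed by a table scan" then becomes three triple patterns sharing variables, and each returned binding points at a concrete problem site in the plan.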
Citations: 6
Scalable data management: NoSQL data stores in research and practice
Pub Date: 2016-05-16 | DOI: 10.1109/ICDE.2016.7498360
Felix Gessert, N. Ritter
The unprecedented scale at which data is consumed and generated today has created a large demand for scalable data management and given rise to non-relational, distributed "NoSQL" database systems. Two central problems triggered this development: 1) the vast amounts of user-generated content in modern applications and the resulting request loads and data volumes, and 2) the desire of the developer community to employ problem-specific data models for storage and querying. To address these needs, various data stores have been developed by both industry and research, on the premise that the era of one-size-fits-all database systems is over. The heterogeneity and sheer number of these systems - now commonly referred to as NoSQL data stores - make it increasingly difficult to select the most appropriate system for a given application. Therefore, these systems are frequently combined in polyglot persistence architectures that leverage each system in its respective sweet spot. This tutorial gives an in-depth survey of the most relevant NoSQL databases, providing a comparative classification and highlighting open challenges. To this end, we analyze the approach of each system to derive its scalability, availability, consistency, data modeling, and querying characteristics. We show how each system's design is governed by a central set of trade-offs over irreconcilable system properties. We then cover recent research results in distributed data management to illustrate that some shortcomings of NoSQL systems can already be solved in practice, whereas other NoSQL data management problems pose interesting and unsolved research challenges.
Citations: 20
PurTreeClust: A purchase tree clustering algorithm for large-scale customer transaction data
Pub Date: 2016-05-16 | DOI: 10.1109/ICDE.2016.7498279
Xiaojun Chen, J. Huang, Jun Luo
Clustering customer transaction data is usually an important procedure for analyzing customer behavior in retail and e-commerce companies. Note that a company's products are often organized as a product tree, in which the leaf nodes are goods to sell and the internal nodes (except the root) are product categories. Based on this tree, we propose to use a "personalized product tree", called a purchase tree, to represent a customer's transaction data; a customer transaction data set can then be represented as a set of purchase trees. We propose the PurTreeClust algorithm for clustering large numbers of customers from their purchase trees. We define a new distance metric to effectively compute the distance between two purchase trees across all levels of the trees. A cover tree is then built for indexing the purchase tree data, and we propose a leveled density estimation method for selecting initial cluster centers from the cover tree. PurTreeClust, a fast clustering method for large-scale purchase trees, is then presented. Last, we propose a gap-statistic-based method for estimating the number of clusters from the purchase tree clustering results. A series of experiments was conducted on ten large-scale transaction data sets containing up to four million transaction records, and the results verify the effectiveness and efficiency of the proposed method. We also compared our method with three clustering algorithms, namely spectral clustering, hierarchical agglomerative clustering, and DBSCAN. The experimental results demonstrate the superior performance of the proposed method.
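A level-aware tree distance can be sketched by comparing the node sets of the two trees level by level. The version below averages Jaccard distances with equal level weights; it is a simplified stand-in for the paper's distance, which weights levels, not a reproduction of it:

```python
def purchase_tree_distance(t1, t2):
    """Level-averaged Jaccard distance between two purchase trees, each
    given as a list of node sets per level (level 0 = top categories).

    Simplifications vs. the paper: equal level weights, and trees are
    compared over their common levels only (zip truncates).
    """
    dists = []
    for l1, l2 in zip(t1, t2):
        union = l1 | l2
        # Jaccard distance at this level; two empty levels count as identical.
        dists.append(1 - len(l1 & l2) / len(union) if union else 0.0)
    return sum(dists) / len(dists) if dists else 0.0
```

Because the top levels capture broad category preferences, weighting them differently from leaf levels is the natural refinement this sketch omits.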
Citations: 13
Scaling up truth discovery
Pub Date: 2016-05-16 | DOI: 10.1109/ICDE.2016.7498359
Laure Berti-Équille
The evolution of the Web from a technology platform to a social ecosystem has resulted in unprecedented data volumes being continuously generated, exchanged, and consumed. User-generated content on the Web is massive, highly dynamic, and characterized by a combination of factual data and opinion data. False information, rumors, and fake content can easily spread across multiple sources, making it hard to distinguish between what is true and what is not. Truth discovery (also known as fact-checking) has recently gained a lot of interest from Data Science communities. This tutorial will attempt to cover recent work on truth finding and how it can scale to Big Data. We will provide a broad overview with new insights, highlighting the progress made on truth discovery from information extraction, data and knowledge fusion, as well as the modeling of misinformation dynamics in social networks. We will review in detail current models, algorithms, and techniques proposed by various research communities whose contributions converge towards the same goal of estimating the veracity of data in a dynamic world. Our aim is to bridge theory and practice and to introduce recent work from diverse disciplines to database people, so that they are better equipped to address the challenges of truth discovery in Big Data.
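Many of the surveyed approaches share an iterative skeleton: estimate object truths from source-weighted votes, then re-estimate each source's reliability from its agreement with those truths. A minimal voting sketch in that spirit (not any specific published algorithm; ties fall to the first-seen claim):

```python
def truth_discovery(claims, iters=10):
    """Iteratively estimate source trust and object truths from
    conflicting claims. `claims` maps (source, object) -> claimed value.

    Each round: a value's score is the sum of its claimants' trust, the
    top-scored value becomes the current truth, and a source's trust is
    the fraction of its claims matching current truths.
    """
    sources = {s for s, _ in claims}
    trust = {s: 1.0 for s in sources}
    truths = {}
    for _ in range(iters):
        votes = {}
        for (s, o), v in claims.items():
            votes.setdefault(o, {}).setdefault(v, 0.0)
            votes[o][v] += trust[s]
        truths = {o: max(vs, key=vs.get) for o, vs in votes.items()}
        for s in sources:
            mine = [(o, v) for (s2, o), v in claims.items() if s2 == s]
            agree = sum(1 for o, v in mine if truths[o] == v)
            trust[s] = agree / len(mine)
    return truths, trust
```

Unreliable sources lose weight round by round, so a minority claim backed by consistently accurate sources can still beat a popular but poorly sourced one, which is the key advantage over plain majority voting.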
Cited by: 4
Finding the minimum spatial keyword cover
Pub Date : 2016-05-16 DOI: 10.1109/ICDE.2016.7498281
Dong-Wan Choi, J. Pei, Xuemin Lin
The existing works on spatial keyword search focus on finding a group of spatial objects covering all the query keywords and minimizing the diameter of the group. However, we observe that such a formulation may not address what users need in some application scenarios. In this paper, we introduce a novel spatial keyword cover problem (SK-COVER for short), which aims to identify the group of spatio-textual objects covering all keywords in a query and minimizing a distance cost function that leads to fewer proximate objects in the answer set. We prove that SK-COVER is not only NP-hard but also does not allow an approximation better than O(log m) in polynomial time, where m is the number of query keywords. We establish an O(log m)-approximation algorithm, which is asymptotically optimal in terms of the approximability of SK-COVER. Furthermore, we devise effective accessing strategies and pruning rules to improve the overall efficiency and scalability. In addition to our algorithmic results, we empirically show that our approximation algorithm always achieves the best accuracy, and the efficiency of our algorithm is comparable to a state-of-the-art algorithm that is intended for mCK, a problem similar to yet theoretically easier than SK-COVER.
{"title":"Finding the minimum spatial keyword cover","authors":"Dong-Wan Choi, J. Pei, Xuemin Lin","doi":"10.1109/ICDE.2016.7498281","DOIUrl":"https://doi.org/10.1109/ICDE.2016.7498281","url":null,"abstract":"The existing works on spatial keyword search focus on finding a group of spatial objects covering all the query keywords and minimizing the diameter of the group. However, we observe that such a formulation may not address what users need in some application scenarios. In this paper, we introduce a novel spatial keyword cover problem (SK-COVER for short), which aims to identify the group of spatio-textual objects covering all keywords in a query and minimizing a distance cost function that leads to fewer proximate objects in the answer set. We prove that SK-COVER is not only NP-hard but also does not allow an approximation better than O(log m) in polynomial time, where m is the number of query keywords. We establish an O(log m)-approximation algorithm, which is asymptotically optimal in terms of the approximability of SK-COVER. Furthermore, we devise effective accessing strategies and pruning rules to improve the overall efficiency and scalability. 
In addition to our algorithmic results, we empirically show that our approximation algorithm always achieves the best accuracy, and the efficiency of our algorithm is comparable to a state-of-the-art algorithm that is intended for mCK, a problem similar to yet theoretically easier than SK-COVER.","PeriodicalId":6883,"journal":{"name":"2016 IEEE 32nd International Conference on Data Engineering (ICDE)","volume":"222 1","pages":"685-696"},"PeriodicalIF":0.0,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77255202","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cited by: 33
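The abstract above does not spell out the authors' O(log m)-approximation algorithm, but the bound is reminiscent of the classic greedy set-cover guarantee. As a rough, hypothetical illustration of that style of heuristic (not the paper's method), the sketch below repeatedly picks the object with the best ratio of newly covered query keywords to distance from the query location; all object and keyword names are invented for the example:

```python
from math import hypot

def greedy_sk_cover(objects, query_keywords, query_loc):
    """Greedy sketch for covering all query keywords with spatio-textual
    objects. Each object is a (location, keyword-list) pair; the score of
    a candidate is (newly covered keywords) / (distance to the query).
    Returns the chosen objects, or None if the keywords cannot be covered.
    Illustrative only -- NOT the algorithm from the ICDE 2016 paper."""
    uncovered = set(query_keywords)
    chosen = []
    while uncovered:
        best, best_score = None, 0.0
        for loc, kws in objects:
            gain = len(uncovered & set(kws))
            if gain == 0:
                continue  # object contributes no new keyword
            dist = hypot(loc[0] - query_loc[0], loc[1] - query_loc[1]) or 1e-9
            score = gain / dist
            if score > best_score:
                best, best_score = (loc, kws), score
        if best is None:
            return None  # some query keyword appears in no object
        chosen.append(best)
        uncovered -= set(best[1])
    return chosen

# Toy example: three objects near the query at (0, 0).
objects = [((0, 1), ["cafe", "wifi"]),
           ((5, 5), ["parking"]),
           ((0, 2), ["parking", "wifi"])]
cover = greedy_sk_cover(objects, {"cafe", "wifi", "parking"}, (0, 0))
# Covers all three keywords with the two objects closest to the query.
```

The greedy choice trades keyword gain against distance, mirroring how SK-COVER penalizes far-away objects; the real algorithm additionally needs the pruning rules and accessing strategies described in the abstract to scale.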
Journal
2016 IEEE 32nd International Conference on Data Engineering (ICDE)