
Latest publications: 2016 IEEE 32nd International Conference on Data Engineering (ICDE)

BlackHole: Robust community detection inspired by graph drawing
Pub Date: 2016-05-16 | DOI: 10.1109/ICDE.2016.7498226
Sungsu Lim, Junghoon Kim, Jae-Gil Lee
With regard to social network analysis, we concentrate on two widely accepted building blocks: community detection and graph drawing. Although community detection and graph drawing have been studied separately, they share a great deal of commonality, which means that it is possible to advance one field using the techniques of the other. In this paper, we propose a novel community detection algorithm for undirected graphs, called BlackHole, by importing a geometric embedding technique from graph drawing. Our proposed algorithm transforms the vertices of a graph into a set of points in a low-dimensional space whose coordinates are determined by a variant of graph drawing algorithms, following the overall procedure of spectral clustering. The set of points is then clustered using a conventional clustering algorithm to form communities. Our primary contribution is to prove that a common idea in graph drawing, namely the consideration of repulsive forces in addition to attractive forces, improves the clusterability of an embedding. As a result, our algorithm has the advantage of being robust especially when the community structure is not easily detectable. Through extensive experiments, we have shown that BlackHole achieves accuracy higher than or comparable to that of state-of-the-art algorithms.
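The pipeline the abstract describes (a force-directed embedding followed by conventional clustering) can be sketched in a few lines. This is a toy illustration under assumed force laws and constants, not the paper's actual BlackHole layout algorithm:

```python
import numpy as np

def force_embed(edges, n, dim=2, iters=300, step=0.01, seed=0):
    """Toy force-directed embedding: attraction along edges plus
    repulsion between all vertex pairs -- the ingredient the paper
    argues improves the clusterability of the embedding."""
    rng = np.random.default_rng(seed)
    pos = rng.normal(size=(n, dim))
    adj = np.zeros((n, n))
    for u, v in edges:
        adj[u, v] = adj[v, u] = 1.0
    for _ in range(iters):
        diff = pos[:, None, :] - pos[None, :, :]                 # pairwise displacements
        dist = np.linalg.norm(diff, axis=-1)
        rep = (diff / (dist ** 3 + 1.0)[..., None]).sum(axis=1)  # ~1/d^2, bounded near 0
        att = -(adj[..., None] * diff).sum(axis=1)               # springs toward neighbors
        pos += step * (att + 0.1 * rep)
    return pos

# Two triangles with no connecting edge: each triangle collapses into a
# tight cluster while repulsion keeps the two groups apart, so any
# conventional clustering of the points recovers the communities.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]
pos = force_embed(edges, n=6)
```

In the paper the final step is a conventional clustering algorithm (e.g. k-means) run on the embedded points; the force model and step sizes above are placeholders chosen only to make the example self-contained.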
Citations: 22
OptImatch: Semantic web system for query problem determination
Pub Date: 2016-05-16 | DOI: 10.1109/ICDE.2016.7498338
Guilherme Damasio, Piotr Mierzejewski, Jaroslaw Szlichta, C. Zuzarte
Query performance problem determination is usually performed by analyzing query execution plans (QEPs). Analyzing complex QEPs is excessively time consuming, and existing automatic problem determination tools do not provide the ability to perform analysis with flexible user-defined problem patterns. We present the novel OptImatch system, which allows a relatively naive user to search for patterns in QEPs and get recommendations from an expert- and user-customizable knowledge base. Our system transforms a QEP into an RDF graph. We provide a web graphical interface for the user to describe a pattern, which is transformed with handlers into a SPARQL query. The SPARQL query is matched against the abstracted RDF graph, and any matched parts of the graph are relayed back to the user. With the knowledge base, the system automatically matches stored patterns to QEPs by adapting dynamic context through a developed tagging language, and ranks recommendations using statistical correlation analysis.
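The matching step the abstract describes (a QEP flattened into an RDF graph, problem patterns expressed as SPARQL-style queries with variables) can be approximated with a naive conjunctive triple-pattern matcher. The operator names, predicates, and pattern below are invented for illustration and are not OptImatch's actual vocabulary:

```python
# A query execution plan (QEP) flattened into RDF-style triples.
# Operator and predicate names are invented for this example.
qep = [
    ("HSJOIN#3", "hasInput", "TBSCAN#1"),
    ("HSJOIN#3", "hasInput", "TBSCAN#2"),
    ("TBSCAN#1", "onTable", "ORDERS"),
    ("TBSCAN#2", "onTable", "LINEITEM"),
    ("TBSCAN#2", "estimatedRows", "9e8"),
]

def match(triples, pattern):
    """Naive conjunctive triple-pattern matching; variables start
    with '?' as in SPARQL. Returns one binding dict per solution."""
    def unify(term, value, env):
        if term.startswith("?"):
            if term in env:
                return env if env[term] == value else None
            bound = dict(env)
            bound[term] = value
            return bound
        return env if term == value else None

    solutions = [{}]
    for tp in pattern:
        next_solutions = []
        for env in solutions:
            for triple in triples:
                e = env
                for term, value in zip(tp, triple):
                    e = unify(term, value, e)
                    if e is None:
                        break
                if e is not None:
                    next_solutions.append(e)
        solutions = next_solutions
    return solutions

# "A join fed by a table scan with a huge row estimate" -- the kind of
# user-defined problem pattern that is matched against the plan.
hits = match(qep, [("?join", "hasInput", "?scan"),
                   ("?scan", "onTable", "?table"),
                   ("?scan", "estimatedRows", "9e8")])
```

The returned variable bindings play the role of the matched plan parts that OptImatch relays back to the user.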
Citations: 6
Mining temporal patterns in interval-based data
Pub Date: 2016-05-16 | DOI: 10.1109/ICDE.2016.7498397
Yi-Cheng Chen, Wen-Chih Peng, Suh-Yin Lee
Sequential pattern mining is an important subfield of data mining. Recently, discovering patterns from interval events has attracted considerable effort due to its widespread applications. However, due to the complex relations between two intervals, efficiently mining interval-based sequences is a challenging issue. In this paper, we develop a novel algorithm, P-TPMiner, to efficiently discover two types of interval-based sequential patterns. Some pruning techniques are proposed to further reduce the search space of the mining process. Experimental studies show that the proposed algorithm is efficient and scalable. Furthermore, we apply the proposed method to real datasets to demonstrate the practicability of the discussed patterns.
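The "complex relations between two intervals" are conventionally formalized with Allen's interval algebra, which interval-based pattern miners build on. A minimal classifier for a simplified subset of the thirteen relations (assuming interval `a` does not start after `b`, a normalization such miners typically apply) might look like:

```python
def allen_relation(a, b):
    """Classify the temporal relation between intervals a = (start, end)
    and b, for a simplified subset of Allen's thirteen relations.
    Assumes a does not start after b; inverse relations are omitted."""
    (s1, e1), (s2, e2) = a, b
    if e1 < s2:
        return "before"
    if e1 == s2:
        return "meets"
    if s1 == s2 and e1 == e2:
        return "equals"
    if s1 == s2:
        return "starts"
    if e1 == e2:
        return "finishes"
    if s1 < s2 and e1 > e2:
        return "contains"
    if s1 < s2 < e1 < e2:
        return "overlaps"
    return "unclassified"
```

The full algebra adds the inverse relations (started-by, overlapped-by, and so on); the combinatorics of these pairwise relations is precisely what makes interval-based sequence mining harder than mining sequences of point events.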
Citations: 20
Scalable data management: NoSQL data stores in research and practice
Pub Date: 2016-05-16 | DOI: 10.1109/ICDE.2016.7498360
Felix Gessert, N. Ritter
The unprecedented scale at which data is consumed and generated today has created a large demand for scalable data management and given rise to non-relational, distributed “NoSQL” database systems. Two central problems triggered this process: 1) vast amounts of user-generated content in modern applications and the resulting request loads and data volumes, and 2) the desire of the developer community to employ problem-specific data models for storage and querying. To address these needs, various data stores have been developed by both industry and research, arguing that the era of one-size-fits-all database systems is over. The heterogeneity and sheer number of these systems - now commonly referred to as NoSQL data stores - make it increasingly difficult to select the most appropriate system for a given application. Therefore, these systems are frequently combined in polyglot persistence architectures to leverage each system in its respective sweet spot. This tutorial gives an in-depth survey of the most relevant NoSQL databases to provide comparative classification and highlight open challenges. To this end, we analyze the approach of each system to derive its scalability, availability, consistency, data modeling, and querying characteristics. We present how each system's design is governed by a central set of trade-offs over irreconcilable system properties. We then cover recent research results in distributed data management to illustrate that some shortcomings of NoSQL systems can already be solved in practice, whereas other NoSQL data management problems pose interesting and unsolved research challenges.
Citations: 20
An embedding approach to anomaly detection
Pub Date: 2016-05-16 | DOI: 10.1109/ICDE.2016.7498256
Renjun Hu, C. Aggarwal, Shuai Ma, J. Huai
Network anomaly detection has become very popular in recent years because of the importance of discovering key regions of structural inconsistency in the network. In addition to application-specific information carried by anomalies, the presence of such structural inconsistency is often an impediment to the effective application of data mining algorithms such as community detection and classification. In this paper, we study the problem of detecting structurally inconsistent nodes that connect to a number of diverse influential communities in large social networks. We show that the use of a network embedding approach, together with a novel dimension reduction technique, is an effective tool to discover such structural inconsistencies. We also experimentally show that the detection of such anomalous nodes has significant applications: one is the specific use of detected anomalies, and the other is the improvement of the effectiveness of community detection.
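As a toy illustration of the problem setting only (not the paper's embedding-based method), a "structurally inconsistent" node connecting to many diverse communities can be flagged by counting the distinct communities among its neighbors; the graph and community labels below are invented:

```python
def community_diversity(adj, community):
    """Score each node by how many distinct communities appear among
    its neighbors; a high score marks the structurally inconsistent
    'bridging' nodes this line of work targets."""
    return {v: len({community[u] for u in nbrs}) for v, nbrs in adj.items()}

# Node 1 touches three different communities; the others touch one.
adj = {1: {2, 3, 4}, 2: {1}, 3: {1}, 4: {1}}
community = {1: "A", 2: "A", 3: "B", 4: "C"}
scores = community_diversity(adj, community)
```

The paper itself scores such nodes in a learned embedding space with a novel dimension reduction, which is considerably more robust than this neighbor-count heuristic.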
Citations: 59
Being prepared in a sparse world: The case of KNN graph construction
Pub Date: 2016-05-16 | DOI: 10.1109/ICDE.2016.7498244
A. Boutet, Anne-Marie Kermarrec, Nupur Mittal, François Taïani
K-Nearest-Neighbor (KNN) graphs have emerged as a fundamental building block of many on-line services providing recommendation, similarity search, and classification. Constructing a KNN graph rapidly and accurately is, however, a computationally intensive task. As data volumes keep growing, speed and the ability to scale out are becoming critical factors when deploying a KNN algorithm. In this work, we present KIFF, a generic, fast, and scalable KNN graph construction algorithm. KIFF directly exploits the bipartite nature of most datasets to which KNN algorithms are applied. This simple but powerful strategy drastically limits the computational cost required to rapidly converge to an accurate KNN solution, especially for sparse datasets. Our evaluation on a representative range of datasets shows that KIFF provides, on average, a speed-up factor of 14 over recent state-of-the-art solutions while improving the quality of the KNN approximation by 18%.
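The bipartite observation at the heart of the approach can be sketched as follows: candidate neighbors are drawn from per-item posting lists rather than compared all-pairs, and are ranked here simply by co-rated item count. This is a simplification under assumed data shapes, not KIFF's full algorithm:

```python
from collections import defaultdict
from itertools import combinations

def bipartite_knn(ratings, k):
    """Sketch of the bipartite shortcut: in a user-item dataset, only
    users sharing at least one item can be useful KNN candidates, so
    candidates come from per-item posting lists instead of an
    all-pairs comparison. Ranking is by co-rated item count."""
    posting = defaultdict(set)                  # item -> users who rated it
    for user, items in ratings.items():
        for item in items:
            posting[item].add(user)
    overlap = defaultdict(lambda: defaultdict(int))
    for users in posting.values():
        for u, v in combinations(sorted(users), 2):
            overlap[u][v] += 1
            overlap[v][u] += 1
    return {user: sorted(overlap[user], key=lambda v: -overlap[user][v])[:k]
            for user in ratings}

ratings = {"u1": {"a", "b"}, "u2": {"a", "c"}, "u3": {"a", "b"}, "u4": {"z"}}
knn = bipartite_knn(ratings, k=1)
```

For sparse datasets the posting lists are short, which is exactly why this candidate-generation strategy converges quickly; the per-user candidate set never includes users with zero shared items (e.g. `u4` above ends up with no neighbors).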
Citations: 26
Differentially private multi-party high-dimensional data publishing
Pub Date: 2016-05-16 | DOI: 10.1109/ICDE.2016.7498241
Sen Su, Peng Tang, Xiang Cheng, R. Chen, Zequn Wu
In this paper, we study the novel problem of publishing high-dimensional data in a distributed multi-party environment under differential privacy. In particular, with the assistance of a semi-trusted curator, the involved parties (i.e., local data owners) collectively generate a synthetic integrated dataset while satisfying ε-differential privacy for any local dataset. To solve this problem, we present a differentially private sequential update of Bayesian network (DP-SUBN) solution. In DP-SUBN, the parties and the curator collaboratively identify the Bayesian network ℕ that best fits the integrated dataset D in a sequential manner, from which a synthetic dataset can then be generated. The fundamental advantage of adopting the sequential update manner is that the parties can treat the statistical results provided by previous parties as their prior knowledge to direct how to learn ℕ. The core of DP-SUBN is the construction of the search frontier, which can be seen as a priori knowledge to guide the parties to update ℕ. To improve the fitness of ℕ and reduce the communication cost, we introduce a correlation-aware search frontier construction (CSFC) approach, where attribute pairs with strong correlations are used to construct the search frontier. In particular, to privately quantify the correlations of attribute pairs without introducing too much noise, we first propose a non-overlapping covering design (NOCD) method, and then introduce a dynamic programming method to find the optimal parameters used in NOCD to ensure that the injected noise is minimum. Through formal privacy analysis, we show that DP-SUBN satisfies ε-differential privacy for any local dataset. Extensive experiments on a real dataset demonstrate that DP-SUBN offers desirable data utility with low communication cost.
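DP-SUBN depends on parties releasing noisy statistics under ε-differential privacy. The standard building block for that is the Laplace mechanism; the sketch below is the textbook mechanism only, not the paper's sequential-update protocol or its NOCD noise-minimization method:

```python
import math
import random

def laplace_mechanism(true_value, sensitivity, epsilon, rng=random):
    """epsilon-DP Laplace mechanism: release true_value plus
    Laplace(sensitivity / epsilon) noise, sampled via the inverse CDF."""
    scale = sensitivity / epsilon
    u = rng.random() - 0.5                       # uniform on [-0.5, 0.5)
    return true_value - scale * math.copysign(math.log(1.0 - 2.0 * abs(u)), u)

# A party releasing a marginal count of 100 under epsilon = 0.5
# (count queries have sensitivity 1).
noisy = laplace_mechanism(100.0, 1.0, 0.5, random.Random(7))
```

The noise is unbiased, so repeated releases average out to the true count; the paper's contribution is in choosing *which* statistics to release, and in what order, so that the total privacy budget is spent with minimum injected noise.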
Citations: 34
Scaling up truth discovery
Pub Date: 2016-05-16 | DOI: 10.1109/ICDE.2016.7498359
Laure Berti-Équille
The evolution of the Web from a technology platform to a social ecosystem has resulted in unprecedented data volumes being continuously generated, exchanged, and consumed. User-generated content on the Web is massive, highly dynamic, and characterized by a combination of factual data and opinion data. False information, rumors, and fake content can easily spread across multiple sources, making it hard to distinguish between what is true and what is not. Truth discovery (also known as fact-checking) has recently gained a lot of interest from Data Science communities. This tutorial will attempt to cover recent work on truth-finding and how it can scale to Big Data. We will provide a broad overview with new insights, highlighting the progress made on truth discovery from information extraction, data and knowledge fusion, as well as the modeling of misinformation dynamics in social networks. We will review in detail the current models, algorithms, and techniques proposed by various research communities whose contributions converge towards the same goal of estimating the veracity of data in a dynamic world. Our aim is to bridge theory and practice and introduce recent work from diverse disciplines to database people so that they are better equipped to address the challenges of truth discovery in Big Data.
Citations: 4
Event regularity and irregularity in a time unit
Pub Date: 2016-05-16 | DOI: 10.1109/ICDE.2016.7498302
Lijian Wan, Tingjian Ge
In this paper, we study the problem of learning a regular model from a number of sequences, each of which contains events in a time unit. Assuming some regularity in such sequences, we determine which events should be deemed irregular in their contexts. We perform an in-depth analysis of the model we build, and propose two optimization techniques, one of which is also of independent interest for solving a new problem, which we name the Group Counting problem. Our comprehensive experiments on real and hybrid datasets show that the model we build is very effective in characterizing regularities and identifying irregular events. One of our optimizations improves model-building speed by more than an order of magnitude, and the other significantly reduces space consumption.
Citations: 2
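The regularity model described in the abstract above can be illustrated with a deliberately simple frequency-threshold sketch: treat an event as regular if it appears in most time units, and flag the rest as irregular in their contexts. This is an illustrative stand-in, not the paper's learned model; the name `flag_irregular_events` and the `min_support` threshold are assumptions for the example.

```python
from collections import Counter

def flag_irregular_events(time_units, min_support=0.8):
    """Flag events occurring in fewer than min_support of the time units.

    time_units: list of sets, one set of events per time unit.
    Returns the set of regular events and, for each unit, the events
    deemed irregular in that unit's context (present but not regular).
    """
    n = len(time_units)
    counts = Counter(e for unit in time_units for e in unit)
    regular = {e for e, c in counts.items() if c / n >= min_support}
    irregular_per_unit = [set(unit) - regular for unit in time_units]
    return regular, irregular_per_unit
```

The paper learns a richer model than a global frequency cutoff, but the sketch shows the input/output shape: per-unit event sets in, regular events and per-unit anomalies out.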
HAWK: Hardware support for unstructured log processing
Pub Date : 2016-05-16 DOI: 10.1109/ICDE.2016.7498263
Prateek Tandon, Faissal M. Sleiman, Michael J. Cafarella, T. Wenisch
Rapidly processing high-velocity text data is critical for many technical and business applications. Widely used software solutions for processing these large text corpora target disk-resident data and rely on pre-computed indexes and large clusters to achieve high performance. However, greater capacity and falling costs are enabling a shift to RAM-resident data sets. The enormous bandwidth of RAM can facilitate scan operations that are competitive with pre-computed indexes for interactive, ad-hoc queries. However, software approaches for processing these large text corpora fall far short of saturating available bandwidth and meeting peak scan rates possible on modern memory systems. In this paper, we present HAWK, a hardware accelerator for ad hoc queries against large in-memory logs. HAWK comprises a stall-free hardware pipeline that scans input data at a constant rate, examining multiple input characters in parallel during a single accelerator clock cycle. We describe a 1GHz 32-characterwide HAWK design targeting ASIC implementation, designed to process data at 32GB/s (up to two orders of magnitude faster than software solutions), and demonstrate a scaled-down FPGA prototype that operates at 100MHz with 4-wide parallelism, which processes at 400MB/s (13× faster than software grep for large multi-pattern scans).
Rapidly processing high-velocity text data is critical for many technical and business applications. Widely used software solutions for processing these large text corpora target disk-resident data and rely on pre-computed indexes and large clusters to achieve high performance. However, greater capacity and falling costs are driving a shift to RAM-resident data sets. The enormous bandwidth of RAM can facilitate scan operations that are competitive with pre-computed indexes for interactive, ad-hoc queries. Yet software approaches for processing these large text corpora fall far short of saturating the available bandwidth and the peak scan rates possible on modern memory systems. In this paper, we present HAWK, a hardware accelerator for ad hoc queries against large in-memory logs. HAWK comprises a stall-free hardware pipeline that scans input data at a constant rate, examining multiple input characters in parallel during a single accelerator clock cycle. We describe a 1GHz 32-character-wide HAWK design targeting ASIC implementation, designed to process data at 32GB/s (up to two orders of magnitude faster than software solutions), and demonstrate a scaled-down FPGA prototype operating at 100MHz with 4-wide parallelism, which processes data at 400MB/s (13× faster than software grep for large multi-pattern scans).
{"title":"HAWK: Hardware support for unstructured log processing","authors":"Prateek Tandon, Faissal M. Sleiman, Michael J. Cafarella, T. Wenisch","doi":"10.1109/ICDE.2016.7498263","DOIUrl":"https://doi.org/10.1109/ICDE.2016.7498263","url":null,"abstract":"Rapidly processing high-velocity text data is critical for many technical and business applications. Widely used software solutions for processing these large text corpora target disk-resident data and rely on pre-computed indexes and large clusters to achieve high performance. However, greater capacity and falling costs are enabling a shift to RAM-resident data sets. The enormous bandwidth of RAM can facilitate scan operations that are competitive with pre-computed indexes for interactive, ad-hoc queries. However, software approaches for processing these large text corpora fall far short of saturating available bandwidth and meeting peak scan rates possible on modern memory systems. In this paper, we present HAWK, a hardware accelerator for ad hoc queries against large in-memory logs. HAWK comprises a stall-free hardware pipeline that scans input data at a constant rate, examining multiple input characters in parallel during a single accelerator clock cycle. 
We describe a 1GHz 32-characterwide HAWK design targeting ASIC implementation, designed to process data at 32GB/s (up to two orders of magnitude faster than software solutions), and demonstrate a scaled-down FPGA prototype that operates at 100MHz with 4-wide parallelism, which processes at 400MB/s (13× faster than software grep for large multi-pattern scans).","PeriodicalId":6883,"journal":{"name":"2016 IEEE 32nd International Conference on Data Engineering (ICDE)","volume":"4 4 1","pages":"469-480"},"PeriodicalIF":0.0,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75939254","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 22
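HAWK itself is a hardware pipeline, but the workload it targets — scanning one input stream for many patterns in a single pass — has a classic software counterpart in the Aho–Corasick automaton, the kind of baseline the reported grep comparison implies. The sketch below is a generic multi-pattern scanner under that assumption, not a model of HAWK's design.

```python
from collections import deque

def build_automaton(patterns):
    """Build an Aho-Corasick automaton: goto, fail, and output tables."""
    goto = [{}]       # state -> {char: next state}
    out = [set()]     # state -> set of patterns ending at this state
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                out.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].add(pat)
    # BFS to fill failure links and merge outputs along them
    fail = [0] * len(goto)
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] |= out[fail[t]]
    return goto, fail, out

def scan(text, patterns):
    """Single pass over text; yields (end_index, pattern) for every match."""
    goto, fail, out = build_automaton(patterns)
    state = 0
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in out[state]:
            yield (i, pat)
```

A pipeline like HAWK's can evaluate many such transitions per clock cycle; software advances the automaton one character at a time, which is why memory-bandwidth-saturating scan rates are hard to reach in software.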
Journal
2016 IEEE 32nd International Conference on Data Engineering (ICDE)