
Proceedings of the 2016 International Conference on Management of Data: Latest Publications

DUALSIM: Parallel Subgraph Enumeration in a Massive Graph on a Single Machine
Pub Date: 2016-06-14 DOI: 10.1145/2882903.2915209
Hyeonji Kim, Juneyoung Lee, S. Bhowmick, Wook-Shin Han, Jeong-Hoon Lee, Seongyun Ko, M. Jarrah
Subgraph enumeration is important for many applications such as subgraph frequencies, network motif discovery, graphlet kernel computation, and studying the evolution of social networks. Most earlier work on subgraph enumeration assumes that graphs are resident in memory, which results in serious scalability problems. Recently, efforts to enumerate all subgraphs in a large-scale graph have seemed to enjoy some success by partitioning the data graph and exploiting distributed frameworks such as MapReduce and distributed graph engines. However, we notice that all existing distributed approaches have serious performance problems for subgraph enumeration due to the explosive number of partial results. In this paper, we design and implement a disk-based, single-machine parallel subgraph enumeration solution called DualSim that can handle massive graphs without maintaining exponential numbers of partial results. Specifically, we propose a novel concept of the dual approach for subgraph enumeration. The dual approach swaps the roles of the data graph and the query graph: instead of fixing the matching order in the query and then matching data vertices, it fixes the data vertices by fixing a set of disk pages and then finds all subgraph matchings in these pages. This enables us to significantly reduce the number of disk reads. We conduct extensive experiments with various real-world graphs to systematically demonstrate the superiority of DualSim over state-of-the-art distributed subgraph enumeration methods. DualSim outperforms the state-of-the-art methods by up to orders of magnitude, whereas those methods fail for many queries due to explosive numbers of intermediate results.
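To make the role swap concrete, here is a minimal Python sketch of page-first enumeration; the toy graph, the page layout, and every name in it are illustrative assumptions rather than DualSim's actual data structures:

```python
from itertools import combinations_with_replacement, permutations

# Toy data graph stored as "pages": each page holds the adjacency lists
# of a few vertices (page size 2 here). Layout and names are invented.
PAGES = {
    0: {0: [1, 2], 1: [0, 2]},
    1: {2: [0, 1, 3], 3: [2, 4]},
    2: {4: [3, 5], 5: [4]},
}
TRIANGLE = [(0, 1), (1, 2), (0, 2)]  # query graph: a triangle

def matchings(loaded, query_edges, k):
    """All injective mappings of the k query vertices onto vertices
    available in the loaded pages that realize every query edge."""
    adj = {}
    for page in loaded:
        adj.update(page)
    for combo in permutations(adj, k):
        if all(combo[b] in adj[combo[a]] for a, b in query_edges):
            yield combo

# Dual approach, in miniature: fix the data side (a set of disk pages),
# load it once, and enumerate every matching confined to those pages,
# rather than fixing a query matching order and probing pages repeatedly.
found = set()
for page_ids in combinations_with_replacement(sorted(PAGES), 3):
    loaded = [PAGES[p] for p in set(page_ids)]
    for m in matchings(loaded, TRIANGLE, 3):
        found.add(frozenset(m))  # de-duplicate across overlapping page sets
print(found)  # {frozenset({0, 1, 2})}
```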
Citations: 66
Data Polygamy: The Many-Many Relationships among Urban Spatio-Temporal Data Sets
Pub Date: 2016-06-14 DOI: 10.1145/2882903.2915245
F. Chirigati, Harish Doraiswamy, T. Damoulas, J. Freire
The increasing ability to collect data from urban environments, coupled with a push towards openness by governments, has resulted in the availability of numerous spatio-temporal data sets covering diverse aspects of a city. Discovering relationships between these data sets can produce new insights by enabling domain experts to not only test but also generate hypotheses. However, discovering these relationships is difficult. First, a relationship between two data sets may occur only at certain locations and/or time periods. Second, the sheer number and size of the data sets, coupled with the diverse spatial and temporal scales at which the data is available, presents computational challenges on all fronts, from indexing and querying to analysis. Finally, it is non-trivial to differentiate between meaningful and spurious relationships. To address these challenges, we propose Data Polygamy, a scalable topology-based framework that allows users to query for statistically significant relationships between spatio-temporal data sets. We have performed an experimental evaluation using over 300 spatio-temporal urban data sets which shows that our approach is scalable and effective at identifying interesting relationships.
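As a rough illustration of testing whether two spatio-temporal signals are significantly related, the sketch below extracts crude "features" (strong deviations from the mean) from two aligned series and applies a permutation test; this feature definition is a stand-in assumption, not the paper's topology-based construction:

```python
import random
from statistics import mean, stdev

def features(series, z=1.0):
    """Time steps where the signal deviates strongly from its mean;
    a crude stand-in for the paper's topology-derived features."""
    m, s = mean(series), stdev(series)
    return {t for t, v in enumerate(series) if abs(v - m) > z * s}

def significance(series_a, series_b, trials=1000, seed=0):
    """Permutation test: is the feature overlap larger than expected
    if one series' feature positions were placed at random?"""
    rng = random.Random(seed)
    fa, fb = features(series_a), features(series_b)
    observed = len(fa & fb)
    n = len(series_a)
    hits = 0
    for _ in range(trials):
        shuffled = set(rng.sample(range(n), len(fb)))
        if len(fa & shuffled) >= observed:
            hits += 1
    return observed, hits / trials  # overlap and empirical p-value

# Two synthetic "urban" signals that both spike on the same days.
taxi  = [10, 11, 10, 40, 42, 11, 10, 9, 38, 10]
noise = [ 3,  3,  4, 12, 13,  3,  4, 3, 11,  3]
print(significance(taxi, noise))  # large overlap, small p-value
```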
Citations: 64
Ambry: LinkedIn's Scalable Geo-Distributed Object Store
Pub Date: 2016-06-14 DOI: 10.1145/2882903.2903738
S. Noghabi, Sriram Ganapathi Subramanian, Priyesh Narayanan, Sivabalan Narayanan, G. Holla, M. Zadeh, Tianwei Li, Indranil Gupta, R. Campbell
The infrastructure beneath a worldwide social network has to continually serve billions of variable-sized media objects such as photos, videos, and audio clips. These objects must be stored and served with low latency and high throughput by a system that is geo-distributed, highly scalable, and load-balanced. Existing file systems and object stores face several challenges when serving such large objects. We present Ambry, a production-quality system for storing large immutable data (called blobs). Ambry is designed in a decentralized way and leverages techniques such as logical blob grouping, asynchronous replication, rebalancing mechanisms, zero-cost failure detection, and OS caching. Ambry has been running in LinkedIn's production environment for the past 2 years, serving up to 10K requests per second across more than 400 million users. Our experimental evaluation reveals that Ambry offers high efficiency (utilizing up to 88% of the network bandwidth), low latency (less than 50 ms latency for a 1 MB object), and load balancing (reducing the imbalance of request rates among disks by 8x-10x).
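A minimal sketch of partition-addressed blob storage in the spirit of Ambry's logical grouping follows; replication here is synchronous and in-process, failure detection and caching are omitted, and all names are invented:

```python
import uuid

class MiniBlobStore:
    """Toy partition-addressed store for immutable blobs. Each blob id
    carries its partition, so reads route without a global lookup table."""

    def __init__(self, num_partitions=4, replicas=3):
        # each partition maps to a list of replica "stores" (plain dicts)
        self.partitions = {p: [dict() for _ in range(replicas)]
                           for p in range(num_partitions)}
        self.next = 0  # round-robin choice of writable partition

    def put(self, data: bytes) -> str:
        p = self.next % len(self.partitions)
        self.next += 1
        blob_id = f"{p}:{uuid.uuid4().hex}"   # partition id travels in the id
        for replica in self.partitions[p]:    # synchronous stand-in for
            replica[blob_id] = data           # Ambry's async replication
        return blob_id

    def get(self, blob_id: str) -> bytes:
        p = int(blob_id.split(":", 1)[0])     # route by partition prefix
        for replica in self.partitions[p]:    # read from first live replica
            if blob_id in replica:
                return replica[blob_id]
        raise KeyError(blob_id)

store = MiniBlobStore()
bid = store.put(b"profile-photo-bytes")
assert store.get(bid) == b"profile-photo-bytes"
```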
Citations: 29
Estimating the Impact of Unknown Unknowns on Aggregate Query Results
Pub Date: 2016-06-14 DOI: 10.1145/2882903.2882909
Yeounoh Chung, Michael L. Mortensen, Carsten Binnig, Tim Kraska
It is common practice for data scientists to acquire and integrate disparate data sources to achieve higher quality results. But even with a perfectly cleaned and merged data set, two fundamental questions remain: (1) is the integrated data set complete and (2) what is the impact of any unknown (i.e., unobserved) data on query results? In this work, we develop and analyze techniques to estimate the impact of the unknown data (a.k.a. unknown unknowns) on simple aggregate queries. The key idea is that the overlap between different data sources enables us to estimate the number and values of the missing data items. Our main techniques are parameter-free and do not assume prior knowledge about the distribution. Through a series of experiments, we show that estimating the impact of unknown unknowns is invaluable to better assess the results of aggregate queries over integrated data sources.
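The overlap idea can be illustrated with a classic species estimator; the Chao1 formula below is one well-known member of this family, shown as an assumption about the flavor of the technique rather than the authors' exact estimator:

```python
from collections import Counter

def chao1_unseen(observations):
    """Chao1 species estimator: from how often each distinct item was
    observed across overlapping sources, estimate how many distinct
    items were never observed at all."""
    freq = Counter(Counter(observations).values())  # f_k = items seen k times
    f1, f2 = freq.get(1, 0), freq.get(2, 0)
    if f2 == 0:
        return f1 * (f1 - 1) / 2.0  # bias-corrected variant for f2 = 0
    return f1 * f1 / (2.0 * f2)

# Three sources report restaurant names with overlap; the duplicates
# are the signal that lets us extrapolate to the unseen items.
reports = ["A", "B", "C", "A", "B", "D", "E", "A", "F", "G"]
print("distinct seen:", len(set(reports)))
print("estimated unseen:", chao1_unseen(reports))
# The impact on an aggregate such as SUM would then be estimated by
# scaling for the estimated unseen items and their expected values.
```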
Citations: 5
Expressive Query Construction through Direct Manipulation of Nested Relational Results
Pub Date: 2016-06-14 DOI: 10.1145/2882903.2915210
Eirik Bakke, David R Karger
Despite extensive research on visual query systems, the standard way to interact with relational databases remains SQL queries and tailored form interfaces. We consider three requirements to be essential to a successful alternative: (1) query specification through direct manipulation of results, (2) the ability to view and modify any part of the current query without departing from the direct manipulation interface, and (3) SQL-like expressiveness. This paper presents the first visual query system to meet all three requirements in a single design. By directly manipulating nested relational results, and using spreadsheet idioms such as formulas and filters, the user can express a relationally complete set of query operators plus calculation, aggregation, outer joins, sorting, and nesting, while always remaining able to track and modify the state of the complete query. Our prototype gives the user an experience of responsive, incremental query building while pushing all actual query processing to the database layer. We evaluate our system with formative and controlled user studies on 28 spreadsheet users; the controlled study shows our system significantly outperforming Microsoft Access on the System Usability Scale.
Citations: 37
Multi-Source Uncertain Entity Resolution at Yad Vashem: Transforming Holocaust Victim Reports into People
Pub Date: 2016-06-14 DOI: 10.1145/2882903.2903737
Tomer Sagi, A. Gal, Omer Barkol, Ruth Bergman, Alexander Avram
In this work we describe an entity resolution project performed at Yad Vashem, the central repository of Holocaust-era information. The Yad Vashem dataset is unique with respect to classic entity resolution, by virtue of both being massively multi-source and requiring multi-level entity resolution. With today's abundance of information sources, this project sets an example for multi-source resolution on a big-data scale. We discuss a set of requirements that led us to choose the MFIBlocks entity resolution algorithm in achieving the goals of the application. We also provide a machine learning approach, based upon decision trees, to transform soft clusters into a ranked clustering of records representing possible entities. An extensive empirical evaluation demonstrates the unique properties of this dataset, highlighting the shortcomings of current methods and proposing avenues for future research in this realm.
Citations: 10
Augmented Sketch: Faster and More Accurate Stream Processing
Pub Date: 2016-06-14 DOI: 10.1145/2882903.2882948
Pratanu Roy, Arijit Khan, G. Alonso
Approximation algorithms are often used to estimate the frequency of items on high-volume, fast data streams. The most common ones are variations of the Count-Min sketch, which use sub-linear space for the counts but can produce errors in the counts of the most frequent items and can misclassify low-frequency items. In this paper, we improve the accuracy of sketch-based algorithms by increasing the frequency estimation accuracy of the most frequent items and reducing the possible misclassification of low-frequency items, while also improving the overall throughput. Our solution, called Augmented Sketch (ASketch), is based on a pre-filtering stage that dynamically identifies and aggregates the most frequent items. Items overflowing the pre-filtering stage are processed using a conventional sketch algorithm, thereby making the solution general and applicable in a wide range of contexts. The pre-filtering stage can be efficiently implemented with SIMD instructions on multi-core machines and can be further parallelized through pipeline parallelism where the filtering stage runs in one core and the sketch algorithm runs in another core.
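A compact sketch of the filter-plus-sketch idea, assuming a plain Count-Min sketch behind a small exact filter; the exchange rule below (swap when a sketched item's estimate overtakes the coldest filtered item) is a simplification of ASketch's actual protocol:

```python
import random

class CountMin:
    def __init__(self, width=1024, depth=4, seed=7):
        rng = random.Random(seed)
        self.width = width
        self.salts = [rng.getrandbits(32) for _ in range(depth)]
        self.rows = [[0] * width for _ in range(depth)]

    def add(self, item, count=1):
        for row, salt in zip(self.rows, self.salts):
            row[hash((salt, item)) % self.width] += count

    def estimate(self, item):
        return min(row[hash((salt, item)) % self.width]
                   for row, salt in zip(self.rows, self.salts))

class ASketchLike:
    """Small exact pre-filter for hot items in front of a Count-Min
    sketch; overflowing items go to the sketch, and items whose
    estimate overtakes the coldest filtered item are swapped in."""

    def __init__(self, filter_size=8):
        self.filter = {}          # item -> exact count for current hot set
        self.size = filter_size
        self.sketch = CountMin()

    def add(self, item):
        if item in self.filter:
            self.filter[item] += 1
        elif len(self.filter) < self.size:
            self.filter[item] = 1
        else:
            self.sketch.add(item)
            est = self.sketch.estimate(item)
            coldest = min(self.filter, key=self.filter.get)
            if est > self.filter[coldest]:
                # demote the overtaken item's count into the sketch
                self.sketch.add(coldest, self.filter.pop(coldest))
                self.filter[item] = est

    def estimate(self, item):
        return self.filter.get(item, self.sketch.estimate(item))

s = ASketchLike(filter_size=2)
for x in "aaaaabbbccxyz":
    s.add(x)
print(s.estimate("a"), s.estimate("c"))  # exact 5 from the filter, 2 from the sketch
```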
Citations: 135
PrivateClean: Data Cleaning and Differential Privacy
Pub Date: 2016-06-14 DOI: 10.1145/2882903.2915248
S. Krishnan, Jiannan Wang, M. Franklin, Ken Goldberg, Tim Kraska
Recent advances in differential privacy make it possible to guarantee user privacy while preserving the main characteristics of the data. However, most differential privacy mechanisms assume that the underlying dataset is clean. This paper explores the link between data cleaning and differential privacy in a framework we call PrivateClean. PrivateClean includes a technique for creating private datasets of numerical and discrete-valued attributes, a formalism for privacy-preserving data cleaning, and techniques for answering sum, count, and avg queries after cleaning. We show: (1) how the degree of privacy affects subsequent aggregate query accuracy, (2) how privacy potentially amplifies certain types of errors in a dataset, and (3) how this analysis can be used to tune the degree of privacy. The key insight is to maintain a bipartite graph relating dirty values to clean values and use this graph to estimate biases due to the interaction between cleaning and privacy. We validate these results on four datasets with a variety of well-studied cleaning techniques, including functional dependencies, outlier filtering, and the resolution of inconsistent attributes.
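One way to see how a count query can be answered over privatized discrete values is generalized randomized response, with the known perturbation probabilities inverted at query time; this is a hedged sketch of that general mechanism class, not PrivateClean's specific formalism:

```python
import random

def privatize(values, domain, p_keep=0.75, seed=42):
    """Generalized randomized response: keep a value with probability
    p_keep, otherwise replace it with a uniform draw from the domain."""
    rng = random.Random(seed)
    return [v if rng.random() < p_keep else rng.choice(domain)
            for v in values]

def private_count(private_values, target, domain, p_keep=0.75):
    """Unbiased estimate of how many original rows equal `target`:
    solve E[observed] = p_keep * c + n * q for the true count c,
    where q is the chance any row lands on target by noise."""
    n = len(private_values)
    observed = sum(1 for v in private_values if v == target)
    q = (1 - p_keep) / len(domain)
    return (observed - n * q) / p_keep

domain = ["NY", "SF", "LA"]
clean = ["NY"] * 600 + ["SF"] * 300 + ["LA"] * 100   # after data cleaning
noisy = privatize(clean, domain, 0.75)
print(private_count(noisy, "NY", domain, 0.75))       # close to 600
```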
Citations: 33
Efficient and Progressive Group Steiner Tree Search
Pub Date: 2016-06-14 DOI: 10.1145/2882903.2915217
Ronghua Li, Lu Qin, J. Yu, Rui Mao
The Group Steiner Tree (GST) problem is a fundamental problem in the database area that has been successfully applied to keyword search in relational databases and team search in social networks. The state-of-the-art algorithm for the GST problem is a parameterized dynamic programming (DP) algorithm, which finds the optimal tree in O(3^k n + 2^k (n log n + m)) time, where k is the number of given groups, and m and n are the numbers of edges and nodes of the graph respectively. The major limitations of the parameterized DP algorithm are twofold: (i) it is intractable even for very small values of k (e.g., k=8) in large graphs due to its exponential complexity, and (ii) it cannot generate a solution until the algorithm has completed its entire execution. To overcome these limitations, we propose an efficient and progressive GST algorithm in this paper, called PrunedDP. It is based on newly-developed optimal-tree decomposition and conditional tree merging techniques. The proposed algorithm not only drastically reduces the search space of the parameterized DP algorithm, but it also produces progressively-refined feasible solutions during algorithm execution. To further speed up the PrunedDP algorithm, we propose a progressive A*-search algorithm, based on several carefully-designed lower-bounding techniques. We conduct extensive experiments to evaluate our algorithms on several large scale real-world graphs. The results show that our best algorithm is not only able to generate progressively-refined feasible solutions, but it also finds the optimal solution with at least two orders of magnitude acceleration over the state-of-the-art algorithm, using much less memory.
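The baseline parameterized DP is short enough to sketch; the version below implements the textbook recurrence (merge two subtrees at a vertex, then relax along edges) on a toy graph, without any of PrunedDP's pruning or A*-search machinery:

```python
def group_steiner(n, edges, groups):
    """Baseline parameterized DP for Group Steiner Tree: cost[v][S] is
    the cheapest tree rooted at v that touches every group whose bit is
    set in S. Exponential in the number of groups k, as in the paper."""
    INF = float("inf")
    adj = [[] for _ in range(n)]
    for u, v, w in edges:
        adj[u].append((v, w))
        adj[v].append((u, w))
    k = len(groups)
    full = (1 << k) - 1
    cost = [[INF] * (full + 1) for _ in range(n)]
    for i, group in enumerate(groups):
        for v in group:
            cost[v][1 << i] = 0           # a group member alone covers {i}
    for S in range(1, full + 1):
        sub = (S - 1) & S                 # enumerate proper sub-masks of S
        while sub:
            rest = S & ~sub
            if rest:
                for v in range(n):        # merge two subtrees rooted at v
                    c = cost[v][sub] + cost[v][rest]
                    if c < cost[v][S]:
                        cost[v][S] = c
            sub = (sub - 1) & S
        for _ in range(n - 1):            # grow along edges to closure
            changed = False
            for u in range(n):
                for v, w in adj[u]:
                    if cost[u][S] + w < cost[v][S]:
                        cost[v][S] = cost[u][S] + w
                        changed = True
            if not changed:
                break
    return min(cost[v][full] for v in range(n))

# Path 0-1-2-3-4 with unit weights; groups {0} and {2, 4}.
print(group_steiner(5, [(0, 1, 1), (1, 2, 1), (2, 3, 1), (3, 4, 1)],
                    [[0], [2, 4]]))       # -> 2 (the path 0-1-2)
```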
Citations: 47
Accelerating Relational Databases by Leveraging Remote Memory and RDMA
Pub Date: 2016-06-14 DOI: 10.1145/2882903.2882949
Feng Li, Sudipto Das, M. Syamala, Vivek R. Narasayya
Memory is a crucial resource in relational databases (RDBMSs). When there is insufficient memory, RDBMSs are forced to use slower media such as SSDs or HDDs, which can significantly degrade workload performance. Cloud database services are deployed in data centers where network adapters supporting remote direct memory access (RDMA) at low latency and high bandwidth are becoming prevalent. We study the novel problem of how a Symmetric Multi-Processing (SMP) RDBMS, whose memory demands exceed locally-available memory, can leverage available remote memory in the cluster accessed via RDMA to improve query performance. We expose available memory on remote servers using a lightweight file API that allows an SMP RDBMS to leverage the benefits of remote memory with modest changes. We identify and implement several novel scenarios to demonstrate these benefits, and address design challenges that are crucial for efficient implementation. We implemented the scenarios in Microsoft SQL Server engine and present the first end-to-end study to demonstrate benefits of remote memory for a variety of micro-benchmarks and industry-standard benchmarks. Compared to using disks when memory is insufficient, we improve the throughput and latency of queries with short reads and writes by 3X to 10X, while improving the latency of multiple TPC-H and TPC-DS queries by 2X to 100X.
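The effect of a file API over remote memory can be mimicked with a toy buffer pool that spills cold pages through a file-like handle; here an in-memory BytesIO stands in for the RDMA-backed remote-memory file, and the page format and names are invented assumptions:

```python
import io
from collections import OrderedDict

class SpillingBufferPool:
    """Toy buffer pool: hot pages stay in local memory, cold pages are
    evicted through a file-like handle. In the paper's setting that
    handle would be backed by remote memory over RDMA rather than disk."""

    PAGE = 4096

    def __init__(self, capacity_pages, spill_file=None):
        self.capacity = capacity_pages
        self.local = OrderedDict()          # page_id -> bytes, LRU order
        self.spill = spill_file or io.BytesIO()
        self.slots = {}                     # page_id -> offset in spill file

    def write(self, page_id, data: bytes):
        assert len(data) == self.PAGE
        self.local[page_id] = data
        self.local.move_to_end(page_id)
        while len(self.local) > self.capacity:
            victim, vdata = self.local.popitem(last=False)   # evict LRU page
            off = self.slots.setdefault(victim, self.spill.seek(0, 2))
            self.spill.seek(off)
            self.spill.write(vdata)         # one "remote" write, not a disk I/O

    def read(self, page_id) -> bytes:
        if page_id in self.local:
            self.local.move_to_end(page_id)
            return self.local[page_id]
        self.spill.seek(self.slots[page_id])
        data = self.spill.read(self.PAGE)
        self.write(page_id, data)           # cache the page back locally
        return data

pool = SpillingBufferPool(capacity_pages=2)
for i in range(4):
    pool.write(i, bytes([i]) * SpillingBufferPool.PAGE)
print(pool.read(0)[:4])   # served from the spill file, then re-cached
```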
Citations: 71