
Proceedings. ACM-SIGMOD International Conference on Management of Data: Latest Publications

Mobile interaction and query optimization in a protein-ligand data analysis system
Pub Date : 2013-06-22 DOI: 10.1145/2463676.2465344
Marvin Lapeine, K. Herbert, Emily Hill, N. Goodey
With current trends in integrating phylogenetic analysis into pharma-research, computing systems that integrate the two areas can help the drug discovery field. DrugTree is a tool that overlays ligand data on a protein-motivated phylogenetic tree. While initial tests of DrugTree are successful, we have noticed a number of lags when querying the tree. Due to the interleaving nature of the data, query optimization can become problematic, since the data is obtained from multiple sources, integrated, and then presented to the user with the ligand data imposed upon the phylogenetic analysis layer. This poster presents our initial methodologies for addressing these query optimization issues. Our approach applies established standards as well as novel mechanisms to improve query response time.
Citations: 1
Inter-media hashing for large-scale retrieval from heterogeneous data sources
Pub Date : 2013-06-22 DOI: 10.1145/2463676.2465274
Jingkuan Song, Yang Yang, Yi Yang, Zi-Liang Huang, Heng Tao Shen
In this paper, we present a new multimedia retrieval paradigm to innovate large-scale search of heterogeneous multimedia data. It is able to return results of different media types from heterogeneous data sources, e.g., using a query image to retrieve relevant text documents or images from different data sources. This utilizes the widely available data from different sources and caters to users' current demand for a result list that simultaneously contains multiple types of data, giving a comprehensive understanding of the query's results. To enable large-scale inter-media retrieval, we propose a novel inter-media hashing (IMH) model to explore the correlations among multiple media types from different data sources and tackle the scalability issue. To this end, multimedia data from heterogeneous data sources are transformed into a common Hamming space, in which fast search can be easily implemented by XOR and bit-count operations. Furthermore, we integrate a linear regression model to learn hashing functions so that the hash codes for new data points can be efficiently generated. Experiments conducted on real-world large-scale multimedia datasets demonstrate the superiority of our proposed method compared with state-of-the-art techniques.
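
Since the abstract reduces cross-media search to XOR and bit-count operations in a shared Hamming space, a small sketch makes the mechanics concrete. Everything below (code width, toy data, function names) is an illustrative assumption, not the paper's implementation:

```python
# A minimal sketch of search in a common Hamming space: once heterogeneous
# items (images, text) are mapped to binary hash codes, distance is an XOR
# followed by a bit count.

def hamming_distance(a: int, b: int) -> int:
    """Distance between two binary hash codes via XOR + popcount."""
    return (a ^ b).bit_count()  # Python >= 3.10; else bin(a ^ b).count("1")

def knn(query_code: int, db_codes: list[int], k: int = 2) -> list[int]:
    """Indices of the k database codes closest to the query code."""
    order = sorted(range(len(db_codes)),
                   key=lambda i: hamming_distance(query_code, db_codes[i]))
    return order[:k]

# Toy usage: an image query against a mixed image/text code database.
db = [0b1011_0010, 0b1011_0110, 0b0100_1001]
print(knn(0b1011_0011, db))  # -> [0, 1]
```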
Citations: 517
Characterizing tenant behavior for placement and crisis mitigation in multitenant DBMSs
Pub Date : 2013-06-22 DOI: 10.1145/2463676.2465308
Aaron J. Elmore, Sudipto Das, A. Pucher, D. Agrawal, A. E. Abbadi, Xifeng Yan
A multitenant database management system (DBMS) in the cloud must continuously monitor the trade-off between efficient resource sharing among multiple application databases (tenants) and their performance. Considering the scale of hundreds to thousands of tenants in such multitenant DBMSs, manual approaches for continuous monitoring are not tenable. A self-managing controller of a multitenant DBMS faces several challenges. For instance, how to characterize a tenant given its variety of workloads, how to reduce the impact of tenant colocation, and how to detect and mitigate a performance crisis where one or more tenants' desired service level objective (SLO) is not achieved. We present Delphi, a self-managing system controller for a multitenant DBMS, and Pythia, a technique to learn behavior through observation and supervision using DBMS-agnostic database-level performance measures. Pythia accurately learns tenant behavior even when multiple tenants share a database process, learns good and bad tenant consolidation plans (or packings), and maintains a per-tenant history to detect behavior changes. Delphi detects performance crises and leverages Pythia to suggest remedial actions, using a hill-climbing search algorithm to identify a new tenant placement strategy that mitigates SLO violations. Our evaluation using a variety of tenant types and workloads shows that Pythia can learn a tenant's behavior with more than 92% accuracy and learn the quality of packings with more than 86% accuracy. During a performance crisis, Delphi is able to reduce 99th percentile latencies by 80%, and can consolidate 45% more tenants than a greedy baseline that balances tenant load without modeling tenant behavior.
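
The hill-climbing placement search mentioned above is easy to sketch. The following toy, which assumes a crude load-variance cost in place of Pythia's learned packing-quality model, shows the accept-only-improving-moves loop; all names and numbers are hypothetical, not the paper's implementation:

```python
import random

def placement_cost(placement, load, n_servers):
    """Toy cost: variance of total tenant load across all servers."""
    totals = [0.0] * n_servers
    for tenant, server in placement.items():
        totals[server] += load[tenant]
    mean = sum(totals) / n_servers
    return sum((t - mean) ** 2 for t in totals)

def hill_climb(placement, load, n_servers, steps=1000):
    best, best_cost = dict(placement), placement_cost(placement, load, n_servers)
    for _ in range(steps):
        cand = dict(best)
        tenant = random.choice(list(cand))          # move one random tenant...
        cand[tenant] = random.randrange(n_servers)  # ...to a random server
        cost = placement_cost(cand, load, n_servers)
        if cost < best_cost:                        # keep only improving moves
            best, best_cost = cand, cost
    return best

load = {"t1": 5.0, "t2": 1.0, "t3": 4.0, "t4": 2.0}
start = {t: 0 for t in load}                 # everyone starts on server 0
print(hill_climb(start, load, n_servers=2))  # e.g. {'t1': 0, 't2': 0, 't3': 1, 't4': 1}
```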
Citations: 50
Iterative parallel data processing with stratosphere: an inside look
Pub Date : 2013-06-22 DOI: 10.1145/2463676.2463693
Stephan Ewen, Sebastian Schelter, K. Tzoumas, Daniel Warneke, V. Markl
Iterative algorithms occur in many domains of data analysis, such as machine learning or graph analysis. With increasing interest in running those algorithms on very large data sets, we see a need for new techniques to execute iterations in a massively parallel fashion. In prior work, we have shown how to extend and use a parallel data flow system to efficiently run iterative algorithms in a shared-nothing environment. Our approach supports the incremental processing nature of many of those algorithms. In this demonstration proposal we illustrate the process of implementing, compiling, optimizing, and executing iterative algorithms on Stratosphere using examples from graph analysis and machine learning. For the first step, we show the algorithm's code and a visualization of the produced data flow programs. The second step shows the optimizer's execution plan choices, while the last phase monitors the execution of the program, visualizing the state of the operators and additional metrics, such as per-iteration runtime and number of updates. To show that the data flow abstraction supports easy creation of custom programming APIs, we also present programs written against a lightweight Pregel API that is layered on top of our system with a small programming effort.
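
As a rough illustration of the vertex-centric model behind such a lightweight Pregel API, here is an in-memory superstep loop with an incrementally shrinking working set; the interface names are hypothetical, not Stratosphere's actual API:

```python
def pregel(vertices, edges, initial, compute, max_supersteps=50):
    """Run supersteps until no vertex changes its value."""
    value = {v: initial(v) for v in vertices}
    active = set(vertices)
    for _ in range(max_supersteps):
        # Gather messages: each active vertex sends its value along its edges.
        inbox = {v: [] for v in vertices}
        for (src, dst) in edges:
            if src in active:
                inbox[dst].append(value[src])
        changed = set()
        for v in vertices:
            new = compute(value[v], inbox[v])
            if new != value[v]:
                value[v] = new
                changed.add(v)
        if not changed:        # convergence: an empty working set ends iteration
            break
        active = changed       # incremental: only changed vertices send next round
    return value

# Connected components: propagate the minimum vertex id.
verts = [1, 2, 3, 4]
edgs = [(1, 2), (2, 1), (3, 4), (4, 3)]
print(pregel(verts, edgs, initial=lambda v: v,
             compute=lambda val, msgs: min([val] + msgs)))
# -> {1: 1, 2: 1, 3: 3, 4: 3}
```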
Citations: 37
Efficient sentiment correlation for large-scale demographics
Pub Date : 2013-06-22 DOI: 10.1145/2463676.2465317
Mikalai Tsytsarau, S. Amer-Yahia, Themis Palpanas
Analyzing sentiments of demographic groups is becoming important for the Social Web, where millions of users provide opinions on a wide variety of content. While several approaches exist for mining sentiments from product reviews or micro-blogs, little attention has been devoted to aggregating and comparing extracted sentiments for different demographic groups over time, such as 'Students in Italy' or 'Teenagers in Europe'. This problem demands efficient and scalable methods for sentiment aggregation and correlation, which account for the evolution of sentiment values, sentiment bias, and other factors associated with the special characteristics of web data. We propose a scalable approach for sentiment indexing and aggregation that works on multiple time granularities and uses incrementally updateable data structures for online operation. Furthermore, we describe efficient methods for computing meaningful sentiment correlations, which exploit pruning based on demographics and use top-k correlation compression techniques. We present an extensive experimental evaluation with both synthetic and real datasets, demonstrating the effectiveness of our pruning techniques and the efficiency of our solution.
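
A minimal sketch of an incrementally updateable, multi-granularity aggregate, in the spirit of the indexing approach described; the granularities and the (sum, count) summary are illustrative assumptions, not the paper's data structures:

```python
from collections import defaultdict
from datetime import date

class SentimentIndex:
    def __init__(self):
        # (granularity, bucket) -> [sum of sentiment values, count]
        self.agg = defaultdict(lambda: [0.0, 0])

    def insert(self, day: date, sentiment: float):
        """Online insert: update every granularity in O(#granularities)."""
        for g, bucket in (("day", day.isoformat()),
                          ("month", f"{day.year}-{day.month:02d}"),
                          ("year", str(day.year))):
            cell = self.agg[(g, bucket)]
            cell[0] += sentiment
            cell[1] += 1

    def mean(self, granularity: str, bucket: str) -> float:
        s, n = self.agg[(granularity, bucket)]
        return s / n if n else 0.0

idx = SentimentIndex()
idx.insert(date(2013, 6, 22), +0.8)
idx.insert(date(2013, 6, 23), -0.2)
print(idx.mean("month", "2013-06"))  # -> 0.3
```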
Citations: 26
Timeline index: a unified data structure for processing queries on temporal data in SAP HANA
Pub Date : 2013-06-22 DOI: 10.1145/2463676.2465293
Martin Kaufmann, Amin Amiri Manjili, Panagiotis Vagenas, Peter M. Fischer, Donald Kossmann, Franz Färber, Norman May
Managing temporal data is becoming increasingly important for many applications. Several database systems already support the time dimension, but provide only a few temporal operators, which also often exhibit poor performance characteristics. On the academic side, a large number of algorithms and data structures have been proposed, but they often address only a subset of these temporal operators. In this paper, we develop the Timeline Index as a novel, unified data structure that efficiently supports temporal operators such as temporal aggregation, time travel, and temporal joins. As the Timeline Index is independent of the physical order of the data, it provides flexibility in physical design; e.g., it supports any kind of compression scheme, which is crucial for main memory column stores. Our experiments show that the Timeline Index has predictable performance and beats state-of-the-art approaches significantly, sometimes by orders of magnitude.
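
To make the time-travel operator concrete, here is a toy version of the underlying idea, assuming a simplified event list of activations and invalidations ordered by time; this is an illustration, not SAP HANA's actual layout:

```python
class Timeline:
    def __init__(self):
        self.events = []  # (time, row_id, +1 = activation, -1 = invalidation)

    def activate(self, t, row):
        self.events.append((t, row, +1))

    def invalidate(self, t, row):
        self.events.append((t, row, -1))

    def as_of(self, t):
        """Time travel: the set of row ids visible at time t."""
        visible = set()
        for time, row, kind in sorted(self.events):
            if time > t:
                break
            if kind > 0:
                visible.add(row)
            else:
                visible.discard(row)
        return visible

tl = Timeline()
tl.activate(1, "r1"); tl.activate(2, "r2"); tl.invalidate(3, "r1")
print(tl.as_of(2))  # -> {'r1', 'r2'}
print(tl.as_of(3))  # -> {'r2'}
```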
Citations: 91
Information preservation in statistical privacy and bayesian estimation of unattributed histograms
Pub Date : 2013-06-22 DOI: 10.1145/2463676.2463721
Bing-Rong Lin, Daniel Kifer
In statistical privacy, utility refers to two concepts: information preservation -- how much statistical information is retained by a sanitizing algorithm, and usability -- how (and with how much difficulty) does one extract this information to build statistical models, answer queries, etc. Some scenarios incentivize a separation between information preservation and usability, so that the data owner first chooses a sanitizing algorithm to maximize a measure of information preservation and, afterward, the data consumers process the sanitized output according to their needs [22, 46]. We analyze a variety of utility measures and show that the average (over possible outputs of the sanitizer) error of Bayesian decision makers forms the unique class of utility measures that satisfy three axioms related to information preservation. The axioms are agnostic to Bayesian concepts such as subjective probabilities and hence strengthen support for Bayesian views in privacy research. In particular, this result connects information preservation to aspects of usability -- if the information preservation of a sanitizing algorithm should be measured as the average error of a Bayesian decision maker, shouldn't Bayesian decision theory be a good choice when it comes to using the sanitized outputs for various purposes? We put this idea to the test in the unattributed histogram problem where our decision-theoretic post-processing algorithm empirically outperforms previously proposed approaches.
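
The abstract's central quantity, the sanitizer-output-averaged error of a Bayesian decision maker, can be written in one standard form; the notation below is assumed for illustration and the paper's formalism may differ:

```latex
% Average error of a Bayesian decision maker with prior \pi over databases D,
% loss \ell over actions a, and randomized sanitizer \mathcal{A} with
% outputs \omega:
\[
  \mathrm{Err}(\mathcal{A})
  = \sum_{\omega} \min_{a} \sum_{D} \pi(D)\,\Pr[\mathcal{A}(D)=\omega]\,\ell(a, D),
\]
% where the inner \min_a picks the Bayes-optimal action after observing \omega.
```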
Citations: 25
Petabyte scale databases and storage systems at Facebook
Pub Date : 2013-06-22 DOI: 10.1145/2463676.2463713
Dhruba Borthakur
At Facebook, we use various types of databases and storage systems to satisfy the needs of different applications. The solutions built around these data store systems have a common set of requirements: they have to be highly scalable, maintenance costs should be low, and they have to perform efficiently. We use a sharded mySQL+memcache solution to support real-time access of tens of petabytes of data, and we use TAO to provide consistency of this web-scale database across geographical distances. We use the Haystack data store for storing the 3 billion new photos we host every week. We use Apache Hadoop to mine intelligence from 100 petabytes of click logs and combine it with the power of Apache HBase to store all Facebook Messages. This paper describes the reasons why each of these databases is appropriate for its workload and the design decisions and tradeoffs that were made while implementing these solutions. We touch upon the consistency, availability and partitioning tolerance of each of these solutions. We touch upon the reasons why some of these systems need ACID semantics and other systems do not. We describe the techniques we have used to map the Facebook Graph Database into a set of relational tables. We speak of how we plan to do big-data deployments across geographical locations and our requirements for a new breed of pure-memory and pure-SSD based transactional databases. Esteemed researchers in the Database Management community have benchmarked query latencies on Hive/Hadoop and found them to be less performant than a traditional parallel DBMS. We describe why these benchmarks are insufficient for Big Data deployments and why we continue to use Hadoop/Hive. We present an alternate set of benchmark techniques that measure the capacity of a database, the value/byte in that database, and the efficiency of inbuilt crowd-sourcing techniques to reduce administration costs of that database.
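
The "sharded mySQL+memcache" phrase refers to the generic read-through caching pattern over a hash-partitioned store; the sketch below shows that pattern with dict-backed stand-ins, and none of it reflects Facebook's actual infrastructure:

```python
import hashlib

N_SHARDS = 4

class DictStore:
    """Stand-in for both a memcache node and a database shard."""
    def __init__(self):
        self.data = {}
    def get(self, key): return self.data.get(key)
    def set(self, key, value): self.data[key] = value

def shard_for(key: str) -> int:
    """Stable key -> shard mapping via a hash of the key."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % N_SHARDS

def read(key, cache, shards):
    value = cache.get(key)               # 1. try the cache first
    if value is None:                    # 2. on a miss, read the owning shard
        value = shards[shard_for(key)].get(key)
        if value is not None:
            cache.set(key, value)        # 3. populate the cache for next reads
    return value

cache, shards = DictStore(), [DictStore() for _ in range(N_SHARDS)]
shards[shard_for("user:42")].set("user:42", {"name": "alice"})
print(read("user:42", cache, shards))   # miss -> shard read -> cached
print(read("user:42", cache, shards))   # now served from the cache
```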
Citations: 24
Cumulon: optimizing statistical data analysis in the cloud
Pub Date : 2013-06-22 DOI: 10.1145/2463676.2465273
Botong Huang, S. Babu, Jun Yang
We present Cumulon, a system designed to help users rapidly develop and intelligently deploy matrix-based big-data analysis programs in the cloud. Cumulon features a flexible execution model and new operators especially suited for such workloads. We show how to implement Cumulon on top of Hadoop/HDFS while avoiding limitations of MapReduce, and demonstrate Cumulon's performance advantages over existing Hadoop-based systems for statistical data analysis. To support intelligent deployment in the cloud according to time/budget constraints, Cumulon goes beyond database-style optimization to make choices automatically on not only physical operators and their parameters, but also hardware provisioning and configuration settings. We apply a suite of benchmarking, simulation, modeling, and search techniques to support effective cost-based optimization over this rich space of deployment plans.
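
As a toy illustration of cost-based deployment choice under a deadline, the sketch below enumerates candidate cluster sizes and picks the cheapest plan whose estimated runtime meets the constraint; the candidates and cost model are invented placeholders, far simpler than Cumulon's:

```python
# (name, machines, $ per machine-hour, estimated hours)
candidates = [
    ("small",  4, 0.50, 6.0),
    ("medium", 8, 0.50, 3.2),
    ("large", 16, 0.50, 1.9),
]

def choose(deadline_hours: float):
    feasible = [c for c in candidates if c[3] <= deadline_hours]
    # Among plans meeting the deadline, minimize cost = machines * rate * hours.
    return min(feasible, key=lambda c: c[1] * c[2] * c[3], default=None)

print(choose(4.0))  # -> ('medium', 8, 0.5, 3.2): $12.80 beats 'large' at $15.20
```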
Citations: 82
A direct mining approach to efficient constrained graph pattern discovery
Pub Date : 2013-06-22 DOI: 10.1145/2463676.2463723
Feida Zhu, Zequn Zhang, Qiang Qu
Despite the wealth of research on frequent graph pattern mining, how to efficiently mine the complete set of patterns under constraints still poses a huge challenge to the existing algorithms, mainly due to the inherent bottleneck in the mining paradigm. In essence, mining requests with explicitly-specified constraints cannot be handled in a way that is direct and precise. In this paper, we propose a direct mining framework to solve the problem and illustrate our ideas in the context of a particular type of constrained frequent patterns --- the "skinny" patterns, which are graph patterns with a long backbone from which short twigs branch out. These patterns, which we formally define as l-long δ-skinny patterns, are able to reveal insightful spatial and temporal trajectory patterns in mobile data mining, information diffusion, adoption propagation, and many others. Based on the key concept of a canonical diameter, we develop SkinnyMine, an efficient algorithm to mine all the l-long δ-skinny patterns guaranteeing both the completeness of our mining result as well as the unique generation of each target pattern. We also present a general direct mining framework together with two properties of reducibility and continuity for qualified constraints. Our experiments on both synthetic and real data demonstrate the effectiveness and scalability of our approach.
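
Under one plausible reading of the informal definition (a backbone of at least l edges, with every vertex within δ hops of it), a skinny-pattern check can be sketched as follows; the paper's formal definition via the canonical diameter is more involved:

```python
from collections import deque

def bfs_dist(adj, sources):
    """Hop distance from a set of source vertices to all reachable vertices."""
    dist = {s: 0 for s in sources}
    q = deque(sources)
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def is_skinny(adj, backbone, l, delta):
    """adj: vertex -> neighbors; backbone: vertex list along the long path."""
    if len(backbone) - 1 < l:                  # backbone too short
        return False
    dist = bfs_dist(adj, backbone)             # hops from the backbone
    return all(dist.get(v, float("inf")) <= delta for v in adj)

# A path 1-2-3-4 with a short twig 5 hanging off vertex 2:
adj = {1: [2], 2: [1, 3, 5], 3: [2, 4], 4: [3], 5: [2]}
print(is_skinny(adj, backbone=[1, 2, 3, 4], l=3, delta=1))  # -> True
```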
Citations: 19