P. Larson, C. Clinciu, Campbell Fraser, E. Hanson, Mostafa Mokhtar, Michal Nowakiewicz, Vassilis Papadimos, Susan Price, Srikumar Rangarajan, Remus Rusanu, Mayukh Saubhasik
SQL Server 2012 introduced two innovations targeted at data warehousing workloads: column store indexes and batch (vectorized) processing mode. Together they greatly improve the performance of typical data warehouse queries, routinely by 10X and in some cases by 100X or more. The main limitations of the initial version are addressed in the upcoming release: column store indexes are now updatable and can be used as the base storage for a table. The repertoire of batch mode operators has been expanded, existing operators have been improved, and query optimization has been enhanced. This paper gives an overview of SQL Server's column stores and batch processing, in particular the enhancements introduced in the upcoming release.
{"title":"Enhancements to SQL server column stores","authors":"P. Larson, C. Clinciu, Campbell Fraser, E. Hanson, Mostafa Mokhtar, Michal Nowakiewicz, Vassilis Papadimos, Susan Price, Srikumar Rangarajan, Remus Rusanu, Mayukh Saubhasik","doi":"10.1145/2463676.2463708","DOIUrl":"https://doi.org/10.1145/2463676.2463708","url":null,"abstract":"SQL Server 2012 introduced two innovations targeted for data warehousing workloads: column store indexes and batch (vectorized) processing mode. Together they greatly improve performance of typical data warehouse queries, routinely by 10X and in some cases by a 100X or more. The main limitations of the initial version are addressed in the upcoming release. Column store indexes are updatable and can be used as the base storage for a table. The repertoire of batch mode operators has been expanded, existing operators have been improved, and query optimization has been enhanced. This paper gives an overview of SQL Server's column stores and batch processing, in particular the enhancements introduced in the upcoming release.","PeriodicalId":87344,"journal":{"name":"Proceedings. ACM-SIGMOD International Conference on Management of Data","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2013-06-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73004273","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In statistical privacy, utility refers to two concepts: information preservation -- how much statistical information is retained by a sanitizing algorithm -- and usability -- how (and with how much difficulty) one extracts this information to build statistical models, answer queries, etc. Some scenarios incentivize a separation between information preservation and usability, so that the data owner first chooses a sanitizing algorithm to maximize a measure of information preservation and, afterward, the data consumers process the sanitized output according to their needs [22, 46]. We analyze a variety of utility measures and show that the average (over possible outputs of the sanitizer) error of Bayesian decision makers forms the unique class of utility measures that satisfy three axioms related to information preservation. The axioms are agnostic to Bayesian concepts such as subjective probabilities and hence strengthen support for Bayesian views in privacy research. In particular, this result connects information preservation to aspects of usability -- if the information preservation of a sanitizing algorithm should be measured as the average error of a Bayesian decision maker, shouldn't Bayesian decision theory be a good choice when it comes to using the sanitized outputs for various purposes? We put this idea to the test in the unattributed histogram problem, where our decision-theoretic post-processing algorithm empirically outperforms previously proposed approaches.
{"title":"Information preservation in statistical privacy and bayesian estimation of unattributed histograms","authors":"Bing-Rong Lin, Daniel Kifer","doi":"10.1145/2463676.2463721","DOIUrl":"https://doi.org/10.1145/2463676.2463721","url":null,"abstract":"In statistical privacy, utility refers to two concepts: information preservation -- how much statistical information is retained by a sanitizing algorithm, and usability -- how (and with how much difficulty) does one extract this information to build statistical models, answer queries, etc. Some scenarios incentivize a separation between information preservation and usability, so that the data owner first chooses a sanitizing algorithm to maximize a measure of information preservation and, afterward, the data consumers process the sanitized output according to their needs [22, 46].\u0000 We analyze a variety of utility measures and show that the average (over possible outputs of the sanitizer) error of Bayesian decision makers forms the unique class of utility measures that satisfy three axioms related to information preservation. The axioms are agnostic to Bayesian concepts such as subjective probabilities and hence strengthen support for Bayesian views in privacy research. In particular, this result connects information preservation to aspects of usability -- if the information preservation of a sanitizing algorithm should be measured as the average error of a Bayesian decision maker, shouldn't Bayesian decision theory be a good choice when it comes to using the sanitized outputs for various purposes? We put this idea to the test in the unattributed histogram problem where our decision- theoretic post-processing algorithm empirically outperforms previously proposed approaches.","PeriodicalId":87344,"journal":{"name":"Proceedings. ACM-SIGMOD International Conference on Management of Data","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2013-06-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76089754","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
R. Ananthanarayanan, Venkatesh Basker, Sumit Das, A. Gupta, H. Jiang, Tianhao Qiu, Alexey Reznichenko, D.Yu. Ryabkov, Manpreet Singh, S. Venkataraman
Web-based enterprises process events generated by millions of users interacting with their websites. Rich statistical data distilled from combining such interactions in near real-time generates enormous business value. In this paper, we describe the architecture of Photon, a geographically distributed system for joining multiple continuously flowing streams of data in real-time with high scalability and low latency, where the streams may be unordered or delayed. The system fully tolerates infrastructure degradation and datacenter-level outages without any manual intervention. Photon guarantees that there will be no duplicates in the joined output at any point in time (at-most-once semantics), that most joinable events will be present in the output in real-time (near-exact semantics), and that exactly-once semantics are achieved eventually. Photon is deployed within the Google Advertising System to join data streams such as web search queries and user clicks on advertisements. It produces joined logs that are used to derive key business metrics, including billing for advertisers. Our production deployment processes millions of events per minute at peak with an average end-to-end latency of less than 10 seconds. We also present challenges and solutions in maintaining large persistent state across geographically distant locations, and highlight the design principles that emerged from our experience.
{"title":"Photon: fault-tolerant and scalable joining of continuous data streams","authors":"R. Ananthanarayanan, Venkatesh Basker, Sumit Das, A. Gupta, H. Jiang, Tianhao Qiu, Alexey Reznichenko, D.Yu. Ryabkov, Manpreet Singh, S. Venkataraman","doi":"10.1145/2463676.2465272","DOIUrl":"https://doi.org/10.1145/2463676.2465272","url":null,"abstract":"Web-based enterprises process events generated by millions of users interacting with their websites. Rich statistical data distilled from combining such interactions in near real-time generates enormous business value. In this paper, we describe the architecture of Photon, a geographically distributed system for joining multiple continuously flowing streams of data in real-time with high scalability and low latency, where the streams may be unordered or delayed. The system fully tolerates infrastructure degradation and datacenter-level outages without any manual intervention. Photon guarantees that there will be no duplicates in the joined output (at-most-once semantics) at any point in time, that most joinable events will be present in the output in real-time (near-exact semantics), and exactly-once semantics eventually.\u0000 Photon is deployed within Google Advertising System to join data streams such as web search queries and user clicks on advertisements. It produces joined logs that are used to derive key business metrics, including billing for advertisers. Our production deployment processes millions of events per minute at peak with an average end-to-end latency of less than 10 seconds. We also present challenges and solutions in maintaining large persistent state across geographically distant locations, and highlight the design principles that emerged from our experience.","PeriodicalId":87344,"journal":{"name":"Proceedings. ACM-SIGMOD International Conference on Management of Data","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2013-06-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76421080","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Analyzing the sentiments of demographic groups is becoming important for the Social Web, where millions of users provide opinions on a wide variety of content. While several approaches exist for mining sentiments from product reviews or micro-blogs, little attention has been devoted to aggregating and comparing extracted sentiments for different demographic groups over time, such as 'Students in Italy' or 'Teenagers in Europe'. This problem demands efficient and scalable methods for sentiment aggregation and correlation, which account for the evolution of sentiment values, sentiment bias, and other factors associated with the special characteristics of web data. We propose a scalable approach for sentiment indexing and aggregation that works on multiple time granularities and uses incrementally updateable data structures for online operation. Furthermore, we describe efficient methods for computing meaningful sentiment correlations, which exploit demographics-based pruning and compression of top-k correlations. We present an extensive experimental evaluation with both synthetic and real datasets, demonstrating the effectiveness of our pruning techniques and the efficiency of our solution.
{"title":"Efficient sentiment correlation for large-scale demographics","authors":"Mikalai Tsytsarau, S. Amer-Yahia, Themis Palpanas","doi":"10.1145/2463676.2465317","DOIUrl":"https://doi.org/10.1145/2463676.2465317","url":null,"abstract":"Analyzing sentiments of demographic groups is becoming important for the Social Web, where millions of users provide opinions on a wide variety of content. While several approaches exist for mining sentiments from product reviews or micro-blogs, little attention has been devoted to aggregating and comparing extracted sentiments for different demographic groups over time, such as 'Students in Italy' or 'Teenagers in Europe'. This problem demands efficient and scalable methods for sentiment aggregation and correlation, which account for the evolution of sentiment values, sentiment bias, and other factors associated with the special characteristics of web data. We propose a scalable approach for sentiment indexing and aggregation that works on multiple time granularities and uses incrementally updateable data structures for online operation. Furthermore, we describe efficient methods for computing meaningful sentiment correlations, which exploit pruning based on demographics and use top-k correlations compression techniques. We present an extensive experimental evaluation with both synthetic and real datasets, demonstrating the effectiveness of our pruning techniques and the efficiency of our solution.","PeriodicalId":87344,"journal":{"name":"Proceedings. ACM-SIGMOD International Conference on Management of Data","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2013-06-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79567694","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Over the past decade, global securities markets have changed dramatically. The evolution of market structure, in combination with advances in computer technology, led to the emergence of electronic securities trading. Securities transactions that used to be conducted in person and over the phone are now predominantly executed by automated trading systems. This has resulted in significant fragmentation of the markets, a vast increase in exchange volumes, and an even greater increase in the number of orders. In this talk we present and analyze the forces behind the wide proliferation of electronic securities trading in US stock and options markets. We also give a high-level introduction to electronic securities market structure. We discuss the trading objectives of different classes of market participants and analyze how their activity affects data volumes. We also present a typical securities trading firm's data flow and analyze the various types of data it uses in its trading operations. We close with the implications this "sea change" has on DBMS requirements in capital markets.
{"title":"Big data in capital markets","authors":"A. Nazaruk, M. Rauchman","doi":"10.1145/2463676.2486082","DOIUrl":"https://doi.org/10.1145/2463676.2486082","url":null,"abstract":"Over the past decade global securities markets have dramatically changed. Evolution of market structure in combination with advances in computer technologies led to emergence of electronic securities trading. Securities transactions that used to be conducted in person and over the phone are now predominantly executed by automated trading systems. This resulted in significant fragmentation of the markets, vast increase in the exchange volumes and even greater increase in the number of orders.\u0000 In this talk we present and analyze forces behind the wide proliferation of electronic securities trading in US stocks and options markets. We also make a high-level introduction into electronic securities market structure. We discuss trading objectives of different classes of market participants and analyze how their activity affects data volumes. We also present typical securities trading firm data flow and analyze various types of data it uses in its trading operations.\u0000 We close with the implications this \"sea change\" has on DBMS requirements in capital markets.","PeriodicalId":87344,"journal":{"name":"Proceedings. ACM-SIGMOD International Conference on Management of Data","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2013-06-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79619364","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fast and accurate estimates for complex queries are profoundly beneficial for large databases with heavy workloads. In this research, we propose a statistical summary for a database, called CS2 (Correlated Sample Synopsis), to provide rapid and accurate result-size estimates for all queries with joins and arbitrary selections. Unlike state-of-the-art techniques, CS2 does not rely entirely on simple random samples, but mainly consists of correlated sample tuples that retain join relationships with less storage. We introduce a statistical technique called the reverse sample and design a powerful estimator, called the reverse estimator, to fully utilize correlated sample tuples for query estimation. We prove both theoretically and empirically that the reverse estimator is unbiased and accurate when used with CS2. Extensive experiments on multiple datasets show that CS2 is fast to construct and derives more accurate estimates than existing methods with the same space budget.
{"title":"CS2: a new database synopsis for query estimation","authors":"Feng Yu, W. Hou, Cheng Luo, D. Che, Mengxia Zhu","doi":"10.1145/2463676.2463701","DOIUrl":"https://doi.org/10.1145/2463676.2463701","url":null,"abstract":"Fast and accurate estimations for complex queries are profoundly beneficial for large databases with heavy workloads. In this research, we propose a statistical summary for a database, called CS2 (Correlated Sample Synopsis), to provide rapid and accurate result size estimations for all queries with joins and arbitrary selections. Unlike the state-of-the-art techniques, CS2 does not completely rely on simple random samples, but mainly consists of correlated sample tuples that retain join relationships with less storage. We introduce a statistical technique, called reverse sample, and design a powerful estimator, called reverse estimator, to fully utilize correlated sample tuples for query estimation. We prove both theoretically and empirically that the reverse estimator is unbiased and accurate using CS2. Extensive experiments on multiple datasets show that CS2 is fast to construct and derives more accurate estimations than existing methods with the same space budget.","PeriodicalId":87344,"journal":{"name":"Proceedings. ACM-SIGMOD International Conference on Management of Data","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2013-06-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76733782","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cheng Long, R. C. Wong, Philip S. Yu, Minhao Jiang
Bichromatic reverse nearest neighbor (BRNN) queries have been studied extensively in the spatial database literature. Given a set P of service-providers and a set O of customers, a BRNN query finds which customers in O are "interested" in a given service-provider in P. Recently, it has been observed that such queries do not take into account the capacities of service-providers or the demands of customers. To address this issue, several spatial matching problems have been proposed; these, however, cannot be used for some real-life applications, such as emergency facility allocation, where the maximum matching cost (or distance) should be minimized. In this paper, we propose a new problem called Spatial Matching for Minimizing Maximum matching distance (SPM-MM). We then design two algorithms for SPM-MM, Threshold-Adapt and Swap-Chain. Threshold-Adapt is simple and easy to understand, but not scalable to large datasets due to its relatively high time/space complexity. Swap-Chain, which follows a fundamentally different idea from Threshold-Adapt, runs faster than Threshold-Adapt by orders of magnitude and uses significantly less memory. We conducted extensive empirical studies that verify the efficiency and scalability of Swap-Chain.
{"title":"On optimal worst-case matching","authors":"Cheng Long, R. C. Wong, Philip S. Yu, Minhao Jiang","doi":"10.1145/2463676.2465321","DOIUrl":"https://doi.org/10.1145/2463676.2465321","url":null,"abstract":"Bichromatic reverse nearest neighbor (BRNN) queries have been studied extensively in the literature of spatial databases. Given a set P of service-providers and a set O of customers, a BRNN query is to find which customers in O are \"interested\" in a given service-provider in P. Recently, it has been found that this kind of queries lacks the consideration of the capacities of service-providers and the demands of customers. In order to address this issue, some spatial matching problems have been proposed, which, however, cannot be used for some real-life applications like emergency facility allocation where the maximum matching cost (or distance) should be minimized. In this paper, we propose a new problem called Spatial Matching for Minimizing Maximum matching distance (SPM-MM). Then, we design two algorithms for SPM-MM, Threshold-Adapt and Swap-Chain. Threshold-Adapt is simple and easy to understand but not scalable to large datasets due to its relatively high time/space complexity. Swap-Chain, which follows a fundamentally different idea from Threshold-Adapt, runs faster than Threshold-Adapt by orders of magnitude and uses significantly less memory. We conducted extensive empirical studies which verified the efficiency and scalability of Swap-Chain.","PeriodicalId":87344,"journal":{"name":"Proceedings. ACM-SIGMOD International Conference on Management of Data","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2013-06-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73900038","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Various computational procedures and constraint-based methods for data repairing have been proposed over the last decades to identify errors and, when possible, correct them. However, these approaches have several limitations, including poor scalability and the questionable quality of the values used to replace the errors. In this paper, we propose a new data repairing approach based on maximizing the likelihood of replacement data given the data distribution, which can be modeled using statistical machine learning techniques. This is a novel approach that combines machine learning and likelihood methods for cleaning dirty databases by value modification. We develop a quality measure of the repairing updates based on the likelihood benefit and the amount of changes applied to the database. We propose SCARE (SCalable Automatic REpairing), a systematic and scalable framework that follows our approach. SCARE relies on a robust mechanism for horizontal data partitioning and a combination of machine learning techniques to predict the set of possible updates. Due to data partitioning, several updates can be predicted for a single record based on local views of each data partition. We therefore propose a mechanism to combine the local predictions into accurate final predictions. Finally, we experimentally demonstrate the effectiveness, efficiency, and scalability of our approach on real-world datasets in comparison to recent data cleaning approaches.
{"title":"Don't be SCAREd: use SCalable Automatic REpairing with maximal likelihood and bounded changes","authors":"M. Yakout, Laure Berti-Équille, A. Elmagarmid","doi":"10.1145/2463676.2463706","DOIUrl":"https://doi.org/10.1145/2463676.2463706","url":null,"abstract":"Various computational procedures or constraint-based methods for data repairing have been proposed over the last decades to identify errors and, when possible, correct them. However, these approaches have several limitations including the scalability and quality of the values to be used in replacement of the errors. In this paper, we propose a new data repairing approach that is based on maximizing the likelihood of replacement data given the data distribution, which can be modeled using statistical machine learning techniques. This is a novel approach combining machine learning and likelihood methods for cleaning dirty databases by value modification. We develop a quality measure of the repairing updates based on the likelihood benefit and the amount of changes applied to the database. We propose SCARE (SCalable Automatic REpairing), a systematic scalable framework that follows our approach. SCARE relies on a robust mechanism for horizontal data partitioning and a combination of machine learning techniques to predict the set of possible updates. Due to data partitioning, several updates can be predicted for a single record based on local views on each data partition. Therefore, we propose a mechanism to combine the local predictions and obtain accurate final predictions. Finally, we experimentally demonstrate the effectiveness, efficiency, and scalability of our approach on real-world datasets in comparison to recent data cleaning approaches.","PeriodicalId":87344,"journal":{"name":"Proceedings. ACM-SIGMOD International Conference on Management of Data","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2013-06-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73100375","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We present Cumulon, a system designed to help users rapidly develop and intelligently deploy matrix-based big-data analysis programs in the cloud. Cumulon features a flexible execution model and new operators especially suited for such workloads. We show how to implement Cumulon on top of Hadoop/HDFS while avoiding limitations of MapReduce, and demonstrate Cumulon's performance advantages over existing Hadoop-based systems for statistical data analysis. To support intelligent deployment in the cloud according to time/budget constraints, Cumulon goes beyond database-style optimization to make choices automatically on not only physical operators and their parameters, but also hardware provisioning and configuration settings. We apply a suite of benchmarking, simulation, modeling, and search techniques to support effective cost-based optimization over this rich space of deployment plans.
{"title":"Cumulon: optimizing statistical data analysis in the cloud","authors":"Botong Huang, S. Babu, Jun Yang","doi":"10.1145/2463676.2465273","DOIUrl":"https://doi.org/10.1145/2463676.2465273","url":null,"abstract":"We present Cumulon, a system designed to help users rapidly develop and intelligently deploy matrix-based big-data analysis programs in the cloud. Cumulon features a flexible execution model and new operators especially suited for such workloads. We show how to implement Cumulon on top of Hadoop/HDFS while avoiding limitations of MapReduce, and demonstrate Cumulon's performance advantages over existing Hadoop-based systems for statistical data analysis. To support intelligent deployment in the cloud according to time/budget constraints, Cumulon goes beyond database-style optimization to make choices automatically on not only physical operators and their parameters, but also hardware provisioning and configuration settings. We apply a suite of benchmarking, simulation, modeling, and search techniques to support effective cost-based optimization over this rich space of deployment plans.","PeriodicalId":87344,"journal":{"name":"Proceedings. ACM-SIGMOD International Conference on Management of Data","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2013-06-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90142903","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Despite the wealth of research on frequent graph pattern mining, efficiently mining the complete set of patterns under constraints still poses a huge challenge to existing algorithms, mainly due to an inherent bottleneck in the mining paradigm. In essence, mining requests with explicitly-specified constraints cannot be handled directly and precisely. In this paper, we propose a direct mining framework to solve the problem and illustrate our ideas in the context of a particular type of constrained frequent pattern --- the "skinny" pattern, a graph pattern with a long backbone from which short twigs branch out. These patterns, which we formally define as l-long δ-skinny patterns, can reveal insightful spatial and temporal trajectory patterns in mobile data mining, information diffusion, adoption propagation, and many other settings. Based on the key concept of a canonical diameter, we develop SkinnyMine, an efficient algorithm that mines all l-long δ-skinny patterns while guaranteeing both the completeness of the mining result and the unique generation of each target pattern. We also present a general direct mining framework together with two properties, reducibility and continuity, for qualified constraints. Our experiments on both synthetic and real data demonstrate the effectiveness and scalability of our approach.
{"title":"A direct mining approach to efficient constrained graph pattern discovery","authors":"Feida Zhu, Zequn Zhang, Qiang Qu","doi":"10.1145/2463676.2463723","DOIUrl":"https://doi.org/10.1145/2463676.2463723","url":null,"abstract":"Despite the wealth of research on frequent graph pattern mining, how to efficiently mine the complete set of those with constraints still poses a huge challenge to the existing algorithms mainly due to the inherent bottleneck in the mining paradigm. In essence, mining requests with explicitly-specified constraints cannot be handled in a way that is direct and precise. In this paper, we propose a direct mining framework to solve the problem and illustrate our ideas in the context of a particular type of constrained frequent patterns --- the \"skinny\" patterns, which are graph patterns with a long backbone from which short twigs branch out. These patterns, which we formally define as l-long δ-skinny patterns, are able to reveal insightful spatial and temporal trajectory patterns in mobile data mining, information diffusion, adoption propagation, and many others.\u0000 Based on the key concept of a canonical diameter, we develop SkinnyMine, an efficient algorithm to mine all the l-long δ-skinny patterns guaranteeing both the completeness of our mining result as well as the unique generation of each target pattern. We also present a general direct mining framework together with two properties of reducibility and continuity for qualified constraints. Our experiments on both synthetic and real data demonstrate the effectiveness and scalability of our approach.","PeriodicalId":87344,"journal":{"name":"Proceedings. ACM-SIGMOD International Conference on Management of Data","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2013-06-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90615953","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}