M. Esfandiari, Kavan Patel, S. Amer-Yahia, Senjuti Basu Roy
We propose to demonstrate CrowdCur, a system that allows platform administrators, requesters, and workers to conduct various analytics of interest. CrowdCur includes a worker curation component that relies on explicit feedback elicitation to best capture workers' preferences, a task curation component that monitors task completion and aggregates task statistics, and an OLAP-style component to query and combine analytics by worker, by task type, etc. Administrators can fine-tune their system's performance. Requesters can compare platforms and better choose the set of workers to target. Workers can compare themselves to others and find the tasks and requesters that suit them best.
{"title":"Crowdsourcing Analytics With CrowdCur","authors":"M. Esfandiari, Kavan Patel, S. Amer-Yahia, Senjuti Basu Roy","doi":"10.1145/3183713.3193563","DOIUrl":"https://doi.org/10.1145/3183713.3193563","url":null,"abstract":"We propose to demonstrate CrowdCur xspace, a system that allows platform administrators, requesters, and workers to conduct various analytics of interest. CrowdCur xspace includes a worker curation component that relies on explicit feedback elicitation to best capture workers' preferences, a task curation component that monitors task completion and aggregates their statistics, and an OLAP-style component to query and combine analytics by a worker, by task type, etc. Administrators can fine tune their system's performance. Requesters can compare platforms and better choose the set of workers to target. Workers can compare themselves to others and find tasks and requesters that suit them best.","PeriodicalId":20430,"journal":{"name":"Proceedings of the 2018 International Conference on Management of Data","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2018-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86083749","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cardinality estimation is a crucial task in query optimization and typically relies on heuristics and basic statistical approximations. At execution time, estimation errors can cause intermediate result sizes to differ from the estimated ones, so that the originally chosen plan is no longer optimal. In this paper we analyze this deviation from the estimate, and denote as the optimality range the range of cardinalities of an intermediate result within which the optimal plan remains optimal. While previous work used simple heuristics to calculate similar ranges, we generate the precise bounds of the optimality range, considering all relevant plan alternatives. Our experimental results show that the fixed optimality ranges used in previous work fail to characterize the range of cardinalities in which a plan is optimal. We derive theoretical worst-case bounds on the number of enumerated plans required to compute the precise optimality range, and experimentally show that in real queries this number is significantly smaller. Our experiments also show the benefit for applications like Mid-Query Re-Optimization in terms of significant execution time improvements.
{"title":"On the Calculation of Optimality Ranges for Relational Query Execution Plans","authors":"Florian Wolf, Norman May, P. Willems, K. Sattler","doi":"10.1145/3183713.3183742","DOIUrl":"https://doi.org/10.1145/3183713.3183742","url":null,"abstract":"Cardinality estimation is a crucial task in query optimization and typically relies on heuristics and basic statistical approximations. At execution time, estimation errors might result in situations where intermediate result sizes may differ from the estimated ones, so that the originally chosen plan is not the optimal plan anymore. In this paper we analyze the deviation from the estimate, and denote the cardinality range of an intermediate result, where the optimal plan remains optimal as the optimality range. While previous work used simple heuristics to calculate similar ranges, we generate the precise bounds for the optimality range considering all relevant plan alternatives. Our experimental results show that the fixed optimality ranges used in previous work fail to characterize the range of cardinalities where a plan is optimal. We derive theoretical worst case bounds for the number of enumerated plans required to compute the precise optimality range, and experimentally show that in real queries this number is significantly smaller. Our experiments also show the benefit for applications like Mid-Query Re-Optimization in terms of significant execution time improvement.","PeriodicalId":20430,"journal":{"name":"Proceedings of the 2018 International Conference on Management of Data","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2018-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79765130","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Despite decades of research, modern database systems still struggle with multi-join queries. Users often experience long wait times that occur with unpredictable frequency, detracting from the usability of the system. In this work we develop a new method to tighten join cardinality upper bounds. The intention for these bounds is to assist the query optimizer (QO) in avoiding expensive physical join plans. Our approach is as follows: leveraging data sketching and randomized hashing, we generate and tighten theoretical join cardinality upper bounds. We outline our base data structures and methodology, and how these bounds may be introduced to a traditional QO framework as a new statistic for physical join plan selection. We evaluate the tightness of our bounds on GooglePlus community graphs and successfully generate order-of-magnitude upper bounds even in the presence of multi-way cyclic joins.
{"title":"Tighter Upper Bounds for Join Cardinality Estimates","authors":"Walter Cai","doi":"10.1145/3183713.3183714","DOIUrl":"https://doi.org/10.1145/3183713.3183714","url":null,"abstract":"1 PROBLEM AND MOTIVATION Despite decades of research, modern database systems still struggle with multijoin queries. Users will often experience long wait times occurring with unpredictable frequency detracting from the usability of the system. In this work we develop a new method to tighten join cardinality upper bounds. The intention for these bounds is to assist the query optimizer (QO) in avoiding expensive physical join plans. Our approach is as follows: leveraging data sketching, and randomized hashing we generate and tighten theoretical join cardinality upper bounds. We outline our base data structures and methodology, and how these bounds may be introduced to a traditional QO framework as a new statistic for physical join plan selection. We evaluate the tightness of our bounds on GooglePlus community graphs and successfully generate degree of magnitude upper bounds even in the presence of multiway cyclic joins.","PeriodicalId":20430,"journal":{"name":"Proceedings of the 2018 International Conference on Management of Data","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2018-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75281848","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Huanchen Zhang, Hyeontaek Lim, Viktor Leis, D. Andersen, M. Kaminsky, Kimberly Keeton, Andrew Pavlo
We present the Succinct Range Filter (SuRF), a fast and compact data structure for approximate membership tests. Unlike traditional Bloom filters, SuRF supports both single-key lookups and common range queries: open-range queries, closed-range queries, and range counts. SuRF is based on a new data structure called the Fast Succinct Trie (FST) that matches the point and range query performance of state-of-the-art order-preserving indexes, while consuming only 10 bits per trie node. The false positive rates in SuRF for both point and range queries are tunable to satisfy different application needs. We evaluate SuRF in RocksDB as a replacement for its Bloom filters to reduce I/O by filtering requests before they access on-disk data structures. Our experiments on a 100 GB dataset show that replacing RocksDB's Bloom filters with SuRFs speeds up open-seek (without upper-bound) and closed-seek (with upper-bound) queries by up to 1.5× and 5× with a modest cost on the worst-case (all-missing) point query throughput due to slightly higher false positive rate.
{"title":"SuRF: Practical Range Query Filtering with Fast Succinct Tries","authors":"Huanchen Zhang, Hyeontaek Lim, Viktor Leis, D. Andersen, M. Kaminsky, Kimberly Keeton, Andrew Pavlo","doi":"10.1145/3183713.3196931","DOIUrl":"https://doi.org/10.1145/3183713.3196931","url":null,"abstract":"We present the Succinct Range Filter (SuRF), a fast and compact data structure for approximate membership tests. Unlike traditional Bloom filters, SuRF supports both single-key lookups and common range queries: open-range queries, closed-range queries, and range counts. SuRF is based on a new data structure called the Fast Succinct Trie (FST) that matches the point and range query performance of state-of-the-art order-preserving indexes, while consuming only 10 bits per trie node. The false positive rates in SuRF for both point and range queries are tunable to satisfy different application needs. We evaluate SuRF in RocksDB as a replacement for its Bloom filters to reduce I/O by filtering requests before they access on-disk data structures. Our experiments on a 100 GB dataset show that replacing RocksDB's Bloom filters with SuRFs speeds up open-seek (without upper-bound) and closed-seek (with upper-bound) queries by up to 1.5× and 5× with a modest cost on the worst-case (all-missing) point query throughput due to slightly higher false positive rate.","PeriodicalId":20430,"journal":{"name":"Proceedings of the 2018 International Conference on Management of Data","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2018-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78478051","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yang Zhou, Tong Yang, Jie Jiang, B. Cui, Minlan Yu, Xiaoming Li, S. Uhlig
Approximate stream processing algorithms, such as Count-Min sketch and Space-Saving, support numerous applications in databases, storage systems, networking, and other domains. However, the unbalanced distribution in real data streams poses great challenges to existing algorithms. To enhance these algorithms, we propose a meta-framework, called Cold Filter (CF), that enables faster and more accurate stream processing. Unlike existing filters, which mainly focus on hot items, our filter captures cold items in the first stage and hot items in the second stage. Also, existing filters require two-way communication, with frequent exchanges between the two stages; our filter, on the other hand, is one-way: each item enters one stage at most once. Our filter can accurately estimate both cold and hot items, giving it a genericity that makes it applicable to many stream processing tasks. To illustrate the benefits of our filter, we deploy it on three typical stream processing tasks; experimental results show speed improvements of up to 4.7 times and accuracy improvements of up to 51 times. All source code is publicly available on GitHub.
{"title":"Cold Filter: A Meta-Framework for Faster and More Accurate Stream Processing","authors":"Yang Zhou, Tong Yang, Jie Jiang, B. Cui, Minlan Yu, Xiaoming Li, S. Uhlig","doi":"10.1145/3183713.3183726","DOIUrl":"https://doi.org/10.1145/3183713.3183726","url":null,"abstract":"Approximate stream processing algorithms, such as Count-Min sketch, Space-Saving, etc., support numerous applications in databases, storage systems, networking, and other domains. However, the unbalanced distribution in real data streams poses great challenges to existing algorithms. To enhance these algorithms, we propose a meta-framework, called Cold Filter (CF), that enables faster and more accurate stream processing. Different from existing filters that mainly focus on hot items, our filter captures cold items in the first stage, and hot items in the second stage. Also, existing filters require two-direction communication - with frequent exchanges between the two stages; our filter on the other hand is one-direction - each item enters one stage at most once. Our filter can accurately estimate both cold and hot items, giving it a genericity that makes it applicable to many stream processing tasks. To illustrate the benefits of our filter, we deploy it on three typical stream processing tasks and experimental results show speed improvements of up to 4.7 times, and accuracy improvements of up to 51 times. All source code is made publicly available at Github.","PeriodicalId":20430,"journal":{"name":"Proceedings of the 2018 International Conference on Management of Data","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2018-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78561034","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yao Lu, Aakanksha Chowdhery, Srikanth Kandula, S. Chaudhuri
Classic query optimization techniques, including predicate pushdown, are of limited use for machine learning inference queries, because the user-defined functions (UDFs) which extract relational columns from unstructured inputs are often very expensive; query predicates will remain stuck behind these UDFs if they happen to require relational columns that are generated by the UDFs. In this work, we demonstrate constructing and applying probabilistic predicates to filter data blobs that do not satisfy the query predicate; such filtering is parametrized to different target accuracies. Furthermore, to support complex predicates and to avoid per-query training, we augment a cost-based query optimizer to choose plans with appropriate combinations of simpler probabilistic predicates. Experiments with several machine learning workloads on a big-data cluster show that query processing improves by as much as 10x.
{"title":"Accelerating Machine Learning Inference with Probabilistic Predicates","authors":"Yao Lu, Aakanksha Chowdhery, Srikanth Kandula, S. Chaudhuri","doi":"10.1145/3183713.3183751","DOIUrl":"https://doi.org/10.1145/3183713.3183751","url":null,"abstract":"Classic query optimization techniques, including predicate pushdown, are of limited use for machine learning inference queries, because the user-defined functions (UDFs) which extract relational columns from unstructured inputs are often very expensive; query predicates will remain stuck behind these UDFs if they happen to require relational columns that are generated by the UDFs. In this work, we demonstrate constructing and applying probabilistic predicates to filter data blobs that do not satisfy the query predicate; such filtering is parametrized to different target accuracies. Furthermore, to support complex predicates and to avoid per-query training, we augment a cost-based query optimizer to choose plans with appropriate combinations of simpler probabilistic predicates. Experiments with several machine learning workloads on a big-data cluster show that query processing improves by as much as 10x.","PeriodicalId":20430,"journal":{"name":"Proceedings of the 2018 International Conference on Management of Data","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2018-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75577452","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jian Dai, Meihui Zhang, Gang Chen, Ju Fan, K. Ngiam, B. Ooi
To unlock the wealth of healthcare data, we often need to link real-world text snippets to the medical concepts they refer to, as given by canonical descriptions. However, existing healthcare concept linking methods, such as dictionary-based and simple machine learning methods, are not effective, due to the word discrepancy between the text snippet and the canonical concept description, and the overlapping meaning among fine-grained concepts. To address these challenges, we propose a Neural Concept Linking (NCL) approach for accurate concept linking using systematically integrated neural networks. We call the novel neural network architecture the COMposite AttentIonal encode-Decode neural network (COM-AID). COM-AID performs an encode-decode process that encodes a concept into a vector and decodes the vector into a text snippet with the help of two devised contexts. On the one hand, it injects the textual context into the neural network through the attention mechanism, so that the word discrepancy can be overcome from the semantic perspective. On the other hand, it incorporates the structural context into the neural network through the attention mechanism, so that minor concept meaning differences can be enlarged and effectively differentiated. Empirical studies on two real-world datasets confirm that NCL produces accurate concept linking results and significantly outperforms state-of-the-art techniques.
{"title":"Fine-grained Concept Linking using Neural Networks in Healthcare","authors":"Jian Dai, Meihui Zhang, Gang Chen, Ju Fan, K. Ngiam, B. Ooi","doi":"10.1145/3183713.3196907","DOIUrl":"https://doi.org/10.1145/3183713.3196907","url":null,"abstract":"To unlock the wealth of the healthcare data, we often need to link the real-world text snippets to the referred medical concepts described by the canonical descriptions. However, existing healthcare concept linking methods, such as dictionary-based and simple machine learning methods, are not effective due to the word discrepancy between the text snippet and the canonical concept description, and the overlapping concept meaning among the fine-grained concepts. To address these challenges, we propose a Neural Concept Linking (NCL) approach for accurate concept linking using systematically integrated neural networks. We call the novel neural network architecture as the COMposite AttentIonal encode-Decode neural network (COM-AID). COM-AID performs an encode-decode process that encodes a concept into a vector and decodes the vector into a text snippet with the help of two devised contexts. On the one hand, it injects the textual context into the neural network through the attention mechanism, so that the word discrepancy can be overcome from the semantic perspective. On the other hand, it incorporates the structural context into the neural network through the attention mechanism, so that minor concept meaning differences can be enlarged and effectively differentiated. Empirical studies on two real-world datasets confirm that the NCL produces accurate concept linking results and significantly outperforms state-of-the-art techniques.","PeriodicalId":20430,"journal":{"name":"Proceedings of the 2018 International Conference on Management of Data","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2018-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80129732","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper we study the problem of supporting similarity queries on a large number of records using a vector space model, where each record is a bag of tokens. We consider similarity functions that incorporate non-negative global token weights as well as record-specific token degrees. We develop a family of algorithms based on an inverted index for large data sets, especially for the case of using external storage such as hard disks or flash drives, and present pruning techniques based on various bounds to improve their performance. We formally prove the correctness of these techniques, and show how to achieve better pruning power by iteratively tightening these bounds to exactly filter dissimilar records. We conduct an extensive experimental study using real, large-scale data sets based on different storage platforms, including memory, hard disks, and flash drives. The results show that these algorithms and techniques can efficiently support similarity queries on large data sets.
{"title":"ZigZag: Supporting Similarity Queries on Vector Space Models","authors":"Wenhai Li, Lingfeng Deng, Yang Li, Chen Li","doi":"10.1145/3183713.3196936","DOIUrl":"https://doi.org/10.1145/3183713.3196936","url":null,"abstract":"In this paper we study the problem of supporting similarity queries on a large number of records using a vector space model, where each record is a bag of tokens. We consider similarity functions that incorporate non-negative global token weights as well as record-specific token degrees. We develop a family of algorithms based on an inverted index for large data sets, especially for the case of using external storage such as hard disks or flash drives, and present pruning techniques based on various bounds to improve their performance. We formally prove the correctness of these techniques, and show how to achieve better pruning power by iteratively tightening these bounds to exactly filter dissimilar records. We conduct an extensive experimental study using real, large-scale data sets based on different storage platforms, including memory, hard disks, and flash drives. The results show that these algorithms and techniques can efficiently support similarity queries on large data sets.","PeriodicalId":20430,"journal":{"name":"Proceedings of the 2018 International Conference on Management of Data","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2018-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81770235","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Machine learning has made great strides in recent years, and its applications are spreading rapidly. Unfortunately, the standard machine learning formulation does not match well with data management problems. For example, most learning algorithms assume that the data is contained in a single table, and consists of i.i.d. (independent and identically distributed) samples. This leads to a proliferation of ad hoc solutions, slow development, and suboptimal results. Fortunately, a body of machine learning theory and practice is being developed that dispenses with such assumptions, and promises to make machine learning for data management much easier and more effective [1]. In particular, representations like Markov logic, which includes many types of deep networks as special cases, allow us to define very rich probability distributions over non-i.i.d., multi-relational data [2]. Despite their generality, learning the parameters of these models is still a convex optimization problem, allowing for efficient solution. Learning structure (in the case of Markov logic, a set of formulas in first-order logic) is intractable, as in more traditional representations, but can be done effectively using inductive logic programming techniques. Inference is performed using probabilistic generalizations of theorem proving, and takes linear time and space in tractable Markov logic, an object-oriented specialization of Markov logic [3]. These techniques have led to state-of-the-art, principled solutions to problems like entity resolution, schema matching, ontology alignment, and information extraction. Using tractable Markov logic, we have extracted from the Web a probabilistic knowledge base with millions of objects and billions of parameters, which can be queried exactly in subsecond times using an RDBMS backend [3]. With these foundations in place, we expect the pace of machine learning applications in data management to continue to accelerate in coming years.
{"title":"Machine Learning for Data Management: Problems and Solutions","authors":"Pedro M. Domingos","doi":"10.1145/3183713.3199515","DOIUrl":"https://doi.org/10.1145/3183713.3199515","url":null,"abstract":"Machine learning has made great strides in recent years, and its applications are spreading rapidly. Unfortunately, the standard machine learning formulation does not match well with data management problems. For example, most learning algorithms assume that the data is contained in a single table, and consists of i.i.d. (independent and identically distributed) samples. This leads to a proliferation of ad hoc solutions, slow development, and suboptimal results. Fortunately, a body of machine learning theory and practice is being developed that dispenses with such assumptions, and promises to make machine learning for data management much easier and more effective [1]. In particular, representations like Markov logic, which includes many types of deep networks as special cases, allow us to define very rich probability distributions over non-i.i.d., multi-relational data [2]. Despite their generality, learning the parameters of these models is still a convex optimization problem, allowing for efficient solution. Learning structure-in the case of Markov logic, a set of formulas in first-order logic-is intractable, as in more traditional representations, but can be done effectively using inductive logic programming techniques. Inference is performed using probabilistic generalizations of theorem proving, and takes linear time and space in tractable Markov logic, an object-oriented specialization of Markov logic [3]. These techniques have led to state-of-the-art, principled solutions to problems like entity resolution, schema matching, ontology alignment, and information extraction. Using tractable Markov logic, we have extracted from the Web a probabilistic knowledge base with millions of objects and billions of parameters, which can be queried exactly in subsecond times using an RDBMS backend [3]. With these foundations in place, we expect the pace of machine learning applications in data management to continue to accelerate in coming years.","PeriodicalId":20430,"journal":{"name":"Proceedings of the 2018 International Conference on Management of Data","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2018-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84682798","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Given a single column of values, existing approaches typically employ regex-like rules to detect errors by finding anomalous values inconsistent with others. Such techniques make local decisions based only on values in the given input column, without considering a more global notion of compatibility that can be inferred from large corpora of clean tables. We propose Auto-Detect, a statistics-based technique that leverages co-occurrence statistics from large corpora for error detection, which is a significant departure from existing rule-based methods. Our approach can automatically detect incompatible values by leveraging an ensemble of judiciously selected generalization languages, each of which uses different generalizations and is sensitive to different types of errors. Errors so detected are based on global statistics, which is robust and aligns well with human intuition of errors. We test Auto-Detect on a large set of public Wikipedia tables, as well as proprietary enterprise Excel files. While both of these test sets are supposed to be of high quality, Auto-Detect makes the surprising discovery of tens of thousands of errors in both cases, which are manually verified to be of high precision (over 0.98). Our labeled benchmark set on Wikipedia tables is released for future research.
{"title":"Auto-Detect: Data-Driven Error Detection in Tables","authors":"Zhipeng Huang, Yeye He","doi":"10.1145/3183713.3196889","DOIUrl":"https://doi.org/10.1145/3183713.3196889","url":null,"abstract":"Given a single column of values, existing approaches typically employ regex-like rules to detect errors by finding anomalous values inconsistent with others. Such techniques make local decisions based only on values in the given input column, without considering a more global notion of compatibility that can be inferred from large corpora of clean tables. We propose sj, a statistics-based technique that leverages co-occurrence statistics from large corpora for error detection, which is a significant departure from existing rule-based methods. Our approach can automatically detect incompatible values, by leveraging an ensemble of judiciously selected generalization languages, each of which uses different generalizations and is sensitive to different types of errors. Errors so detected are based on global statistics, which is robust and aligns well with human intuition of errors. We test sj on a large set of public Wikipedia tables, as well as proprietary enterprise Excel files. While both of these test sets are supposed to be of high-quality, sj makes surprising discoveries of over tens of thousands of errors in both cases, which are manually verified to be of high precision (over 0.98). Our labeled benchmark set on Wikipedia tables is released for future research.","PeriodicalId":20430,"journal":{"name":"Proceedings of the 2018 International Conference on Management of Data","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2018-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85874798","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}