Cleaning Data with Constraints and Experts
A. Assadi, T. Milo, Slava Novgorodov (DOI: 10.1145/3201463.3201464)

Popular techniques for data cleaning use integrity constraints to identify errors in the data and to resolve them automatically, e.g., by using predefined priorities among possible updates and finding a minimal repair that resolves the violations. However, such automatic solutions cannot guarantee the precision of the repairs, since they lack evidence about the actual errors and may in fact produce results that are wrong with respect to the ground truth. It has thus been suggested to let domain experts examine the potential updates and choose which should be applied to the database. However, the sheer volume of the databases and the large number of possible updates that may resolve a given constraint violation can make such a manual examination prohibitively expensive. The goal of the DANCE system presented here is to optimize the experts' work and reduce as much as possible the number of questions (update verifications) they need to address. Given a constraint violation, our algorithm identifies the suspicious tuples whose update may contribute (directly or indirectly) to resolving the constraint, as well as the possible dependencies among them. Using this information, it builds a graph whose nodes are the suspicious tuples and whose weighted edges capture the likelihood that an error in one tuple occurs and affects the other. A PageRank-style algorithm then allows us to identify the most beneficial tuples to ask about first. Incremental graph maintenance is used to ensure interactive response times. We implemented our solution in the DANCE system and show its effectiveness and efficiency through a comprehensive suite of experiments.
Searching for Truth in a Database of Statistics
Tien-Duc Cao, I. Manolescu, Xavier Tannier (DOI: 10.1145/3201463.3201467)

The proliferation of falsehood and misinformation, in particular through the Web, has led to increasing energy being invested in journalistic fact-checking. Fact-checking journalists typically check the accuracy of a claim against some trusted data source. Statistical databases such as those compiled by state agencies are often used as trusted data sources, as they contain valuable, high-quality information. However, their usability is limited when they are shared in a format such as HTML or spreadsheets: this makes it hard to find the dataset most relevant for checking a specific claim, or to quickly extract from a dataset the best answer to a given query. We present a novel algorithm enabling the exploitation of such statistical tables by (i) identifying the statistical datasets most relevant for a given fact-checking query, and (ii) extracting from each dataset the best specific (precise) answer it may contain. We have implemented our approach and experimented on the complete corpus of statistics obtained from INSEE, the French national statistics institute. Our experiments and comparisons demonstrate the effectiveness of our proposed method.
{"title":"Proceedings of the 21st International Workshop on the Web and Databases","authors":"","doi":"10.1145/3201463","DOIUrl":"https://doi.org/10.1145/3201463","url":null,"abstract":"","PeriodicalId":365496,"journal":{"name":"Proceedings of the 21st International Workshop on the Web and Databases","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-06-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131222262","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Leveraging Wikipedia Table Schemas for Knowledge Graph Augmentation
Matteo Cannaviccio, Lorenzo Ariemma, Denilson Barbosa, P. Merialdo (DOI: 10.1145/3201463.3201468)

General solutions for augmenting Knowledge Graphs (KGs) with facts extracted from Web tables aim to associate pairs of columns from a table with a KG relation, based on matches between pairs of entities in the table and facts in the KG. These approaches suffer from intrinsic limitations due to the incompleteness of the KGs. In this paper we investigate an alternative solution, which leverages the patterns that occur in the schemas of a large corpus of Wikipedia tables. Our experimental evaluation, which used DBpedia as the reference KG, demonstrates the advantages of our approach over state-of-the-art solutions and shows that we can extract more than 1.7M facts with an estimated accuracy of 0.81, even from tables that do not expose any fact already in the KG.
DataVizard
Rema Ananthanarayanan, P. Lohia, Srikanta J. Bedathur (DOI: 10.1145/3201463.3201465)

Selecting a visual presentation of the data that not only preserves its semantics but also provides an intuitive summary is an important, and often the final, step of data analytics. Unfortunately, this step also involves significant human effort, from selecting groups of columns in the structured results of the analytics stages to choosing the right visualization by experimenting with various alternatives. In this paper, we describe our DataVizard system, aimed at reducing this overhead by automatically recommending the most appropriate visual presentation for a structured result. Specifically, we consider the following two scenarios: first, when one needs to visualize the result of a structured query such as SQL; and second, when one has acquired a data table with an associated short description (e.g., tables from the Web). Using a corpus of real-world database queries (and their results) and a number of statistical tables crawled from the Web, we show that DataVizard is capable of recommending visual presentations with high accuracy.
Processing Class-Constraint K-NN Queries with MISP
Evica Milchevski, Fabian Neffgen, S. Michel (DOI: 10.1145/3201463.3201466)

In this work, we consider processing k-nearest-neighbor (k-NN) queries with the additional requirement that the result objects are of a specific type. To solve this problem, we propose an approach based on a combination of an inverted index and a state-of-the-art similarity-search index structure, which prunes the search space early on. Furthermore, we provide a cost model and an extensive experimental study that analyzes the performance of the proposed index structure under different configurations, with the aim of finding the most efficient one for the dataset being searched.