Michael Günther, Maik Thiele, Julius Gonsior, Wolfgang Lehner
Pre-trained word embedding models have become the de facto standard for modeling text in state-of-the-art analysis tools and frameworks. However, while massive amounts of textual data are stored in tables, word embedding models are usually pre-trained on large document collections. This mismatch can lead to reduced performance on tasks that analyze text values in tables. To improve analysis and retrieval tasks on tabular data, we propose a novel embedding technique that is pre-trained directly on a large Web table corpus. In an experimental evaluation, we employ our models for various data analysis tasks on different data sources. Our evaluation shows that models using pre-trained Web table embeddings outperform the same models using embeddings pre-trained on text. Moreover, we show that models using Web table embeddings outperform the state of the art on the investigated tasks.
{"title":"Pre-Trained Web Table Embeddings for Table Discovery","authors":"Michael Günther, Maik Thiele, Julius Gonsior, Wolfgang Lehner","doi":"10.1145/3464509.3464892","DOIUrl":"https://doi.org/10.1145/3464509.3464892","url":null,"abstract":"Pre-trained word embedding models have become the de-facto standard to model text in state-of-the-art analysis tools and frameworks. However, while there are massive amounts of textual data stored in tables, word embedding models are usually pre-trained on large documents. This mismatch can lead to narrowed performance on tasks where text values in tables are analyzed. To improve analysis and retrieval tasks working with tabular data, we propose a novel embedding technique to be pre-trained directly on a large Web table corpus. In an experimental evaluation, we employ our models for various data analysis tasks on different data sources. Our evaluation shows that models using pre-trained Web table embeddings outperform the same models when applied to embeddings pre-trained on text. Moreover, we show that by using Web table embeddings state-of-the-art models for the investigated tasks can be outperformed.","PeriodicalId":306522,"journal":{"name":"Fourth Workshop in Exploiting AI Techniques for Data Management","volume":"518 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133944727","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Aurélien Personnaz, S. Amer-Yahia, Laure Berti-Équille, M. Fabricius, S. Subramanian
The ability to find a set of records in Exploratory Data Analysis (EDA) hinges on the scattering of objects in the data set and on users’ knowledge of the data and their ability to express their needs. This yields a wide range of EDA scenarios and solutions that differ in the guidance they provide to users. In this paper, we investigate the interplay between modeling curiosity and familiarity in Deep Reinforcement Learning (DRL) and expressive data exploration operators. We formalize curiosity as an intrinsic reward and familiarity as an extrinsic reward. We examine the behavior of several policies learned with different weights for those rewards. Our experiments on SDSS, a very large sky survey data set, provide several insights and justify the need for a deeper examination of combining DRL with data exploration operators that go beyond drill-downs and roll-ups.
{"title":"Balancing Familiarity and Curiosity in Data Exploration with Deep Reinforcement Learning","authors":"Aurélien Personnaz, S. Amer-Yahia, Laure Berti-Équille, M. Fabricius, S. Subramanian","doi":"10.1145/3464509.3464884","DOIUrl":"https://doi.org/10.1145/3464509.3464884","url":null,"abstract":"The ability to find a set of records in Exploratory Data Analysis (EDA) hinges on the scattering of objects in the data set and the on users’ knowledge of data and their ability to express their needs. This yields a wide range of EDA scenarios and solutions that differ in the guidance they provide to users. In this paper, we investigate the interplay between modeling curiosity and familiarity in Deep Reinforcement Learning (DRL) and expressive data exploration operators. We formalize curiosity as intrinsic reward and familiarity as extrinsic reward. We examine the behavior of several policies learned for different weights for those rewards. Our experiments on SDSS, a very large sky survey data set1 provide several insights and justify the need for a deeper examination of combining DRL and data exploration operators that go beyond drill-downs and roll-ups.","PeriodicalId":306522,"journal":{"name":"Fourth Workshop in Exploiting AI Techniques for Data Management","volume":"10 9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126256952","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Error detection is key for data quality management. AI techniques can leverage user domain knowledge to identify sets of erroneous records that conflict with that knowledge. To represent a wide range of user domain knowledge, several recent papers have developed and utilized soft approximate constraints (ACs) that a data relation is expected to satisfy only to a certain degree, rather than completely. We introduce error localization, a new AI-based technique for enhancing error detection with ACs. Our starting observation is that approximate constraints are context-sensitive: the degree to which they are satisfied depends on the sub-population being considered. An error region is a subset of the data that violates an AC to a higher degree than the data as a whole, and is therefore more likely to contain erroneous records. For example, an error region may contain the set of records from before a certain year, or from a certain location. We describe an efficient optimization algorithm for error localization, identifying the distinct error regions that violate a given AC the most, based on a recursive tree partitioning scheme. The tree representation describes different error regions in terms of data attributes that are easily interpreted by users (e.g., all records before 2003). This helps to explain to the user why some records were identified as likely errors. After identifying error regions, we apply error detection methods to each error region separately, rather than to the dataset as a whole. Our empirical evaluation, based on four datasets containing both real-world and synthetic errors, shows that error localization increases both the accuracy and the speed of error detection based on ACs.
{"title":"Leveraging Approximate Constraints for Localized Data Error Detection","authors":"Mohan Zhang, O. Schulte, Yudong Luo","doi":"10.1145/3464509.3464888","DOIUrl":"https://doi.org/10.1145/3464509.3464888","url":null,"abstract":"Error detection is key for data quality management. AI techniques can leverage user domain knowledge to identifying sets of erroneous records that conflict with domain knowledge. To represent a wide range of user domain knowledge, several recent papers have developed and utilized soft approximate constraints (ACs) that a data relation is expected to satisfy only to a certain degree, rather than completely. We introduce error localization, a new AI-based technique for enhancing error detection with ACs. Our starting observation is that approximate constraints are context-sensitive: the degree to which they are satisfied depends on the sub-population being considered. An error region is a subset of the data that violates an AC to a higher degree than the data as a whole, and is therefore more likely to contain erroneous records. For example, an error region may contain the set of records from before a certain year, or from a certain location. We describe an efficient optimization algorithm for error localization: identifying distinct error regions that violate a given AC the most, based on a recursive tree partitioning scheme. The tree representation describes different error regions in terms of data attributes that are easily interpreted by users (e.g., all records before 2003). This helps to explain to the user why some records were identified as likely errors. After identifying error regions, we apply error detection methods to each error region separately, rather than to the dataset as a whole. Our empirical evaluation, based on four datasets containing both real world and synthetic errors, shows that error localization increases both accuracy and speed of error detection based on ACs.","PeriodicalId":306522,"journal":{"name":"Fourth Workshop in Exploiting AI Techniques for Data Management","volume":"55 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115598972","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Although linear regressions are essential for learned index structures, most implementations use Simple Linear Regression, which optimizes the squared error. Since learned indexes use exponential search, regressions that optimize the logarithmic error are much better tailored to the use case. By using this better-suited optimization target, we can significantly improve a learned index’s lookup performance with no architectural changes. While the log-error is harder to optimize, our novel algorithms and optimization heuristics deliver a practical improvement in lookup latency. Even in cases where fast build times are paramount, log-error regressions still provide a robust fallback for degenerate leaf models. The resulting regressions are much better suited for learned indexes and speed up lookups on data sets with outliers by over a factor of 2.
{"title":"A Tailored Regression for Learned Indexes: Logarithmic Error Regression","authors":"Martin Eppert, Philipp Fent, Thomas Neumann","doi":"10.1145/3464509.3464891","DOIUrl":"https://doi.org/10.1145/3464509.3464891","url":null,"abstract":"Although linear regressions are essential for learned index structures, most implementations use Simple Linear Regression, which optimizes the squared error. Since learned indexes use exponential search, regressions that optimize the logarithmic error are much better tailored for the use-case. By using this fitting optimization target, we can significantly improve learned index’s lookup performance with no architectural changes. While the log-error is harder to optimize, our novel algorithms and optimization heuristics can bring a practical performance improvement of the lookup latency. Even in cases where fast build times are paramount, log-error regressions still provide a robust fallback for degenerated leaf models. The resulting regressions are much better suited for learned indexes, and speed up lookups on data sets with outliers by over a factor of 2.","PeriodicalId":306522,"journal":{"name":"Fourth Workshop in Exploiting AI Techniques for Data Management","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129730166","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Machine learning algorithms have accelerated data access through the ‘learned index’, where a set of data items is indexed by a model learned on pairs of data keys and the corresponding records’ positions in memory. Most learned indexes require retraining of the model when new data is inserted into the data set. Retraining is expensive and takes as much time as the initial model training, so today learned indexes are updated by retraining on batched inserts to amortize the cost. However, real-time applications, such as data-driven recommendation applications, need to access a user feature store in real time, both to read data of existing users and to add new users. This motivates us to present a real-time updatable spline learned index, RUSLI, which learns the distribution of data keys and their positions in memory through splines. We extend RadixSpline [8] to build an updatable learned index that supports real-time inserts into a data set without affecting lookup time on the updated data set. We show that RUSLI can update the index in constant time using additional temporary memory proportional to the number of splines, and we discuss how to reduce the size of the index by using the distribution of spline keys while building the radix table. RUSLI incurs 270 ns for lookup and 50 ns for insert operations. Further, RUSLI supports concurrent lookup and insert operations with a throughput of 40 million ops/sec. We present and discuss performance numbers of RUSLI for single and concurrent inserts, lookups, and range queries on the SOSD [9] benchmark.
{"title":"RUSLI: Real-time Updatable Spline Learned Index","authors":"Mayank Mishra, Rekha Singhal","doi":"10.1145/3464509.3464886","DOIUrl":"https://doi.org/10.1145/3464509.3464886","url":null,"abstract":"Machine learning algorithms have accelerated data access through ‘learned index’, where a set of data items is indexed by a model learned on the pairs of data key and the corresponding record’s position in the memory. Most of the learned indexes require retraining of the model for new data insertions in the data set. The retraining is expensive and takes as much time as the model training. So, today, learned indexes are updated by retraining on batch inserts to amortize the cost. However, real-time applications, such as data-driven recommendation applications need to access users’ feature store in real-time both for reading data of existing users and adding new users as well. This motivates us to present a real-time updatable spline learned index, RUSLI, by learning the distribution of data keys with their positions in memory through splines. We have extended RadixSpline [8] to build the updatable learned index while supporting real-time inserts in a data set without affecting the lookup time on the updated data set. We have shown that RUSLI can update the index in constant time with an additional temporary memory of size proportional to the number of splines. We have discussed how to reduce the size of the presented index using the distribution of spline keys while building the radix table. RULSI is shown to incur 270ns for lookup and 50ns for insert operations. Further, we have shown that RUSLI supports concurrent lookup and insert operations with a throughput of 40 million ops/sec. We have presented and discussed performance numbers of RUSLI for single and concurrent inserts, lookup, and range queries on SOSD [9] benchmark.","PeriodicalId":306522,"journal":{"name":"Fourth Workshop in Exploiting AI Techniques for Data Management","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130839177","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Data warehouses organize data in a columnar format to enable faster scans and better compression. Modern systems offer a variety of column encodings that can reduce storage footprint and improve query performance. Selecting a good encoding scheme for a particular column is an optimization problem that depends on the data, the query workload, and the underlying hardware. We introduce Learned Encoding Advisor (LEA), a learned approach to column encoding selection. LEA is trained on the target system using synthetic datasets with various distributions. Once trained, LEA uses sample data and statistics (such as cardinality) from the user’s database to predict the optimal column encodings. LEA can optimize for encoded size, query performance, or a combination of the two. Compared to the heuristic-based encoding advisor of a commercial column store on TPC-H, LEA achieves 19% lower query latency while using 26% less space.
{"title":"LEA: A Learned Encoding Advisor for Column Stores","authors":"Lujing Cen, Andreas Kipf, Ryan Marcus, Tim Kraska","doi":"10.1145/3464509.3464885","DOIUrl":"https://doi.org/10.1145/3464509.3464885","url":null,"abstract":"Data warehouses organize data in a columnar format to enable faster scans and better compression. Modern systems offer a variety of column encodings that can reduce storage footprint and improve query performance. Selecting a good encoding scheme for a particular column is an optimization problem that depends on the data, the query workload, and the underlying hardware. We introduce Learned Encoding Advisor (LEA), a learned approach to column encoding selection. LEA is trained on synthetic datasets with various distributions on the target system. Once trained, LEA uses sample data and statistics (such as cardinality) from the user’s database to predict the optimal column encodings. LEA can optimize for encoded size, query performance, or a combination of the two. Compared to the heuristic-based encoding advisor of a commercial column store on TPC-H, LEA achieves 19% lower query latency while using 26% less space.","PeriodicalId":306522,"journal":{"name":"Fourth Workshop in Exploiting AI Techniques for Data Management","volume":"77 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-05-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124956874","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}