Michael Günther, Maik Thiele, Julius Gonsior, Wolfgang Lehner
Pre-trained word embedding models have become the de facto standard for modeling text in state-of-the-art analysis tools and frameworks. However, while massive amounts of textual data are stored in tables, word embedding models are usually pre-trained on large document collections. This mismatch can lead to reduced performance on tasks that analyze text values in tables. To improve analysis and retrieval tasks on tabular data, we propose a novel embedding technique that is pre-trained directly on a large Web table corpus. In an experimental evaluation, we employ our models for various data analysis tasks on different data sources. Our evaluation shows that models using pre-trained Web table embeddings outperform the same models using embeddings pre-trained on text. Moreover, we show that models using Web table embeddings outperform the state of the art on the investigated tasks.
{"title":"Pre-Trained Web Table Embeddings for Table Discovery","authors":"Michael Günther, Maik Thiele, Julius Gonsior, Wolfgang Lehner","doi":"10.1145/3464509.3464892","DOIUrl":"https://doi.org/10.1145/3464509.3464892","url":null,"abstract":"Pre-trained word embedding models have become the de-facto standard to model text in state-of-the-art analysis tools and frameworks. However, while there are massive amounts of textual data stored in tables, word embedding models are usually pre-trained on large documents. This mismatch can lead to narrowed performance on tasks where text values in tables are analyzed. To improve analysis and retrieval tasks working with tabular data, we propose a novel embedding technique to be pre-trained directly on a large Web table corpus. In an experimental evaluation, we employ our models for various data analysis tasks on different data sources. Our evaluation shows that models using pre-trained Web table embeddings outperform the same models when applied to embeddings pre-trained on text. Moreover, we show that by using Web table embeddings state-of-the-art models for the investigated tasks can be outperformed.","PeriodicalId":306522,"journal":{"name":"Fourth Workshop in Exploiting AI Techniques for Data Management","volume":"518 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133944727","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Aurélien Personnaz, S. Amer-Yahia, Laure Berti-Équille, M. Fabricius, S. Subramanian
The ability to find a set of records in Exploratory Data Analysis (EDA) hinges on the scattering of objects in the data set and on users’ knowledge of the data and their ability to express their needs. This yields a wide range of EDA scenarios and solutions that differ in the guidance they provide to users. In this paper, we investigate the interplay between modeling curiosity and familiarity in Deep Reinforcement Learning (DRL) and expressive data exploration operators. We formalize curiosity as an intrinsic reward and familiarity as an extrinsic reward. We examine the behavior of several policies learned with different weights for those rewards. Our experiments on SDSS, a very large sky survey data set, provide several insights and justify the need for a deeper examination of combining DRL with data exploration operators that go beyond drill-downs and roll-ups.
{"title":"Balancing Familiarity and Curiosity in Data Exploration with Deep Reinforcement Learning","authors":"Aurélien Personnaz, S. Amer-Yahia, Laure Berti-Équille, M. Fabricius, S. Subramanian","doi":"10.1145/3464509.3464884","DOIUrl":"https://doi.org/10.1145/3464509.3464884","url":null,"abstract":"The ability to find a set of records in Exploratory Data Analysis (EDA) hinges on the scattering of objects in the data set and the on users’ knowledge of data and their ability to express their needs. This yields a wide range of EDA scenarios and solutions that differ in the guidance they provide to users. In this paper, we investigate the interplay between modeling curiosity and familiarity in Deep Reinforcement Learning (DRL) and expressive data exploration operators. We formalize curiosity as intrinsic reward and familiarity as extrinsic reward. We examine the behavior of several policies learned for different weights for those rewards. Our experiments on SDSS, a very large sky survey data set1 provide several insights and justify the need for a deeper examination of combining DRL and data exploration operators that go beyond drill-downs and roll-ups.","PeriodicalId":306522,"journal":{"name":"Fourth Workshop in Exploiting AI Techniques for Data Management","volume":"10 9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126256952","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Error detection is key for data quality management. AI techniques can leverage user domain knowledge to identify sets of erroneous records that conflict with that knowledge. To represent a wide range of user domain knowledge, several recent papers have developed and utilized soft approximate constraints (ACs) that a data relation is expected to satisfy only to a certain degree, rather than completely. We introduce error localization, a new AI-based technique for enhancing error detection with ACs. Our starting observation is that approximate constraints are context-sensitive: the degree to which they are satisfied depends on the sub-population being considered. An error region is a subset of the data that violates an AC to a higher degree than the data as a whole, and is therefore more likely to contain erroneous records. For example, an error region may contain the set of records from before a certain year, or from a certain location. We describe an efficient optimization algorithm for error localization, identifying the distinct error regions that violate a given AC the most, based on a recursive tree partitioning scheme. The tree representation describes different error regions in terms of data attributes that are easily interpreted by users (e.g., all records before 2003). This helps to explain to the user why some records were identified as likely errors. After identifying error regions, we apply error detection methods to each error region separately, rather than to the dataset as a whole. Our empirical evaluation, based on four datasets containing both real-world and synthetic errors, shows that error localization increases both the accuracy and the speed of error detection based on ACs.
{"title":"Leveraging Approximate Constraints for Localized Data Error Detection","authors":"Mohan Zhang, O. Schulte, Yudong Luo","doi":"10.1145/3464509.3464888","DOIUrl":"https://doi.org/10.1145/3464509.3464888","url":null,"abstract":"Error detection is key for data quality management. AI techniques can leverage user domain knowledge to identifying sets of erroneous records that conflict with domain knowledge. To represent a wide range of user domain knowledge, several recent papers have developed and utilized soft approximate constraints (ACs) that a data relation is expected to satisfy only to a certain degree, rather than completely. We introduce error localization, a new AI-based technique for enhancing error detection with ACs. Our starting observation is that approximate constraints are context-sensitive: the degree to which they are satisfied depends on the sub-population being considered. An error region is a subset of the data that violates an AC to a higher degree than the data as a whole, and is therefore more likely to contain erroneous records. For example, an error region may contain the set of records from before a certain year, or from a certain location. We describe an efficient optimization algorithm for error localization: identifying distinct error regions that violate a given AC the most, based on a recursive tree partitioning scheme. The tree representation describes different error regions in terms of data attributes that are easily interpreted by users (e.g., all records before 2003). This helps to explain to the user why some records were identified as likely errors. After identifying error regions, we apply error detection methods to each error region separately, rather than to the dataset as a whole. Our empirical evaluation, based on four datasets containing both real world and synthetic errors, shows that error localization increases both accuracy and speed of error detection based on ACs.","PeriodicalId":306522,"journal":{"name":"Fourth Workshop in Exploiting AI Techniques for Data Management","volume":"55 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115598972","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Although linear regressions are essential for learned index structures, most implementations use Simple Linear Regression, which optimizes the squared error. Since learned indexes use exponential search, regressions that optimize the logarithmic error are much better tailored to the use case. By using this better-suited optimization target, we can significantly improve a learned index’s lookup performance with no architectural changes. While the log-error is harder to optimize, our novel algorithms and optimization heuristics deliver a practical improvement in lookup latency. Even in cases where fast build times are paramount, log-error regressions still provide a robust fallback for degenerate leaf models. The resulting regressions are much better suited for learned indexes and speed up lookups on data sets with outliers by over a factor of 2.
{"title":"A Tailored Regression for Learned Indexes: Logarithmic Error Regression","authors":"Martin Eppert, Philipp Fent, Thomas Neumann","doi":"10.1145/3464509.3464891","DOIUrl":"https://doi.org/10.1145/3464509.3464891","url":null,"abstract":"Although linear regressions are essential for learned index structures, most implementations use Simple Linear Regression, which optimizes the squared error. Since learned indexes use exponential search, regressions that optimize the logarithmic error are much better tailored for the use-case. By using this fitting optimization target, we can significantly improve learned index’s lookup performance with no architectural changes. While the log-error is harder to optimize, our novel algorithms and optimization heuristics can bring a practical performance improvement of the lookup latency. Even in cases where fast build times are paramount, log-error regressions still provide a robust fallback for degenerated leaf models. The resulting regressions are much better suited for learned indexes, and speed up lookups on data sets with outliers by over a factor of 2.","PeriodicalId":306522,"journal":{"name":"Fourth Workshop in Exploiting AI Techniques for Data Management","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129730166","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Machine learning algorithms have accelerated data access through the ‘learned index’, where a set of data items is indexed by a model learned on pairs of data keys and the corresponding records’ positions in memory. Most learned indexes require retraining of the model when new data is inserted into the data set. Retraining is expensive and takes as much time as the initial model training, so today learned indexes are updated by retraining on batched inserts to amortize the cost. However, real-time applications, such as data-driven recommendation applications, need to access a user feature store in real time, both to read data of existing users and to add new users. This motivates us to present a real-time updatable spline learned index, RUSLI, which learns the distribution of data keys and their positions in memory through splines. We extend RadixSpline [8] to build an updatable learned index that supports real-time inserts into a data set without affecting lookup time on the updated data set. We show that RUSLI can update the index in constant time using additional temporary memory proportional to the number of splines, and we discuss how to reduce the size of the index by using the distribution of spline keys while building the radix table. RUSLI incurs 270 ns for lookup and 50 ns for insert operations. Further, RUSLI supports concurrent lookup and insert operations with a throughput of 40 million ops/sec. We present and discuss performance numbers of RUSLI for single and concurrent inserts, lookups, and range queries on the SOSD [9] benchmark.
{"title":"RUSLI: Real-time Updatable Spline Learned Index","authors":"Mayank Mishra, Rekha Singhal","doi":"10.1145/3464509.3464886","DOIUrl":"https://doi.org/10.1145/3464509.3464886","url":null,"abstract":"Machine learning algorithms have accelerated data access through ‘learned index’, where a set of data items is indexed by a model learned on the pairs of data key and the corresponding record’s position in the memory. Most of the learned indexes require retraining of the model for new data insertions in the data set. The retraining is expensive and takes as much time as the model training. So, today, learned indexes are updated by retraining on batch inserts to amortize the cost. However, real-time applications, such as data-driven recommendation applications need to access users’ feature store in real-time both for reading data of existing users and adding new users as well. This motivates us to present a real-time updatable spline learned index, RUSLI, by learning the distribution of data keys with their positions in memory through splines. We have extended RadixSpline [8] to build the updatable learned index while supporting real-time inserts in a data set without affecting the lookup time on the updated data set. We have shown that RUSLI can update the index in constant time with an additional temporary memory of size proportional to the number of splines. We have discussed how to reduce the size of the presented index using the distribution of spline keys while building the radix table. RULSI is shown to incur 270ns for lookup and 50ns for insert operations. Further, we have shown that RUSLI supports concurrent lookup and insert operations with a throughput of 40 million ops/sec. We have presented and discussed performance numbers of RUSLI for single and concurrent inserts, lookup, and range queries on SOSD [9] benchmark.","PeriodicalId":306522,"journal":{"name":"Fourth Workshop in Exploiting AI Techniques for Data Management","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130839177","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Data warehouses organize data in a columnar format to enable faster scans and better compression. Modern systems offer a variety of column encodings that can reduce storage footprint and improve query performance. Selecting a good encoding scheme for a particular column is an optimization problem that depends on the data, the query workload, and the underlying hardware. We introduce Learned Encoding Advisor (LEA), a learned approach to column encoding selection. LEA is trained on the target system using synthetic datasets with various distributions. Once trained, LEA uses sample data and statistics (such as cardinality) from the user’s database to predict the optimal column encodings. LEA can optimize for encoded size, query performance, or a combination of the two. Compared to the heuristic-based encoding advisor of a commercial column store on TPC-H, LEA achieves 19% lower query latency while using 26% less space.
{"title":"LEA: A Learned Encoding Advisor for Column Stores","authors":"Lujing Cen, Andreas Kipf, Ryan Marcus, Tim Kraska","doi":"10.1145/3464509.3464885","DOIUrl":"https://doi.org/10.1145/3464509.3464885","url":null,"abstract":"Data warehouses organize data in a columnar format to enable faster scans and better compression. Modern systems offer a variety of column encodings that can reduce storage footprint and improve query performance. Selecting a good encoding scheme for a particular column is an optimization problem that depends on the data, the query workload, and the underlying hardware. We introduce Learned Encoding Advisor (LEA), a learned approach to column encoding selection. LEA is trained on synthetic datasets with various distributions on the target system. Once trained, LEA uses sample data and statistics (such as cardinality) from the user’s database to predict the optimal column encodings. LEA can optimize for encoded size, query performance, or a combination of the two. Compared to the heuristic-based encoding advisor of a commercial column store on TPC-H, LEA achieves 19% lower query latency while using 26% less space.","PeriodicalId":306522,"journal":{"name":"Fourth Workshop in Exploiting AI Techniques for Data Management","volume":"77 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-05-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124956874","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}