Learned Indexes with Distribution Smoothing via Virtual Points

arXiv - CS - Databases Pub Date : 2024-08-12 DOI:arxiv-2408.06134

Kasun Amarasinghe, Farhana Choudhury, Jianzhong Qi, James Bailey



Abstract

Recent research on learned indexes has recast indexes as models that map keys to their storage locations. A learned index approximates the cumulative distribution function (CDF) of the key set, and a single model may have limited accuracy. To overcome this limitation, a typical method is to arrange multiple models in a hierarchy, where query performance depends on two factors: (i) the traversal time to find the correct model and (ii) the search time to find the key within the selected model. With this method, key space regions that are difficult to model may be placed at deeper levels of the hierarchy. To address this issue, we propose an alternative method that modifies the key space rather than the index structure or the models. We make the key set more learnable (i.e., smooth its distribution) by inserting virtual points. Further, we develop an algorithm named CSV to integrate our virtual point insertion method into existing learned indexes, reducing both their traversal and search time. We implement CSV on state-of-the-art learned indexes and evaluate them on real-world datasets. The extensive experimental results show significant query performance improvement for keys in deeper levels of the index structures at a low storage cost.
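
The idea of smoothing a key distribution with virtual points can be illustrated with a toy example. The sketch below is not the paper's CSV algorithm, whose details the abstract does not give. It assumes a single least-squares linear model as the learned index and a hypothetical gap-filling heuristic (`fill_gaps`) that inserts evenly spaced virtual keys into wide gaps. All function names and the sample key set are illustrative assumptions.

```python
from bisect import bisect_left
from math import ceil


def fit_linear(keys):
    """Least-squares line mapping each key to its position in the sorted array."""
    n = len(keys)
    mean_x = sum(keys) / n
    mean_y = (n - 1) / 2
    var = sum((x - mean_x) ** 2 for x in keys)
    cov = sum((x - mean_x) * (i - mean_y) for i, x in enumerate(keys))
    slope = cov / var if var else 0.0
    return slope, mean_y - slope * mean_x


def max_error(keys, model):
    """Largest distance between a predicted and an actual position."""
    slope, intercept = model
    return max(abs(slope * x + intercept - i) for i, x in enumerate(keys))


def fill_gaps(keys, step):
    """Hypothetical heuristic (an assumption, not CSV): insert a virtual key
    every `step` inside any gap wider than `step`.
    Returns a sorted list of (key, is_virtual) pairs."""
    entries = []
    for prev, cur in zip(keys, keys[1:]):
        entries.append((prev, False))
        v = prev + step
        while v < cur:
            entries.append((v, True))
            v += step
    entries.append((keys[-1], False))
    return entries


def lookup(keys, virtual, model, err, key):
    """Predict a position, then binary-search only the error-bounded window.
    Virtual slots are never reported as matches."""
    slope, intercept = model
    guess = round(slope * key + intercept)
    bound = ceil(err) + 1  # +1 absorbs the rounding of the prediction
    lo = min(len(keys), max(0, guess - bound))
    hi = max(lo, min(len(keys), guess + bound + 1))
    i = bisect_left(keys, key, lo, hi)
    if i < hi and keys[i] == key and not virtual[i]:
        return i
    return None


# Two dense runs separated by an empty region: hard for one linear model.
real = list(range(0, 50)) + list(range(100, 150))
model_before = fit_linear(real)
err_before = max_error(real, model_before)

# Smooth the distribution by filling the gap with virtual keys.
entries = fill_gaps(real, step=1)
keys = [k for k, _ in entries]
virtual = [v for _, v in entries]
model_after = fit_linear(keys)
err_after = max_error(keys, model_after)

print(f"max error before: {err_before:.2f}, after: {err_after:.2f}")
print(f"virtual points added: {sum(virtual)}")
print(lookup(keys, virtual, model_after, err_after, 120))  # a real key
print(lookup(keys, virtual, model_after, err_after, 75))   # a virtual slot
```

In this toy case the smoothed key set is perfectly linear, so the search window shrinks to a few slots, at the price of 50 extra entries. That trade-off between search cost and storage is the one the abstract refers to. How a practical method chooses where and how many virtual points to insert is not something this sketch attempts to answer.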


