Learned Indexes with Distribution Smoothing via Virtual Points

Kasun Amarasinghe, Farhana Choudhury, Jianzhong Qi, James Bailey
arXiv - CS - Databases. Published 2024-08-12. DOI: arxiv-2408.06134 (https://doi.org/arxiv-2408.06134)

Abstract

Recent research on learned indexes has introduced a new perspective on indexes as models that map keys to their storage locations. A learned index approximates the cumulative distribution function of the key set, and a single model often cannot do so accurately. A typical remedy is to arrange multiple models in a hierarchy, where query performance depends on two aspects: (i) traversal time to find the correct model and (ii) search time to find the key within the selected model. Such a hierarchy may push key space regions that are difficult to model to deeper levels. To address this issue, we propose an alternative method that modifies the key space rather than the structure or the models. The method makes the key set more learnable (i.e., smooths the distribution) by inserting virtual points. Further, we develop an algorithm named CSV to integrate our virtual point insertion method into existing learned indexes, reducing both their traversal and search time. We implement CSV on state-of-the-art learned indexes and evaluate them on real-world datasets. The extensive experimental results show significant query performance improvement for keys in deeper levels of the index structures at a low storage cost.
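
The idea in the abstract can be illustrated with a minimal sketch. It is not the CSV algorithm, whose details the abstract does not give. It assumes a single least-squares linear model over the sorted keys and a naive gap-filling rule for choosing virtual points. The names `fit_linear`, `max_error`, `lookup`, and `fill_gaps`, and the toy key set, are all hypothetical.

```python
import math
from bisect import bisect_left

def fit_linear(keys):
    """Least-squares line mapping a key to its position in the sorted array."""
    n = len(keys)
    mean_x = sum(keys) / n
    mean_y = (n - 1) / 2
    var_x = sum((x - mean_x) ** 2 for x in keys)
    cov = sum((x - mean_x) * (i - mean_y) for i, x in enumerate(keys))
    slope = cov / var_x if var_x else 0.0
    return slope, mean_y - slope * mean_x

def max_error(keys, model):
    """Largest gap between the predicted and the true position."""
    slope, intercept = model
    return max(abs(slope * x + intercept - i) for i, x in enumerate(keys))

def lookup(keys, model, key):
    """Predict a position, then binary-search only within the error window."""
    slope, intercept = model
    err = math.ceil(max_error(keys, model)) + 1  # +1 absorbs rounding of the guess
    guess = int(round(slope * key + intercept))
    lo = max(0, guess - err)
    hi = min(len(keys), guess + err + 1)
    i = bisect_left(keys, key, lo, hi)
    return i if i < len(keys) and keys[i] == key else -1

def fill_gaps(keys, step):
    """Assumed heuristic (not CSV): add a virtual key every `step` inside wide gaps."""
    out = []
    for prev, nxt in zip(keys, keys[1:]):
        out.append(prev)
        v = prev + step
        while v < nxt:
            out.append(v)  # virtual point: occupies a slot but holds no record
            v += step
    out.append(keys[-1])
    return out

# Toy key set: two dense clusters separated by a wide gap.
keys = list(range(10)) + list(range(100, 110))
model = fit_linear(keys)
raw_error = max_error(keys, model)

# Filling the gap makes the key-to-position mapping linear.
smoothed = fill_gaps(keys, step=1)
smoothed_error = max_error(smoothed, fit_linear(smoothed))

print(f"max error without virtual points: {raw_error:.2f}")
print(f"max error with virtual points:    {smoothed_error:.2f}")
print(f"virtual points added:             {len(smoothed) - len(keys)}")
```

In this toy case the search window shrinks because the smoothed key set is easier for one linear model to fit, at the price of extra slots for the virtual points. A practical method would have to choose far fewer virtual points than this naive rule does, which is the storage trade-off the abstract refers to.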