
Latest Publications in arXiv - CS - Databases

Learned Indexes with Distribution Smoothing via Virtual Points
Pub Date : 2024-08-12 DOI: arxiv-2408.06134
Kasun Amarasinghe, Farhana Choudhury, Jianzhong Qi, James Bailey
Recent research on learned indexes has created a new perspective for indexes as models that map keys to their respective storage locations. These learned indexes are created to approximate the cumulative distribution function of the key set, where using only a single model may have limited accuracy. To overcome this limitation, a typical method is to use multiple models, arranged in a hierarchical manner, where the query performance depends on two aspects: (i) traversal time to find the correct model and (ii) search time to find the key in the selected model. Such a method may cause some key space regions that are difficult to model to be placed at deeper levels in the hierarchy. To address this issue, we propose an alternative method that modifies the key space as opposed to any structural or model modifications. This is achieved through making the key set more learnable (i.e., smoothing the distribution) by inserting virtual points. Further, we develop an algorithm named CSV to integrate our virtual point insertion method into existing learned indexes, reducing both their traversal and search time. We implement CSV on state-of-the-art learned indexes and evaluate them on real-world datasets. The extensive experimental results show significant query performance improvement for the keys in deeper levels of the index structures at a low storage cost.
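The single-model view described in the abstract (a model that approximates the key set's CDF and is corrected by a bounded local search) can be sketched as follows. This is a minimal toy illustration, not the paper's CSV algorithm; all names are hypothetical:

```python
import bisect

class LinearLearnedIndex:
    """Toy single-model learned index: approximate the CDF of a sorted key
    set with a least-squares line mapping keys to ranks, then correct the
    prediction with a bounded local search. Illustrative only."""

    def __init__(self, keys):
        self.keys = sorted(keys)
        n = len(self.keys)
        # Fit position ~ a * key + b by least squares (positions are ranks).
        mean_k = sum(self.keys) / n
        mean_p = (n - 1) / 2
        cov = sum((k - mean_k) * (i - mean_p) for i, k in enumerate(self.keys))
        var = sum((k - mean_k) ** 2 for k in self.keys)
        self.a = cov / var if var else 0.0
        self.b = mean_p - self.a * mean_k
        # The maximum prediction error bounds the local search window.
        self.err = max(abs(self._predict(k) - i) for i, k in enumerate(self.keys))

    def _predict(self, key):
        return int(round(self.a * key + self.b))

    def lookup(self, key):
        p = self._predict(key)
        lo = max(0, p - self.err)
        hi = min(len(self.keys), p + self.err + 1)
        j = bisect.bisect_left(self.keys, key, lo, hi)
        return j if j < len(self.keys) and self.keys[j] == key else None

idx = LinearLearnedIndex([2, 3, 5, 7, 11, 13, 17, 19])
print(idx.lookup(11))  # 4: the rank of key 11 in the sorted set
```

A hard-to-learn key distribution inflates `err` and hence the search window, which is exactly the effect the paper's virtual-point smoothing aims to reduce.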
Exploiting Formal Concept Analysis for Data Modeling in Data Lakes
Pub Date : 2024-08-11 DOI: arxiv-2408.13265
Anes Bendimerad, Romain Mathonat, Youcef Remil, Mehdi Kaytoue
Data lakes are widely used to store extensive and heterogeneous datasets for advanced analytics. However, the unstructured nature of data in these repositories introduces complexities in exploiting them and extracting meaningful insights. This motivates the need to explore efficient approaches for consolidating data lakes and deriving a common and unified schema. This paper introduces a practical data visualization and analysis approach rooted in Formal Concept Analysis (FCA) to systematically clean, organize, and design data structures within a data lake. We explore diverse data structures stored in our data lake at Infologic, including InfluxDB measurements and Elasticsearch indexes, aiming to derive conventions for a more accessible data model. Leveraging FCA, we represent data structures as objects, analyze the concept lattice, and present two strategies, top-down and bottom-up, to unify these structures and establish a common schema. Our methodology yields significant results, enabling the identification of common concepts in the data structures, such as resources along with their underlying shared fields (timestamp, type, usedRatio, etc.). Moreover, the number of distinct data structure field names is reduced by 54 percent (from 190 to 88) in the studied subset of our data lake. We achieve complete coverage of 80 percent of data structures with only 34 distinct field names, a significant improvement from the initial 121 field names that were needed to reach such coverage. The paper provides insights into the Infologic ecosystem, problem formulation, and exploration strategies, and presents both qualitative and quantitative results.
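The FCA machinery the paper builds on (objects, shared-attribute intents, and the concept lattice) can be illustrated with a brute-force concept enumeration over a tiny object-to-attributes context. The context below is invented for illustration (it is not Infologic's data), and this sketch only enumerates concepts rather than building the full lattice order:

```python
from itertools import combinations

def formal_concepts(context):
    """Enumerate the formal concepts (extent, intent) of a small
    object -> attribute-set context by brute-force closure."""
    objects = list(context)
    attrs = set().union(*context.values())

    def intent(objs):  # attributes shared by every object in objs
        return set(attrs) if not objs else set.intersection(*(context[o] for o in objs))

    def extent(ats):   # objects possessing every attribute in ats
        return {o for o in objects if ats <= context[o]}

    concepts = set()
    for r in range(len(objects) + 1):
        for objs in combinations(objects, r):
            i = intent(set(objs))
            e = extent(i)          # closure: extent of the shared intent
            concepts.add((frozenset(e), frozenset(i)))
    return concepts

# Hypothetical mini-context: data-structure types and their field names.
ctx = {
    "influx_cpu": {"timestamp", "type", "usedRatio"},
    "influx_mem": {"timestamp", "type", "usedRatio"},
    "es_log":     {"timestamp", "type", "message"},
}
for e, i in sorted(formal_concepts(ctx), key=lambda c: len(c[0])):
    print(sorted(e), "->", sorted(i))
```

The concept whose extent covers all three structures exposes their shared fields (here `timestamp` and `type`), which is the kind of commonality the paper uses to derive a unified schema.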
Memento Filter: A Fast, Dynamic, and Robust Range Filter
Pub Date : 2024-08-10 DOI: arxiv-2408.05625
Navid Eslami, Niv Dayan
Range filters are probabilistic data structures that answer approximate range emptiness queries. They aid in avoiding the processing of empty range queries and have use cases in many application domains, such as key-value stores and social web analytics. However, current range filter designs do not support dynamically changing and growing datasets. Moreover, several of these designs also exhibit impractically high false positive rates under correlated workloads, which are common in practice. These impediments restrict the applicability of range filters across a wide range of use cases.

We introduce Memento filter, the first range filter to offer dynamicity, fast operations, and a robust false positive rate guarantee for any workload. Memento filter partitions the key universe and clusters its keys according to this partitioning. For each cluster, it stores a fingerprint and a list of key suffixes contiguously. The encoding of these lists makes them amenable to existing dynamic filter structures. Due to the well-defined one-to-one mapping from keys to suffixes, Memento filter supports inserts and deletes and can even expand to accommodate a growing dataset.

We implement Memento filter on top of a Rank-and-Select Quotient filter and InfiniFilter and demonstrate that it achieves false positive rates and performance competitive with the state of the art while also providing dynamicity. Due to its dynamicity, Memento filter is the first range filter applicable to B-Trees. We showcase this by integrating Memento filter into WiredTiger, a B-Tree-based key-value store. Memento filter doubles WiredTiger's range query throughput when 50% of the queries are empty while keeping all other cost metrics unharmed.
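A minimal sketch of the cluster layout described in the abstract (keys partitioned by a high-order prefix, with per-cluster suffix lists) is shown below. It is only loosely inspired by that description and is not the paper's implementation: the real filter stores hashed fingerprints and is probabilistic, whereas this toy stores suffixes exactly, so it has no false positives. All names and parameters are hypothetical:

```python
import bisect

class ToyRangeFilter:
    """Toy, exact range-emptiness structure: keys are clustered by their
    high-order prefix, and each cluster keeps its sorted low-order
    suffixes. A range query probes only the clusters its endpoints can
    fall into."""

    def __init__(self, suffix_bits=8):
        self.suffix_bits = suffix_bits
        self.clusters = {}  # prefix -> sorted list of suffixes

    def insert(self, key):
        p = key >> self.suffix_bits
        s = key & ((1 << self.suffix_bits) - 1)
        bisect.insort(self.clusters.setdefault(p, []), s)

    def may_contain_range(self, lo, hi):
        """True iff some stored key lies in the inclusive range [lo, hi]."""
        for p in range(lo >> self.suffix_bits, (hi >> self.suffix_bits) + 1):
            for s in self.clusters.get(p, ()):
                if lo <= (p << self.suffix_bits) | s <= hi:
                    return True
        return False

f = ToyRangeFilter()
for k in (300, 301, 999):
    f.insert(k)
print(f.may_contain_range(310, 400))  # False: the range [310, 400] holds no key
```

Because a query only touches clusters overlapping the range, an empty range can often be rejected without consulting the underlying store, which is the point of a range filter.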
Context-Driven Index Trimming: A Data Quality Perspective to Enhancing Precision of RALMs
Pub Date : 2024-08-10 DOI: arxiv-2408.05524
Kexin Ma, Ruochun Jin, Xi Wang, Huan Chen, Jing Ren, Yuhua Tang
Retrieval-Augmented Large Language Models (RALMs) have made significant strides in enhancing the accuracy of generated responses. However, existing research often overlooks data quality issues within retrieval results, which are often caused by inaccurate vector-distance-based retrieval methods. We propose to boost the precision of RALMs' answers from a data quality perspective through the Context-Driven Index Trimming (CDIT) framework, where Context Matching Dependencies (CMDs) are employed as logical data quality rules to capture and regulate the consistency between retrieved contexts. Based on the semantic comprehension capabilities of Large Language Models (LLMs), CDIT can effectively identify and discard retrieval results that are inconsistent with the query context and further modify indexes in the database, thereby improving answer quality. Experiments on challenging question-answering tasks demonstrate the effectiveness of the framework. The flexibility of CDIT is also verified through its compatibility with various language models and indexing methods, offering a promising approach to jointly bolster RALMs' data quality and retrieval precision.
Simpler is More: Efficient Top-K Nearest Neighbors Search on Large Road Networks
Pub Date : 2024-08-10 DOI: arxiv-2408.05432
Yiqi Wang, Long Yuan, Wenjie Zhang, Xuemin Lin, Zi Chen, Qing Liu
The top-k nearest neighbors (kNN) problem on road networks has numerous applications in location-based services. As direct search using Dijkstra's algorithm results in a large search space, a plethora of complex-index-based approaches have been proposed to speed up query processing. However, even with the current state-of-the-art approach, long query processing delays persist, along with significant space overhead and prohibitively long indexing time. In this paper, we depart from the complex index designs prevalent in existing literature and propose a simple index named KNN-Index. With KNN-Index, we can answer a kNN query optimally and progressively with a small, size-bounded index. To improve index construction performance, we propose a bidirectional construction algorithm which can effectively share common computation during construction. Theoretical analysis and experimental results on real road networks demonstrate the superiority of KNN-Index over the state-of-the-art approach in query processing performance, index size, and index construction efficiency.
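The direct-search baseline the abstract contrasts against (running Dijkstra's algorithm outward from the query vertex until k object-hosting vertices are settled) can be sketched as follows; the graph and names are hypothetical:

```python
import heapq

def knn_dijkstra(graph, source, objects, k):
    """Baseline kNN on a road network: settle vertices in increasing
    distance from `source` and collect the first k that host objects.
    This direct search is what index-based methods like KNN-Index avoid."""
    dist = {source: 0}
    heap = [(0, source)]
    found, seen = [], set()
    while heap and len(found) < k:
        d, u = heapq.heappop(heap)
        if u in seen:
            continue
        seen.add(u)
        if u in objects:
            found.append((u, d))
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return found

# Hypothetical tiny road network: adjacency lists with edge weights.
G = {
    "a": [("b", 2), ("c", 5)],
    "b": [("a", 2), ("c", 1), ("d", 4)],
    "c": [("a", 5), ("b", 1), ("e", 3)],
    "d": [("b", 4)],
    "e": [("c", 3)],
}
print(knn_dijkstra(G, "a", {"d", "e"}, 2))  # [('d', 6), ('e', 6)]
```

The search space is every vertex closer than the k-th result, which on a large network can be enormous even when k is small; this is the cost the paper's index eliminates.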
SEA-SQL: Semantic-Enhanced Text-to-SQL with Adaptive Refinement
Pub Date : 2024-08-09 DOI: arxiv-2408.04919
Chaofan Li, Yingxia Shao, Zheng Liu
Recent advancements in large language models (LLMs) have significantly contributed to progress on the Text-to-SQL task. A common requirement in many of these works is the post-correction of SQL queries. However, the majority of this process entails analyzing error cases to develop prompts with rules that eliminate model bias, and there is an absence of execution verification for SQL queries. In addition, the prevalent techniques primarily depend on GPT-4 and few-shot prompts, resulting in expensive costs. To investigate effective and cost-efficient methods for SQL refinement, we introduce Semantic-Enhanced Text-to-SQL with Adaptive Refinement (SEA-SQL), which includes Adaptive Bias Elimination and Dynamic Execution Adjustment and aims to improve performance while minimizing resource expenditure with zero-shot prompts. Specifically, SEA-SQL employs a semantic-enhanced schema to augment database information and optimize SQL queries. During SQL query generation, a fine-tuned adaptive bias eliminator is applied to mitigate inherent biases caused by the LLM. Dynamic execution adjustment is utilized to guarantee the executability of the bias-eliminated SQL query. We conduct experiments on the Spider and BIRD datasets to demonstrate the effectiveness of this framework. The results demonstrate that SEA-SQL achieves state-of-the-art performance in the GPT-3.5 scenario at 9%-58% of the generation cost. Furthermore, SEA-SQL is comparable to GPT-4 at only 0.9%-5.3% of the generation cost.
A Survey of NL2SQL with Large Language Models: Where are we, and where are we going?
Pub Date : 2024-08-09 DOI: arxiv-2408.05109
Xinyu Liu, Shuyu Shen, Boyan Li, Peixian Ma, Runzhi Jiang, Yuyu Luo, Yuxin Zhang, Ju Fan, Guoliang Li, Nan Tang
Translating users' natural language queries (NL) into SQL queries (i.e., NL2SQL) can significantly reduce barriers to accessing relational databases and support various commercial applications. The performance of NL2SQL has been greatly enhanced with the emergence of Large Language Models (LLMs). In this survey, we provide a comprehensive review of NL2SQL techniques powered by LLMs, covering the entire lifecycle from the following four aspects: (1) Model: NL2SQL translation techniques that tackle not only NL ambiguity and under-specification, but also properly map NL to database schemas and instances; (2) Data: from the collection of training data and data synthesis due to training data scarcity, to NL2SQL benchmarks; (3) Evaluation: evaluating NL2SQL methods from multiple angles using different metrics and granularities; and (4) Error Analysis: analyzing NL2SQL errors to find root causes and guide the evolution of NL2SQL models. Moreover, we provide a rule of thumb for developing NL2SQL solutions. Finally, we discuss the research challenges and open problems of NL2SQL in the LLMs era.
HotStuff-1: Linear Consensus with One-Phase Speculation
Pub Date : 2024-08-08 DOI: arxiv-2408.04728
Dakai Kang, Suyash Gupta, Dahlia Malkhi, Mohammad Sadoghi
This paper introduces HotStuff-1, a BFT consensus protocol that improves the latency of HotStuff-2 by two network hops while maintaining linear communication complexity against faults. Additionally, HotStuff-1 incorporates an incentive-compatible leader rotation regime that motivates leaders to commit consensus decisions promptly.

HotStuff-1 achieves the two-hop reduction by speculatively sending clients early finality confirmations after one phase of the protocol. Unlike previous speculation regimes, the early finality confirmation path of HotStuff-1 is fault-tolerant, and the latency improvement does not rely on optimism. An important consideration for speculation regimes in general, referred to as the prefix speculation dilemma, is exposed and resolved.

HotStuff-1 embodies an additional mechanism, slotting, that thwarts real-world delays caused by rationally-incentivized leaders. Leaders may also be inclined to sabotage each other's progress. The slotting mechanism allows leaders to drive multiple decisions, thus mitigating both threats, while dynamically adapting the number of allowed decisions per leader to network transmission delays.
{"title":"HotStuff-1: Linear Consensus with One-Phase Speculation","authors":"Dakai Kang, Suyash Gupta, Dahlia Malkhi, Mohammad Sadoghi","doi":"arxiv-2408.04728","DOIUrl":"https://doi.org/arxiv-2408.04728","url":null,"abstract":"This paper introduces HotStuff-1, a BFT consensus protocol that improves the\u0000latency of HotStuff-2 by two network-hops while maintaining linear\u0000communication complexity against faults. Additionally, HotStuff-1 incorporates\u0000an incentive-compatible leader rotation regime that motivates leaders to commit\u0000consensus decisions promptly. HotStuff-1 achieves a reduction by two network hops by sending clients early\u0000finality confirmations speculatively, after one phase of the protocol. Unlike\u0000previous speculation regimes, the early finality confirmation path of\u0000HotStuff-1 is fault-tolerant and the latency improvement does not rely on\u0000optimism. An important consideration for speculation regimes in general, which\u0000is referred to as the prefix speculation dilemma, is exposed and resolved. HotStuff-1 embodies an additional mechanism, slotting, that thwarts\u0000real-world delays caused by rationally-incentivized leaders. Leaders may also\u0000be inclined to sabotage each other's progress. 
The slotting mechanism allows\u0000leaders to drive multiple decisions, thus mitigating both threats, while\u0000dynamically adapting the number of allowed decisions per leader to network\u0000transmission delays.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141938139","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
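The one-phase early-confirmation idea can be sketched in a few lines: after the first voting phase, the leader speculatively confirms a value to the client once a quorum of matching votes arrives. This is a toy illustration under assumed names (`speculative_confirm`, a flat `n - f` quorum); the actual protocol relies on chained quorum certificates and a fault-tolerant confirmation path not modeled here.

```python
from collections import Counter

def speculative_confirm(n_replicas, n_faulty, phase1_votes):
    """Toy model of one-phase speculation: after the first voting phase,
    the leader may send the client an early, speculative finality
    confirmation once at least n - f replicas voted for the same value.
    The flat quorum rule is an illustrative assumption, not the paper's
    exact mechanism."""
    quorum = n_replicas - n_faulty
    tally = Counter(phase1_votes)
    if not tally:
        return None
    value, count = tally.most_common(1)[0]
    # Confirm speculatively only when a quorum agrees on one value.
    return value if count >= quorum else None
```

With n = 4 and f = 1 the quorum is 3, so three matching phase-one votes suffice for an early confirmation, two network hops before the ordinary commit path would complete.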
CREST: Effectively Compacting a Datastore For Retrieval-Based Speculative Decoding
Pub Date : 2024-08-08 DOI: arxiv-2408.04678
Sophia Ho, Jinsol Park, Patrick Wang
We present CREST (Compact Retrieval-Based Speculative Decoding), a redesign of REST that allows it to be effectively "compacted". REST is a drafting technique for speculative decoding based on retrieving exact n-gram matches of the most recent n tokens generated by the target LLM from a datastore. The key idea of CREST is to store only a subset of the smallest and most common n-grams in the datastore, with the hope of achieving comparable performance with less storage space. We found that storing a subset of n-grams both reduces storage space and improves performance. CREST matches REST's accepted token length with 10.6-13.5x less storage space and achieves a 16.5-17.1% higher acceptance length than REST using the same storage space on the HumanEval and MT Bench benchmarks.
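The compaction idea can be sketched as follows: collect n-gram → continuation lists from a token stream, keep only the most common n-gram keys, and draft by exact match on the most recent n tokens. Function names (`build_compact_datastore`, `draft_next`) and the keep-fraction parameter are illustrative assumptions, not the paper's actual API.

```python
from collections import Counter

def build_compact_datastore(tokens, n, keep_fraction):
    """Collect n-gram -> next-token continuations, then keep only the most
    common n-gram keys — a CREST-style compaction of a REST datastore."""
    table = {}
    counts = Counter()
    for i in range(len(tokens) - n):
        key = tuple(tokens[i:i + n])
        table.setdefault(key, []).append(tokens[i + n])
        counts[key] += 1
    keep = max(1, int(len(counts) * keep_fraction))
    kept_keys = {k for k, _ in counts.most_common(keep)}
    return {k: v for k, v in table.items() if k in kept_keys}

def draft_next(datastore, recent_tokens, n):
    """Retrieve a draft token by exact match on the last n tokens; return
    None when the n-gram was pruned or never seen."""
    candidates = datastore.get(tuple(recent_tokens[-n:]))
    if not candidates:
        return None
    # Draft the most frequent continuation observed for this n-gram.
    return Counter(candidates).most_common(1)[0][0]
```

Pruned n-grams simply fall back to the target model (a `None` draft), which is how a smaller datastore can trade storage for acceptance length.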
Programmable Dataflows: Abstraction and Programming Model for Data Sharing
Pub Date : 2024-08-07 DOI: arxiv-2408.04092
Siyuan Xia, Chris Zhu, Tapan Srivastava, Bridget Fahey, Raul Castro Fernandez
Data sharing is central to a wide variety of applications such as fraud detection, ad matching, and research. The lack of data sharing abstractions makes the solution to each data sharing problem bespoke and cost-intensive, hampering value generation. In this paper, we first introduce a data sharing model to represent every data sharing problem with a sequence of dataflows. From the model, we distill an abstraction, the contract, which agents use to communicate the intent of a dataflow and evaluate its consequences before the dataflow takes place. This helps agents move towards a common sharing goal without violating any regulatory and privacy constraints. Then, we design and implement the contract programming model (CPM), which allows agents to program data sharing applications catered to each problem's needs. Contracts permit data sharing, but their interactive nature may introduce inefficiencies. To mitigate those inefficiencies, we extend the CPM so that it can save intermediate outputs of dataflows and skip computation if a dataflow tries to access data that it does not have access to. In our evaluation, we show that 1) the contract abstraction is general enough to represent a wide range of sharing problems, 2) we can write programs for complex data sharing problems that exhibit qualitative improvements over other alternate technologies, and 3) quantitatively, our optimizations make sharing programs written with the CPM efficient.
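A toy sketch of the contract abstraction and the two optimizations mentioned above (saving intermediate outputs, skipping unauthorized dataflows); the class and method names here are hypothetical, not the CPM's actual API.

```python
class Contract:
    """Toy contract: records a dataflow's intent and inputs so agents can
    evaluate its consequences before any data moves. Hypothetical API,
    far simpler than the CPM described in the paper."""

    def __init__(self, intent, inputs, compute):
        self.intent = intent        # human-readable purpose of the flow
        self.inputs = set(inputs)   # names of datasets the flow will read
        self.compute = compute      # runs only after access is granted
        self._cache = None          # saved intermediate output

    def consequences(self):
        # Agents inspect intent and reads before the dataflow takes place.
        return {"intent": self.intent, "reads": sorted(self.inputs)}

    def run(self, granted, datasets):
        # Skip computation entirely if the flow would touch data it
        # does not have access to.
        if not self.inputs <= set(granted):
            return None
        if self._cache is None:     # reuse the saved output on later runs
            self._cache = self.compute({k: datasets[k] for k in self.inputs})
        return self._cache
```

The `_cache` field stands in for the intermediate-output store: a second authorized run returns the saved result instead of recomputing the dataflow.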