Learned Indexes with Distribution Smoothing via Virtual Points
Kasun Amarasinghe, Farhana Choudhury, Jianzhong Qi, James Bailey
Recent research on learned indexes has created a new perspective for indexes as models that map keys to their respective storage locations. These learned indexes are created to approximate the cumulative distribution function of the key set, where using only a single model may have limited accuracy. To overcome this limitation, a typical method is to use multiple models, arranged in a hierarchical manner, where the query performance depends on two aspects: (i) traversal time to find the correct model and (ii) search time to find the key in the selected model. Such a method may cause some key space regions that are difficult to model to be placed at deeper levels in the hierarchy. To address this issue, we propose an alternative method that modifies the key space as opposed to any structural or model modifications. This is achieved through making the key set more learnable (i.e., smoothing the distribution) by inserting virtual points. Further, we develop an algorithm named CSV to integrate our virtual point insertion method into existing learned indexes, reducing both their traversal and search time. We implement CSV on state-of-the-art learned indexes and evaluate them on real-world datasets. The extensive experimental results show significant query performance improvement for the keys in deeper levels of the index structures at a low storage cost.
{"title":"Learned Indexes with Distribution Smoothing via Virtual Points","authors":"Kasun Amarasinghe, Farhana Choudhury, Jianzhong Qi, James Bailey","doi":"arxiv-2408.06134","DOIUrl":"https://doi.org/arxiv-2408.06134","url":null,"abstract":"Recent research on learned indexes has created a new perspective for indexes\u0000as models that map keys to their respective storage locations. These learned\u0000indexes are created to approximate the cumulative distribution function of the\u0000key set, where using only a single model may have limited accuracy. To overcome\u0000this limitation, a typical method is to use multiple models, arranged in a\u0000hierarchical manner, where the query performance depends on two aspects: (i)\u0000traversal time to find the correct model and (ii) search time to find the key\u0000in the selected model. Such a method may cause some key space regions that are\u0000difficult to model to be placed at deeper levels in the hierarchy. To address\u0000this issue, we propose an alternative method that modifies the key space as\u0000opposed to any structural or model modifications. This is achieved through\u0000making the key set more learnable (i.e., smoothing the distribution) by\u0000inserting virtual points. Further, we develop an algorithm named CSV to\u0000integrate our virtual point insertion method into existing learned indexes,\u0000reducing both their traversal and search time. We implement CSV on\u0000state-of-the-art learned indexes and evaluate them on real-world datasets. The\u0000extensive experimental results show significant query performance improvement\u0000for the keys in deeper levels of the index structures at a low storage cost.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":"10 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223581","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Exploiting Formal Concept Analysis for Data Modeling in Data Lakes
Anes Bendimerad, Romain Mathonat, Youcef Remil, Mehdi Kaytoue
Data lakes are widely used to store extensive and heterogeneous datasets for advanced analytics. However, the unstructured nature of data in these repositories introduces complexities in exploiting them and extracting meaningful insights. This motivates the need to explore efficient approaches for consolidating data lakes and deriving a common and unified schema. This paper introduces a practical data visualization and analysis approach rooted in Formal Concept Analysis (FCA) to systematically clean, organize, and design data structures within a data lake. We explore diverse data structures stored in our data lake at Infologic, including InfluxDB measurements and Elasticsearch indexes, aiming to derive conventions for a more accessible data model. Leveraging FCA, we represent data structures as objects, analyze the concept lattice, and present two strategies, top-down and bottom-up, to unify these structures and establish a common schema. Our methodology yields significant results, enabling the identification of common concepts in the data structures, such as resources along with their underlying shared fields (timestamp, type, usedRatio, etc.). Moreover, the number of distinct data structure field names is reduced by 54 percent (from 190 to 88) in the studied subset of our data lake. We achieve complete coverage of 80 percent of the data structures with only 34 distinct field names, a significant improvement over the initial 121 field names needed to reach such coverage. The paper provides insights into the Infologic ecosystem, the problem formulation, and the exploration strategies, and presents both qualitative and quantitative results.
{"title":"Exploiting Formal Concept Analysis for Data Modeling in Data Lakes","authors":"Anes Bendimerad, Romain Mathonat, Youcef Remil, Mehdi Kaytoue","doi":"arxiv-2408.13265","DOIUrl":"https://doi.org/arxiv-2408.13265","url":null,"abstract":"Data lakes are widely used to store extensive and heterogeneous datasets for\u0000advanced analytics. However, the unstructured nature of data in these\u0000repositories introduces complexities in exploiting them and extracting\u0000meaningful insights. This motivates the need of exploring efficient approaches\u0000for consolidating data lakes and deriving a common and unified schema. This\u0000paper introduces a practical data visualization and analysis approach rooted in\u0000Formal Concept Analysis (FCA) to systematically clean, organize, and design\u0000data structures within a data lake. We explore diverse data structures stored\u0000in our data lake at Infologic, including InfluxDB measurements and\u0000Elasticsearch indexes, aiming to derive conventions for a more accessible data\u0000model. Leveraging FCA, we represent data structures as objects, analyze the\u0000concept lattice, and present two strategies-top-down and bottom-up-to unify\u0000these structures and establish a common schema. Our methodology yields\u0000significant results, enabling the identification of common concepts in the data\u0000structures, such as resources along with their underlying shared fields\u0000(timestamp, type, usedRatio, etc.). Moreover, the number of distinct data\u0000structure field names is reduced by 54 percent (from 190 to 88) in the studied\u0000subset of our data lake. We achieve a complete coverage of 80 percent of data\u0000structures with only 34 distinct field names, a significant improvement from\u0000the initial 121 field names that were needed to reach such coverage. The paper\u0000provides insights into the Infologic ecosystem, problem formulation,\u0000exploration strategies, and presents both qualitative and quantitative results.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":"24 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223585","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Memento Filter: A Fast, Dynamic, and Robust Range Filter
Navid Eslami, Niv Dayan
Range filters are probabilistic data structures that answer approximate range emptiness queries. They aid in avoiding processing empty range queries and have use cases in many application domains such as key-value stores and social web analytics. However, current range filter designs do not support dynamically changing and growing datasets. Moreover, several of these designs also exhibit impractically high false positive rates under correlated workloads, which are common in practice. These impediments restrict the applicability of range filters across a wide range of use cases. We introduce Memento filter, the first range filter to offer dynamicity, fast operations, and a robust false positive rate guarantee for any workload. Memento filter partitions the key universe and clusters its keys according to this partitioning. For each cluster, it stores a fingerprint and a list of key suffixes contiguously. The encoding of these lists makes them amenable to existing dynamic filter structures. Due to the well-defined one-to-one mapping from keys to suffixes, Memento filter supports inserts and deletes and can even expand to accommodate a growing dataset. We implement Memento filter on top of a Rank-and-Select Quotient filter and InfiniFilter and demonstrate that it achieves competitive false positive rates and performance with the state-of-the-art while also providing dynamicity. Due to its dynamicity, Memento filter is the first range filter applicable to B-Trees. We showcase this by integrating Memento filter into WiredTiger, a B-Tree-based key-value store. Memento filter doubles WiredTiger's range query throughput when 50% of the queries are empty while keeping all other cost metrics unharmed.
{"title":"Memento Filter: A Fast, Dynamic, and Robust Range Filter","authors":"Navid Eslami, Niv Dayan","doi":"arxiv-2408.05625","DOIUrl":"https://doi.org/arxiv-2408.05625","url":null,"abstract":"Range filters are probabilistic data structures that answer approximate range\u0000emptiness queries. They aid in avoiding processing empty range queries and have\u0000use cases in many application domains such as key-value stores and social web\u0000analytics. However, current range filter designs do not support dynamically\u0000changing and growing datasets. Moreover, several of these designs also exhibit\u0000impractically high false positive rates under correlated workloads, which are\u0000common in practice. These impediments restrict the applicability of range\u0000filters across a wide range of use cases. We introduce Memento filter, the first range filter to offer dynamicity, fast\u0000operations, and a robust false positive rate guarantee for any workload.\u0000Memento filter partitions the key universe and clusters its keys according to\u0000this partitioning. For each cluster, it stores a fingerprint and a list of key\u0000suffixes contiguously. The encoding of these lists makes them amenable to\u0000existing dynamic filter structures. Due to the well-defined one-to-one mapping\u0000from keys to suffixes, Memento filter supports inserts and deletes and can even\u0000expand to accommodate a growing dataset. We implement Memento filter on top of a Rank-and-Select Quotient filter and\u0000InfiniFilter and demonstrate that it achieves competitive false positive rates\u0000and performance with the state-of-the-art while also providing dynamicity. Due\u0000to its dynamicity, Memento filter is the first range filter applicable to\u0000B-Trees. We showcase this by integrating Memento filter into WiredTiger, a\u0000B-Tree-based key-value store. Memento filter doubles WiredTiger's range query\u0000throughput when 50% of the queries are empty while keeping all other cost\u0000metrics unharmed.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":"5 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223583","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Context-Driven Index Trimming: A Data Quality Perspective to Enhancing Precision of RALMs
Kexin Ma, Ruochun Jin, Xi Wang, Huan Chen, Jing Ren, Yuhua Tang
Retrieval-Augmented Large Language Models (RALMs) have made significant strides in enhancing the accuracy of generated responses. However, existing research often overlooks data quality issues within retrieval results, often caused by inaccurate vector-distance-based retrieval methods. We propose to boost the precision of RALMs' answers from a data quality perspective through the Context-Driven Index Trimming (CDIT) framework, where Context Matching Dependencies (CMDs) are employed as logical data quality rules to capture and regulate the consistency between retrieved contexts. Based on the semantic comprehension capabilities of Large Language Models (LLMs), CDIT can effectively identify and discard retrieval results that are inconsistent with the query context and further modify indexes in the database, thereby improving answer quality. Experiments on challenging question-answering tasks demonstrate its effectiveness. The flexibility of CDIT is also verified through its compatibility with various language models and indexing methods, offering a promising approach to jointly bolstering RALMs' data quality and retrieval precision.
{"title":"Context-Driven Index Trimming: A Data Quality Perspective to Enhancing Precision of RALMs","authors":"Kexin Ma, Ruochun Jin, Xi Wang, Huan Chen, Jing Ren, Yuhua Tang","doi":"arxiv-2408.05524","DOIUrl":"https://doi.org/arxiv-2408.05524","url":null,"abstract":"Retrieval-Augmented Large Language Models (RALMs) have made significant\u0000strides in enhancing the accuracy of generated responses.However, existing\u0000research often overlooks the data quality issues within retrieval results,\u0000often caused by inaccurate existing vector-distance-based retrieval methods.We\u0000propose to boost the precision of RALMs' answers from a data quality\u0000perspective through the Context-Driven Index Trimming (CDIT) framework, where\u0000Context Matching Dependencies (CMDs) are employed as logical data quality rules\u0000to capture and regulate the consistency between retrieved contexts.Based on the\u0000semantic comprehension capabilities of Large Language Models (LLMs), CDIT can\u0000effectively identify and discard retrieval results that are inconsistent with\u0000the query context and further modify indexes in the database, thereby improving\u0000answer quality.Experiments demonstrate on challenging question-answering\u0000tasks.Also, the flexibility of CDIT is verified through its compatibility with\u0000various language models and indexing methods, which offers a promising approach\u0000to bolster RALMs' data quality and retrieval precision jointly.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":"5 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223584","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Simpler is More: Efficient Top-K Nearest Neighbors Search on Large Road Networks
Yiqi Wang, Long Yuan, Wenjie Zhang, Xuemin Lin, Zi Chen, Qing Liu
The top-k nearest neighbors (kNN) problem on road networks has numerous applications in location-based services. As direct search using Dijkstra's algorithm results in a large search space, a plethora of complex-index-based approaches have been proposed to speed up query processing. However, even with the current state-of-the-art approach, long query processing delays persist, along with significant space overhead and prohibitively long indexing time. In this paper, we depart from the complex index designs prevalent in existing literature and propose a simple index named KNN-Index. With KNN-Index, we can answer a kNN query optimally and progressively with a small, size-bounded index. To improve index construction performance, we propose a bidirectional construction algorithm which effectively shares common computation during construction. Theoretical analysis and experimental results on real road networks demonstrate the superiority of KNN-Index over the state-of-the-art approach in query processing performance, index size, and index construction efficiency.
{"title":"Simpler is More: Efficient Top-K Nearest Neighbors Search on Large Road Networks","authors":"Yiqi Wang, Long Yuan, Wenjie Zhang, Xuemin Lin, Zi Chen, Qing Liu","doi":"arxiv-2408.05432","DOIUrl":"https://doi.org/arxiv-2408.05432","url":null,"abstract":"Top-k Nearest Neighbors (kNN) problem on road network has numerous\u0000applications on location-based services. As direct search using the Dijkstra's\u0000algorithm results in a large search space, a plethora of complex-index-based\u0000approaches have been proposed to speedup the query processing. However, even\u0000with the current state-of-the-art approach, long query processing delays\u0000persist, along with significant space overhead and prohibitively long indexing\u0000time. In this paper, we depart from the complex index designs prevalent in\u0000existing literature and propose a simple index named KNN-Index. With KNN-Index,\u0000we can answer a kNN query optimally and progressively with small and\u0000size-bounded index. To improve the index construction performance, we propose a\u0000bidirectional construction algorithm which can effectively share the common\u0000computation during the construction. Theoretical analysis and experimental\u0000results on real road networks demonstrate the superiority of KNN-Index over the\u0000state-of-the-art approach in query processing performance, index size, and\u0000index construction efficiency.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":"6 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223582","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SEA-SQL: Semantic-Enhanced Text-to-SQL with Adaptive Refinement
Chaofan Li, Yingxia Shao, Zheng Liu
Recent advancements in large language models (LLMs) have significantly contributed to the progress of the Text-to-SQL task. A common requirement in many of these works is the post-correction of SQL queries. However, most of this process entails analyzing error cases to develop prompts with rules that eliminate model bias, and execution verification for SQL queries is absent. In addition, the prevalent techniques primarily depend on GPT-4 and few-shot prompts, resulting in high costs. To investigate effective methods for SQL refinement in a cost-efficient manner, we introduce Semantic-Enhanced Text-to-SQL with Adaptive Refinement (SEA-SQL), which includes Adaptive Bias Elimination and Dynamic Execution Adjustment and aims to improve performance while minimizing resource expenditure with zero-shot prompts. Specifically, SEA-SQL employs a semantic-enhanced schema to augment database information and optimize SQL queries. During SQL query generation, a fine-tuned adaptive bias eliminator is applied to mitigate inherent biases caused by the LLM. The dynamic execution adjustment is used to guarantee the executability of the bias-eliminated SQL query. We conduct experiments on the Spider and BIRD datasets to demonstrate the effectiveness of this framework. The results show that SEA-SQL achieves state-of-the-art performance in the GPT-3.5 scenario with 9%-58% of the generation cost. Furthermore, SEA-SQL is comparable to GPT-4 with only 0.9%-5.3% of the generation cost.
{"title":"SEA-SQL: Semantic-Enhanced Text-to-SQL with Adaptive Refinement","authors":"Chaofan Li, Yingxia Shao, Zheng Liu","doi":"arxiv-2408.04919","DOIUrl":"https://doi.org/arxiv-2408.04919","url":null,"abstract":"Recent advancements in large language models (LLMs) have significantly\u0000contributed to the progress of the Text-to-SQL task. A common requirement in\u0000many of these works is the post-correction of SQL queries. However, the\u0000majority of this process entails analyzing error cases to develop prompts with\u0000rules that eliminate model bias. And there is an absence of execution\u0000verification for SQL queries. In addition, the prevalent techniques primarily\u0000depend on GPT-4 and few-shot prompts, resulting in expensive costs. To\u0000investigate the effective methods for SQL refinement in a cost-efficient\u0000manner, we introduce Semantic-Enhanced Text-to-SQL with Adaptive Refinement\u0000(SEA-SQL), which includes Adaptive Bias Elimination and Dynamic Execution\u0000Adjustment, aims to improve performance while minimizing resource expenditure\u0000with zero-shot prompts. Specifically, SEA-SQL employs a semantic-enhanced\u0000schema to augment database information and optimize SQL queries. During the SQL\u0000query generation, a fine-tuned adaptive bias eliminator is applied to mitigate\u0000inherent biases caused by the LLM. The dynamic execution adjustment is utilized\u0000to guarantee the executability of the bias eliminated SQL query. We conduct\u0000experiments on the Spider and BIRD datasets to demonstrate the effectiveness of\u0000this framework. The results demonstrate that SEA-SQL achieves state-of-the-art\u0000performance in the GPT3.5 scenario with 9%-58% of the generation cost.\u0000Furthermore, SEA-SQL is comparable to GPT-4 with only 0.9%-5.3% of the\u0000generation cost.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":"58 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141938138","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Survey of NL2SQL with Large Language Models: Where are we, and where are we going?
Xinyu Liu, Shuyu Shen, Boyan Li, Peixian Ma, Runzhi Jiang, Yuyu Luo, Yuxin Zhang, Ju Fan, Guoliang Li, Nan Tang
Translating users' natural language queries (NL) into SQL queries (i.e., NL2SQL) can significantly reduce barriers to accessing relational databases and support various commercial applications. The performance of NL2SQL has been greatly enhanced with the emergence of Large Language Models (LLMs). In this survey, we provide a comprehensive review of NL2SQL techniques powered by LLMs, covering its entire lifecycle from the following four aspects: (1) Model: NL2SQL translation techniques that tackle not only NL ambiguity and under-specification, but also properly map NL with database schema and instances; (2) Data: From the collection of training data, data synthesis due to training data scarcity, to NL2SQL benchmarks; (3) Evaluation: Evaluating NL2SQL methods from multiple angles using different metrics and granularities; and (4) Error Analysis: analyzing NL2SQL errors to find the root cause and guiding NL2SQL models to evolve. Moreover, we provide a rule of thumb for developing NL2SQL solutions. Finally, we discuss the research challenges and open problems of NL2SQL in the LLMs era.
{"title":"A Survey of NL2SQL with Large Language Models: Where are we, and where are we going?","authors":"Xinyu Liu, Shuyu Shen, Boyan Li, Peixian Ma, Runzhi Jiang, Yuyu Luo, Yuxin Zhang, Ju Fan, Guoliang Li, Nan Tang","doi":"arxiv-2408.05109","DOIUrl":"https://doi.org/arxiv-2408.05109","url":null,"abstract":"Translating users' natural language queries (NL) into SQL queries (i.e.,\u0000NL2SQL) can significantly reduce barriers to accessing relational databases and\u0000support various commercial applications. The performance of NL2SQL has been\u0000greatly enhanced with the emergence of Large Language Models (LLMs). In this\u0000survey, we provide a comprehensive review of NL2SQL techniques powered by LLMs,\u0000covering its entire lifecycle from the following four aspects: (1) Model:\u0000NL2SQL translation techniques that tackle not only NL ambiguity and\u0000under-specification, but also properly map NL with database schema and\u0000instances; (2) Data: From the collection of training data, data synthesis due\u0000to training data scarcity, to NL2SQL benchmarks; (3) Evaluation: Evaluating\u0000NL2SQL methods from multiple angles using different metrics and granularities;\u0000and (4) Error Analysis: analyzing NL2SQL errors to find the root cause and\u0000guiding NL2SQL models to evolve. Moreover, we provide a rule of thumb for\u0000developing NL2SQL solutions. Finally, we discuss the research challenges and\u0000open problems of NL2SQL in the LLMs era.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":"271 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141938137","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
HotStuff-1: Linear Consensus with One-Phase Speculation
Dakai Kang, Suyash Gupta, Dahlia Malkhi, Mohammad Sadoghi
This paper introduces HotStuff-1, a BFT consensus protocol that improves the latency of HotStuff-2 by two network hops while maintaining linear communication complexity against faults. Additionally, HotStuff-1 incorporates an incentive-compatible leader rotation regime that motivates leaders to commit consensus decisions promptly. HotStuff-1 achieves this two-hop reduction by speculatively sending clients early finality confirmations after one phase of the protocol. Unlike previous speculation regimes, the early finality confirmation path of HotStuff-1 is fault-tolerant, and the latency improvement does not rely on optimism. An important consideration for speculation regimes in general, referred to as the prefix speculation dilemma, is exposed and resolved. HotStuff-1 embodies an additional mechanism, slotting, that thwarts real-world delays caused by rationally-incentivized leaders. Leaders may also be inclined to sabotage each other's progress. The slotting mechanism allows leaders to drive multiple decisions, thus mitigating both threats, while dynamically adapting the number of allowed decisions per leader to network transmission delays.
{"title":"HotStuff-1: Linear Consensus with One-Phase Speculation","authors":"Dakai Kang, Suyash Gupta, Dahlia Malkhi, Mohammad Sadoghi","doi":"arxiv-2408.04728","DOIUrl":"https://doi.org/arxiv-2408.04728","url":null,"abstract":"This paper introduces HotStuff-1, a BFT consensus protocol that improves the\u0000latency of HotStuff-2 by two network-hops while maintaining linear\u0000communication complexity against faults. Additionally, HotStuff-1 incorporates\u0000an incentive-compatible leader rotation regime that motivates leaders to commit\u0000consensus decisions promptly. HotStuff-1 achieves a reduction by two network hops by sending clients early\u0000finality confirmations speculatively, after one phase of the protocol. Unlike\u0000previous speculation regimes, the early finality confirmation path of\u0000HotStuff-1 is fault-tolerant and the latency improvement does not rely on\u0000optimism. An important consideration for speculation regimes in general, which\u0000is referred to as the prefix speculation dilemma, is exposed and resolved. HotStuff-1 embodies an additional mechanism, slotting, that thwarts\u0000real-world delays caused by rationally-incentivized leaders. Leaders may also\u0000be inclined to sabotage each other's progress. The slotting mechanism allows\u0000leaders to drive multiple decisions, thus mitigating both threats, while\u0000dynamically adapting the number of allowed decisions per leader to network\u0000transmission delays.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":"133 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141938139","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
CREST: Effectively Compacting a Datastore For Retrieval-Based Speculative Decoding
Sophia Ho, Jinsol Park, Patrick Wang
We present CREST (Compact Retrieval-Based Speculative Decoding), a redesign of REST that allows it to be effectively "compacted". REST is a drafting technique for speculative decoding based on retrieving exact n-gram matches of the most recent n tokens generated by the target LLM from a datastore. The key idea of CREST is to only store a subset of the smallest and most common n-grams in the datastore with the hope of achieving comparable performance with less storage space. We found that storing a subset of n-grams both reduces storage space and improves performance. CREST matches REST's accepted token length with 10.6-13.5x less storage space and achieves a 16.5-17.1% higher acceptance length than REST using the same storage space on the HumanEval and MT Bench benchmarks.
{"title":"CREST: Effectively Compacting a Datastore For Retrieval-Based Speculative Decoding","authors":"Sophia Ho, Jinsol Park, Patrick Wang","doi":"arxiv-2408.04678","DOIUrl":"https://doi.org/arxiv-2408.04678","url":null,"abstract":"We present CREST (Compact Retrieval-Based Speculative Decoding), a redesign\u0000of REST that allows it to be effectively \"compacted\". REST is a drafting\u0000technique for speculative decoding based on retrieving exact n-gram matches of\u0000the most recent n tokens generated by the target LLM from a datastore. The key\u0000idea of CREST is to only store a subset of the smallest and most common n-grams\u0000in the datastore with the hope of achieving comparable performance with less\u0000storage space. We found that storing a subset of n-grams both reduces storage\u0000space and improves performance. CREST matches REST's accepted token length with\u000010.6-13.5x less storage space and achieves a 16.5-17.1% higher acceptance\u0000length than REST using the same storage space on the HumanEval and MT Bench\u0000benchmarks.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":"24 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141938140","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Programmable Dataflows: Abstraction and Programming Model for Data Sharing
Siyuan Xia, Chris Zhu, Tapan Srivastava, Bridget Fahey, Raul Castro Fernandez
Data sharing is central to a wide variety of applications such as fraud detection, ad matching, and research. The lack of data sharing abstractions makes the solution to each data sharing problem bespoke and cost-intensive, hampering value generation. In this paper, we first introduce a data sharing model to represent every data sharing problem with a sequence of dataflows. From the model, we distill an abstraction, the contract, which agents use to communicate the intent of a dataflow and evaluate its consequences before the dataflow takes place. This helps agents move towards a common sharing goal without violating any regulatory and privacy constraints. Then, we design and implement the contract programming model (CPM), which allows agents to program data sharing applications catered to each problem's needs. Contracts permit data sharing, but their interactive nature may introduce inefficiencies. To mitigate those inefficiencies, we extend the CPM so that it can save intermediate outputs of dataflows and skip computation if a dataflow tries to access data that it does not have access to. In our evaluation, we show that 1) the contract abstraction is general enough to represent a wide range of sharing problems, 2) we can write programs for complex data sharing problems and exhibit qualitative improvements over alternative technologies, and 3) quantitatively, our optimizations make sharing programs written with the CPM efficient.
{"title":"Programmable Dataflows: Abstraction and Programming Model for Data Sharing","authors":"Siyuan Xia, Chris Zhu, Tapan Srivastava, Bridget Fahey, Raul Castro Fernandez","doi":"arxiv-2408.04092","DOIUrl":"https://doi.org/arxiv-2408.04092","url":null,"abstract":"Data sharing is central to a wide variety of applications such as fraud\u0000detection, ad matching, and research. The lack of data sharing abstractions\u0000makes the solution to each data sharing problem bespoke and cost-intensive,\u0000hampering value generation. In this paper, we first introduce a data sharing\u0000model to represent every data sharing problem with a sequence of dataflows.\u0000From the model, we distill an abstraction, the contract, which agents use to\u0000communicate the intent of a dataflow and evaluate its consequences, before the\u0000dataflow takes place. This helps agents move towards a common sharing goal\u0000without violating any regulatory and privacy constraints. Then, we design and\u0000implement the contract programming model (CPM), which allows agents to program\u0000data sharing applications catered to each problem's needs. Contracts permit data sharing, but their interactive nature may introduce\u0000inefficiencies. To mitigate those inefficiencies, we extend the CPM so that it\u0000can save intermediate outputs of dataflows, and skip computation if a dataflow\u0000tries to access data that it does not have access to. In our evaluation, we\u0000show that 1) the contract abstraction is general enough to represent a wide\u0000range of sharing problems, 2) we can write programs for complex data sharing\u0000problems and exhibit qualitative improvements over other alternate\u0000technologies, and 3) quantitatively, our optimizations make sharing programs\u0000written with the CPM efficient.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":"7 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141938141","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}