High Throughput Shortest Distance Query Processing on Large Dynamic Road Networks
Xinjie Zhou, Mengxuan Zhang, Lei Li, Xiaofang Zhou
Shortest path (SP) computation is the building block for many location-based services, and achieving high-throughput SP query processing is an essential goal for the real-time response of those services. However, the large number of queries submitted in large-scale dynamic road networks still poses challenges to this goal. Therefore, in this work, we propose a novel framework aiming to process SP queries with high throughput in large and dynamic road networks by leveraging the Partitioned Shortest Path (PSP) index. Specifically, we first put forward a cross-boundary strategy to accelerate the query processing of the PSP index and analyze its efficiency upper bound by identifying the curse of PSP index query efficiency. After that, we propose a non-trivial Partitioned Multi-stage Hub Labeling (PMHL) index that utilizes multiple PSP strategies and thread parallelization to achieve consecutive query-efficiency improvements and fast index maintenance. Finally, to further increase query throughput, we design tree-decomposition-based graph partitioning and propose Post-partitioned Multi-stage Hub Labeling (PostMHL), with faster query processing and index updates than PMHL. Experiments on real-world road networks show that our methods outperform state-of-the-art baselines in query throughput, yielding improvements of one to four orders of magnitude.
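PMHL and PostMHL build on hub labeling, where each vertex stores distances to a small set of hub vertices and a distance query only scans the two endpoint labels. As a rough, hedged illustration of that query primitive alone (not the paper's partitioned index; the labels and distances below are made up), a 2-hop label lookup in Python looks like this:

```python
from typing import Dict

INF = float("inf")

# Hypothetical 2-hop labels: for each vertex, a map {hub: distance to hub}.
labels: Dict[str, Dict[str, float]] = {
    "s": {"s": 0, "h1": 2, "h2": 5},
    "t": {"t": 0, "h1": 4, "h2": 1},
}

def hub_label_query(u: str, v: str) -> float:
    """Distance via the best common hub; exact when the labeling
    satisfies the 2-hop cover property."""
    lu, lv = labels[u], labels[v]
    if len(lu) > len(lv):          # scan the smaller label, probe the larger
        lu, lv = lv, lu
    best = INF
    for hub, d_u in lu.items():
        d_v = lv.get(hub)
        if d_v is not None:
            best = min(best, d_u + d_v)
    return best

print(hub_label_query("s", "t"))   # 6, via h1 (2+4) or h2 (5+1)
```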
{"title":"High Throughput Shortest Distance Query Processing on Large Dynamic Road Networks","authors":"Xinjie Zhou, Mengxuan Zhang, Lei Li, Xiaofang Zhou","doi":"arxiv-2409.06148","DOIUrl":"https://doi.org/arxiv-2409.06148","url":null,"abstract":"Shortest path (SP) computation is the building block for many location-based\u0000services, and achieving high throughput SP query processing is an essential\u0000goal for the real-time response of those services. However, the large number of\u0000queries submitted in large-scale dynamic road networks still poses challenges\u0000to this goal. Therefore, in this work, we propose a novel framework aiming to\u0000process SP queries with high throughput in large and dynamic road networks, by\u0000leveraging the Partitioned Shortest Path (PSP) index. Specifically, we first\u0000put forward a cross-boundary strategy to accelerate the query processing of PSP\u0000index and analyze its efficiency upper-bound by discovering the curse of PSP\u0000index query efficiency. After that, we propose a non-trivial Partitioned\u0000Multi-stage Hub Labeling (PMHL) that utilizes multiple PSP strategies and\u0000thread parallelization to achieve consecutive query efficiency improvement and\u0000fast index maintenance. Finally, to further increase query throughput, we\u0000design tree decomposition-based graph partitioning and propose Post-partitioned\u0000Multi-stage Hub Labeling (PostMHL) with faster query processing and index\u0000update than PMHL. Experiments on real-world road networks show that our methods\u0000outperform state-of-the-art baselines in query throughput, yielding up to 1-4\u0000orders of magnitude improvement.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223231","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A System and Benchmark for LLM-based Q&A on Heterogeneous Data
Achille Fokoue, Srideepika Jayaraman, Elham Khabiri, Jeffrey O. Kephart, Yingjie Li, Dhruv Shah, Youssef Drissi, Fenno F. Heath III, Anu Bhamidipaty, Fateh A. Tipu, Robert J. Baseman
In many industrial settings, users wish to ask questions whose answers may be found in structured data sources such as spreadsheets, databases, APIs, or combinations thereof. Often, the user does not know how to identify or access the right data source. This problem is compounded even further if multiple (and potentially siloed) data sources must be assembled to derive the answer. Recently, various Text-to-SQL applications that leverage Large Language Models (LLMs) have addressed some of these problems by enabling users to ask questions in natural language. However, these applications remain impractical in realistic industrial settings because they fail to cope with the data source heterogeneity that typifies such environments. In this paper, we address heterogeneity by introducing the siwarex platform, which enables seamless natural language access to both databases and APIs. To demonstrate the effectiveness of siwarex, we extend the popular Spider dataset and benchmark by replacing some of its tables with data retrieval APIs. We find that siwarex copes well with data source heterogeneity. Our modified Spider benchmark will soon be available to the research community.
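The abstract does not describe siwarex's internals, but the benchmark change it mentions (tables replaced by data retrieval APIs) can be pictured as the same logical data being reachable through two very different access paths. The sketch below is purely hypothetical: the schema, endpoint, and function names are invented to illustrate the heterogeneity a Q&A system must route across.

```python
import json
import sqlite3
from urllib.parse import urlencode
from urllib.request import urlopen

def singers_from_db(conn: sqlite3.Connection, country: str) -> list:
    # SQL access path over a relational table (schema invented for the sketch).
    cur = conn.execute("SELECT name, age FROM singer WHERE country = ?", (country,))
    return [{"name": name, "age": age} for name, age in cur.fetchall()]

def singers_from_api(base_url: str, country: str) -> list:
    # The same logical data, now only reachable through a REST-style call.
    with urlopen(f"{base_url}/singers?{urlencode({'country': country})}") as resp:
        return json.loads(resp.read())

# A heterogeneity-aware Q&A system must pick the right accessor per question;
# the modified Spider benchmark exercises exactly this kind of routing.
```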
{"title":"A System and Benchmark for LLM-based Q&A on Heterogeneous Data","authors":"Achille Fokoue, Srideepika Jayaraman, Elham Khabiri, Jeffrey O. Kephart, Yingjie Li, Dhruv Shah, Youssef Drissi, Fenno F. Heath III, Anu Bhamidipaty, Fateh A. Tipu, Robert J. Baseman","doi":"arxiv-2409.05735","DOIUrl":"https://doi.org/arxiv-2409.05735","url":null,"abstract":"In many industrial settings, users wish to ask questions whose answers may be\u0000found in structured data sources such as a spreadsheets, databases, APIs, or\u0000combinations thereof. Often, the user doesn't know how to identify or access\u0000the right data source. This problem is compounded even further if multiple (and\u0000potentially siloed) data sources must be assembled to derive the answer.\u0000Recently, various Text-to-SQL applications that leverage Large Language Models\u0000(LLMs) have addressed some of these problems by enabling users to ask questions\u0000in natural language. However, these applications remain impractical in\u0000realistic industrial settings because they fail to cope with the data source\u0000heterogeneity that typifies such environments. In this paper, we address\u0000heterogeneity by introducing the siwarex platform, which enables seamless\u0000natural language access to both databases and APIs. To demonstrate the\u0000effectiveness of siwarex, we extend the popular Spider dataset and benchmark by\u0000replacing some of its tables by data retrieval APIs. We find that siwarex does\u0000a good job of coping with data source heterogeneity. Our modified Spider\u0000benchmark will soon be available to the research community","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223232","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Efficient Rare Temporal Pattern Mining in Time Series
Van Ho Long, Nguyen Ho, Trinh Le Cong, Anh-Vu Dinh-Duc, Tu Nguyen Ngoc
Time series data from various domains are increasing continuously. Extracting and analyzing the temporal patterns in these series can reveal significant insights. Temporal pattern mining (TPM) extends traditional pattern mining by incorporating event time intervals into extracted patterns, enhancing their expressiveness but increasing time and space complexity. One valuable type of temporal pattern is the rare temporal pattern (RTP), which occurs rarely but with high confidence. Mining rare temporal patterns poses several challenges: the support threshold must be set very low, which further aggravates the combinatorial explosion and can yield many uninteresting patterns. Thus, an efficient approach to rare temporal pattern mining is needed. This paper introduces our Rare Temporal Pattern Mining from Time Series (RTPMfTS) method for discovering rare temporal patterns, featuring the following key contributions: (1) an end-to-end RTPMfTS process that takes time series data as input and yields rare temporal patterns as output; (2) an efficient Rare Temporal Pattern Mining (RTPM) algorithm that uses optimized data structures for quick event and pattern retrieval and effective pruning techniques for much faster mining; and (3) a thorough experimental evaluation of RTPM, showing that it outperforms the baseline in terms of runtime and memory usage.
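The paper's formal definitions of support and confidence for temporal patterns are not given in this abstract. As a hedged sketch of the "rare but confident" selection idea only (the thresholds and the confidence formula below are assumptions, not the paper's definitions), the filtering criterion can be pictured as:

```python
# A pattern is kept if it occurs in few sequences (rare) yet, whenever its
# prefix occurs, the full pattern tends to follow (confident).

def support(pattern_occ: int, num_sequences: int) -> float:
    return pattern_occ / num_sequences

def is_rare_temporal_pattern(pattern_occ: int, prefix_occ: int,
                             num_sequences: int,
                             max_sup: float = 0.05,
                             min_conf: float = 0.8) -> bool:
    sup = support(pattern_occ, num_sequences)
    conf = pattern_occ / prefix_occ if prefix_occ else 0.0
    return sup <= max_sup and conf >= min_conf

# Example: a pattern seen in 4 of 100 sequences, whose prefix occurs in 5.
print(is_rare_temporal_pattern(4, 5, 100))  # True: rare (4%) and confident (0.8)
```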
{"title":"Efficient Rare Temporal Pattern Mining in Time Series","authors":"Van Ho Long, Nguyen Ho, Trinh Le Cong, Anh-Vu Dinh-Duc, Tu Nguyen Ngoc","doi":"arxiv-2409.05042","DOIUrl":"https://doi.org/arxiv-2409.05042","url":null,"abstract":"Time series data from various domains are increasing continuously. Extracting\u0000and analyzing the temporal patterns in these series can reveal significant\u0000insights. Temporal pattern mining (TPM) extends traditional pattern mining by\u0000incorporating event time intervals into extracted patterns, enhancing their\u0000expressiveness but increasing time and space complexities. One valuable type of\u0000temporal pattern is known as rare temporal patterns (RTPs), which occur rarely\u0000but with high confidence. There exist several challenges when mining rare\u0000temporal patterns. The support measure is set very low, leading to a further\u0000combinatorial explosion and potentially producing too many uninteresting\u0000patterns. Thus, an efficient approach to rare temporal pattern mining is\u0000needed. This paper introduces our Rare Temporal Pattern Mining from Time Series\u0000(RTPMfTS) method for discovering rare temporal patterns, featuring the\u0000following key contributions: (1) An end-to-end RTPMfTS process that takes time\u0000series data as input and yields rare temporal patterns as output. (2) An\u0000efficient Rare Temporal Pattern Mining (RTPM) algorithm that uses optimized\u0000data structures for quick event and pattern retrieval and utilizes effective\u0000pruning techniques for much faster mining. (3) A thorough experimental\u0000evaluation of RTPM, showing that RTPM outperforms the baseline in terms of\u0000runtime and memory usage.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223234","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Graph versioning for evolving urban data
Jey Puget Gil, Emmanuel Coquery, John Samuel, Gilles Gesquiere
The continuous evolution of cities poses significant challenges in terms of managing and understanding their complex dynamics. With the increasing demand for transparency and the growing availability of open urban data, it has become important to ensure the reproducibility of scientific research and computations in urban planning. To understand past decisions and other possible scenarios, we require solutions that go beyond the management of urban knowledge graphs. In this work, we explore existing solutions and their limits and explain the need and possible approaches for querying across multiple graph versions.
{"title":"Graph versioning for evolving urban data","authors":"Jey Puget Gil, Emmanuel Coquery, John Samuel, Gilles Gesquiere","doi":"arxiv-2409.04498","DOIUrl":"https://doi.org/arxiv-2409.04498","url":null,"abstract":"The continuous evolution of cities poses significant challenges in terms of\u0000managing and understanding their complex dynamics. With the increasing demand\u0000for transparency and the growing availability of open urban data, it has become\u0000important to ensure the reproducibility of scientific research and computations\u0000in urban planning. To understand past decisions and other possible scenarios,\u0000we require solutions that go beyond the management of urban knowledge graphs.\u0000In this work, we explore existing solutions and their limits and explain the\u0000need and possible approaches for querying across multiple graph versions.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223424","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ConVer-G: Concurrent versioning of knowledge graphs
Jey Puget Gil, Emmanuel Coquery, John Samuel, Gilles Gesquiere
The multiplication of platforms offering open data has facilitated access to information that can be used for research, innovation, and decision-making. Providing transparency and availability, open data is regularly updated, allowing us to observe its evolution over time. We are particularly interested in the evolution of urban data, which allows stakeholders to better understand dynamics and propose solutions to improve the quality of life of citizens. In this context, we are interested in the management of evolving data, especially urban data, and in the ability to query these data across the available versions. To understand our urban heritage and propose new scenarios, we must be able to search for knowledge across concurrent versions of urban knowledge graphs. In this work, we present the ConVer-G (Concurrent Versioning of knowledge Graphs) system for storing and querying multiple concurrent versions of graphs.
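ConVer-G's storage model is not described in this abstract. Purely to picture what "querying across concurrent versions" involves, the toy sketch below tags each triple with the set of versions in which it holds and answers a query against several concurrent versions at once; all identifiers are invented.

```python
from collections import defaultdict

# (subject, predicate, object) -> set of version ids containing the triple.
store: dict = defaultdict(set)

def add(triple: tuple, version: str) -> None:
    store[triple].add(version)

def query(predicate: str, versions: list) -> dict:
    """Return matching triples per version, for several versions at once."""
    out = {v: [] for v in versions}
    for (s, p, o), vs in store.items():
        if p == predicate:
            for v in vs & set(versions):
                out[v].append((s, p, o))
    return out

add(("building:42", "height", "12m"), "v2023")
add(("building:42", "height", "15m"), "v2024-planA")   # concurrent scenario A
add(("building:42", "height", "12m"), "v2024-planB")   # concurrent scenario B
print(query("height", ["v2024-planA", "v2024-planB"]))
```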
{"title":"ConVer-G: Concurrent versioning of knowledge graphs","authors":"Jey Puget Gil, Emmanuel Coquery, John Samuel, Gilles Gesquiere","doi":"arxiv-2409.04499","DOIUrl":"https://doi.org/arxiv-2409.04499","url":null,"abstract":"The multiplication of platforms offering open data has facilitated access to\u0000information that can be used for research, innovation, and decision-making.\u0000Providing transparency and availability, open data is regularly updated,\u0000allowing us to observe their evolution over time. We are particularly interested in the evolution of urban data that allows\u0000stakeholders to better understand dynamics and propose solutions to improve the\u0000quality of life of citizens. In this context, we are interested in the\u0000management of evolving data, especially urban data and the ability to query\u0000these data across the available versions. In order to have the ability to\u0000understand our urban heritage and propose new scenarios, we must be able to\u0000search for knowledge through concurrent versions of urban knowledge graphs. In this work, we present the ConVer-G (Concurrent Versioning of knowledge\u0000Graphs) system for storage and querying through multiple concurrent versions of\u0000graphs.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223233","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
AnyMatch -- Efficient Zero-Shot Entity Matching with a Small Language Model
Zeyu Zhang, Paul Groth, Iacer Calixto, Sebastian Schelter
Entity matching (EM) is the problem of determining whether two records refer to the same real-world entity, which is crucial in data integration, e.g., for product catalogs or address databases. A major drawback of many EM approaches is their dependence on labelled examples. We thus focus on the challenging setting of zero-shot entity matching, where no labelled examples are available for an unseen target dataset. Recently, large language models (LLMs) have shown promising results for zero-shot EM, but their low throughput and high deployment cost limit their applicability and scalability. We revisit the zero-shot EM problem with AnyMatch, a small language model fine-tuned in a transfer learning setup. We propose several novel data selection techniques to generate fine-tuning data for our model, e.g., by selecting difficult pairs to match via an AutoML filter, by generating additional attribute-level examples, and by controlling label imbalance in the data. We conduct an extensive evaluation of the prediction quality and deployment cost of our model in comparison to thirteen baselines on nine benchmark datasets. We find that AnyMatch provides competitive prediction quality despite its small parameter count: it achieves the second-highest F1 score overall and outperforms several other approaches that employ models with hundreds of billions of parameters. Furthermore, our approach exhibits major cost benefits: the average prediction quality of AnyMatch is within 4.4% of the state-of-the-art method MatchGPT with the proprietary trillion-parameter model GPT-4, yet AnyMatch requires four orders of magnitude fewer parameters and incurs a 3,899 times lower inference cost (in dollars per 1,000 tokens).
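AnyMatch's exact input serialization and fine-tuning recipe are not spelled out in this abstract. As a hedged sketch of the general family it belongs to (entity matching cast as sequence-pair classification with a language model), a record pair can be flattened into one text string that a small fine-tuned classifier scores; the attribute markers below are illustrative assumptions, not the paper's published format.

```python
# Generic sketch of LM-based entity matching: serialize a record pair into
# text and classify match / non-match. Markers and data are invented.

def serialize(record: dict) -> str:
    return " ".join(f"COL {k} VAL {v}" for k, v in record.items())

def serialize_pair(a: dict, b: dict) -> str:
    return f"{serialize(a)} [SEP] {serialize(b)}"

a = {"title": "iPhone 13 128GB Blue", "brand": "Apple", "price": "699"}
b = {"title": "Apple iPhone 13 (128 GB) - blue", "brand": "", "price": "699.00"}

text = serialize_pair(a, b)
print(text)
# A small fine-tuned sequence classifier would then map `text` to a match
# probability; "zero-shot" means the target dataset contributed no labelled
# pairs to that fine-tuning.
```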
{"title":"AnyMatch -- Efficient Zero-Shot Entity Matching with a Small Language Model","authors":"Zeyu Zhang, Paul Groth, Iacer Calixto, Sebastian Schelter","doi":"arxiv-2409.04073","DOIUrl":"https://doi.org/arxiv-2409.04073","url":null,"abstract":"Entity matching (EM) is the problem of determining whether two records refer\u0000to same real-world entity, which is crucial in data integration, e.g., for\u0000product catalogs or address databases. A major drawback of many EM approaches\u0000is their dependence on labelled examples. We thus focus on the challenging\u0000setting of zero-shot entity matching where no labelled examples are available\u0000for an unseen target dataset. Recently, large language models (LLMs) have shown\u0000promising results for zero-shot EM, but their low throughput and high\u0000deployment cost limit their applicability and scalability. We revisit the zero-shot EM problem with AnyMatch, a small language model\u0000fine-tuned in a transfer learning setup. We propose several novel data\u0000selection techniques to generate fine-tuning data for our model, e.g., by\u0000selecting difficult pairs to match via an AutoML filter, by generating\u0000additional attribute-level examples, and by controlling label imbalance in the\u0000data. We conduct an extensive evaluation of the prediction quality and deployment\u0000cost of our model, in a comparison to thirteen baselines on nine benchmark\u0000datasets. We find that AnyMatch provides competitive prediction quality despite\u0000its small parameter size: it achieves the second-highest F1 score overall, and\u0000outperforms several other approaches that employ models with hundreds of\u0000billions of parameters. Furthermore, our approach exhibits major cost benefits:\u0000the average prediction quality of AnyMatch is within 4.4% of the\u0000state-of-the-art method MatchGPT with the proprietary trillion-parameter model\u0000GPT-4, yet AnyMatch requires four orders of magnitude less parameters and\u0000incurs a 3,899 times lower inference cost (in dollars per 1,000 tokens).","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223236","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Towards Privacy-Preserving Relational Data Synthesis via Probabilistic Relational Models
Malte Luttermann, Ralf Möller, Mattis Hartwig
Probabilistic relational models provide a well-established formalism for combining first-order logic and probabilistic models, thereby allowing relationships between objects in a relational domain to be represented. At the same time, the field of artificial intelligence requires increasingly large amounts of relational training data for various machine learning tasks. Collecting real-world data, however, is often challenging due to privacy concerns, data protection regulations, high costs, and so on. To mitigate these challenges, the generation of synthetic data is a promising approach. In this paper, we solve the problem of generating synthetic relational data via probabilistic relational models. In particular, we propose a fully fledged pipeline to go from a relational database to a probabilistic relational model, which can then be used to sample new synthetic relational data points from its underlying probability distribution. As part of our proposed pipeline, we introduce a learning algorithm to construct a probabilistic relational model from a given relational database.
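The paper's pipeline learns a probabilistic relational model from a relational database and samples synthetic data from it. The sketch below is a drastically simplified stand-in (a count-based categorical model over a single table) meant only to illustrate the learn-then-sample shape of such a pipeline; it does not capture the first-order, cross-relation structure that probabilistic relational models provide, and all data and names are invented.

```python
import random
from collections import Counter

# A tiny "relational" table: (person, department).
rows = [("alice", "cs"), ("bob", "cs"), ("carol", "math"), ("dave", "cs")]

# "Learning": estimate the distribution of the department attribute by counting.
counts = Counter(dept for _, dept in rows)
total = sum(counts.values())
dist = {dept: c / total for dept, c in counts.items()}

# "Sampling": draw synthetic records from the learned distribution, so no
# real person identifiers appear in the output.
def sample_row(i: int) -> tuple:
    dept = random.choices(list(dist), weights=list(dist.values()))[0]
    return (f"synthetic_person_{i}", dept)

synthetic = [sample_row(i) for i in range(5)]
print(synthetic)
```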
{"title":"Towards Privacy-Preserving Relational Data Synthesis via Probabilistic Relational Models","authors":"Malte Luttermann, Ralf Möller, Mattis Hartwig","doi":"arxiv-2409.04194","DOIUrl":"https://doi.org/arxiv-2409.04194","url":null,"abstract":"Probabilistic relational models provide a well-established formalism to\u0000combine first-order logic and probabilistic models, thereby allowing to\u0000represent relationships between objects in a relational domain. At the same\u0000time, the field of artificial intelligence requires increasingly large amounts\u0000of relational training data for various machine learning tasks. Collecting\u0000real-world data, however, is often challenging due to privacy concerns, data\u0000protection regulations, high costs, and so on. To mitigate these challenges,\u0000the generation of synthetic data is a promising approach. In this paper, we\u0000solve the problem of generating synthetic relational data via probabilistic\u0000relational models. In particular, we propose a fully-fledged pipeline to go\u0000from relational database to probabilistic relational model, which can then be\u0000used to sample new synthetic relational data points from its underlying\u0000probability distribution. As part of our proposed pipeline, we introduce a\u0000learning algorithm to construct a probabilistic relational model from a given\u0000relational database.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223235","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yihang Zheng, Bo Li, Zhenghao Lin, Yi Luo, Xuanhe Zhou, Chen Lin, Jinsong Su, Guoliang Li, Shifu Li
The development of Large Language Models (LLMs) has revolutionized Q&A across various industries, including the database domain. However, there is still a lack of a comprehensive benchmark for evaluating the capabilities of different LLMs and their modular components in database Q&A. To this end, we introduce DQA, the first comprehensive database Q&A benchmark. DQA features an innovative LLM-based method for automating the generation, cleaning, and rewriting of database Q&A, resulting in over 240,000 Q&A pairs in English and Chinese. These Q&A pairs cover nearly all aspects of database knowledge, including database manuals, database blogs, and database tools. This inclusion allows for additional assessment of LLMs' Retrieval-Augmented Generation (RAG) and Tool Invocation Generation (TIG) capabilities in the database Q&A task. Furthermore, we propose a comprehensive LLM-based database Q&A testbed on DQA. This testbed is highly modular and scalable, with both basic and advanced components such as Question Classification Routing (QCR), RAG, TIG, and Prompt Template Engineering (PTE). In addition, DQA provides a complete evaluation pipeline, featuring diverse metrics and a standardized evaluation process to ensure comprehensiveness, accuracy, and fairness. We use DQA to comprehensively evaluate database Q&A capabilities under the proposed testbed. The evaluation reveals findings such as (i) the strengths and limitations of nine different LLM-based Q&A bots and (ii) the performance impact and potential improvements of various service components (e.g., QCR, RAG, TIG). We hope our benchmark and findings will better guide the future development of LLM-based database Q&A systems.
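The abstract names the testbed's modules (QCR, RAG, TIG, PTE) but not their interfaces. The sketch below shows one hypothetical way such a modular pipeline could be wired together; every function, heuristic, and string in it is invented for illustration.

```python
from typing import Callable

def question_classification_routing(question: str) -> str:
    """QCR: decide which capability a question needs (invented heuristic)."""
    q = question.lower()
    if "how do i" in q or "configure" in q:
        return "rag"      # likely answered from manuals/blogs
    if "run" in q or "diagnose" in q:
        return "tig"      # likely needs a database tool invocation
    return "direct"

def rag_answer(question: str) -> str:
    return f"[RAG] answer grounded in retrieved docs for: {question}"

def tig_answer(question: str) -> str:
    return f"[TIG] answer produced by invoking a database tool for: {question}"

def direct_answer(question: str) -> str:
    return f"[LLM] direct answer for: {question}"

ROUTES: dict = {"rag": rag_answer, "tig": tig_answer, "direct": direct_answer}

def answer(question: str) -> str:
    # Prompt Template Engineering (PTE) would shape the prompts used inside
    # each branch; it is omitted from this sketch.
    return ROUTES[question_classification_routing(question)](question)

print(answer("How do I configure max_connections in PostgreSQL?"))
```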