High Throughput Shortest Distance Query Processing on Large Dynamic Road Networks
Xinjie Zhou, Mengxuan Zhang, Lei Li, Xiaofang Zhou
Shortest path (SP) computation is the building block for many location-based services, and achieving high-throughput SP query processing is an essential goal for the real-time response of those services. However, the large number of queries submitted in large-scale dynamic road networks still poses challenges to this goal. Therefore, in this work, we propose a novel framework aiming to process SP queries with high throughput in large and dynamic road networks by leveraging the Partitioned Shortest Path (PSP) index. Specifically, we first put forward a cross-boundary strategy to accelerate the query processing of the PSP index and analyze its efficiency upper bound by identifying the curse of PSP index query efficiency. After that, we propose a non-trivial Partitioned Multi-stage Hub Labeling (PMHL) index that utilizes multiple PSP strategies and thread parallelization to achieve consecutive query-efficiency improvements and fast index maintenance. Finally, to further increase query throughput, we design tree-decomposition-based graph partitioning and propose Post-partitioned Multi-stage Hub Labeling (PostMHL), with faster query processing and index updates than PMHL. Experiments on real-world road networks show that our methods outperform state-of-the-art baselines in query throughput, yielding improvements of one to four orders of magnitude.
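PMHL and PostMHL build on hub labeling, where each vertex stores distances to a small set of hub vertices and a distance query only scans the two endpoint labels. As a rough, hedged illustration of that query primitive alone (not the paper's partitioned index; the labels and distances below are made up), a 2-hop label lookup in Python looks like this:

```python
from typing import Dict

INF = float("inf")

# Hypothetical 2-hop labels: for each vertex, a map {hub: distance to hub}.
labels: Dict[str, Dict[str, float]] = {
    "s": {"s": 0, "h1": 2, "h2": 5},
    "t": {"t": 0, "h1": 4, "h2": 1},
}

def hub_label_query(u: str, v: str) -> float:
    """Distance via the best common hub; exact when the labeling
    satisfies the 2-hop cover property."""
    lu, lv = labels[u], labels[v]
    if len(lu) > len(lv):          # scan the smaller label, probe the larger
        lu, lv = lv, lu
    best = INF
    for hub, d_u in lu.items():
        d_v = lv.get(hub)
        if d_v is not None:
            best = min(best, d_u + d_v)
    return best

print(hub_label_query("s", "t"))   # 6, via h1 (2+4) or h2 (5+1)
```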
{"title":"High Throughput Shortest Distance Query Processing on Large Dynamic Road Networks","authors":"Xinjie Zhou, Mengxuan Zhang, Lei Li, Xiaofang Zhou","doi":"arxiv-2409.06148","DOIUrl":"https://doi.org/arxiv-2409.06148","url":null,"abstract":"Shortest path (SP) computation is the building block for many location-based\u0000services, and achieving high throughput SP query processing is an essential\u0000goal for the real-time response of those services. However, the large number of\u0000queries submitted in large-scale dynamic road networks still poses challenges\u0000to this goal. Therefore, in this work, we propose a novel framework aiming to\u0000process SP queries with high throughput in large and dynamic road networks, by\u0000leveraging the Partitioned Shortest Path (PSP) index. Specifically, we first\u0000put forward a cross-boundary strategy to accelerate the query processing of PSP\u0000index and analyze its efficiency upper-bound by discovering the curse of PSP\u0000index query efficiency. After that, we propose a non-trivial Partitioned\u0000Multi-stage Hub Labeling (PMHL) that utilizes multiple PSP strategies and\u0000thread parallelization to achieve consecutive query efficiency improvement and\u0000fast index maintenance. Finally, to further increase query throughput, we\u0000design tree decomposition-based graph partitioning and propose Post-partitioned\u0000Multi-stage Hub Labeling (PostMHL) with faster query processing and index\u0000update than PMHL. Experiments on real-world road networks show that our methods\u0000outperform state-of-the-art baselines in query throughput, yielding up to 1-4\u0000orders of magnitude improvement.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223231","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A System and Benchmark for LLM-based Q&A on Heterogeneous Data
Achille Fokoue, Srideepika Jayaraman, Elham Khabiri, Jeffrey O. Kephart, Yingjie Li, Dhruv Shah, Youssef Drissi, Fenno F. Heath III, Anu Bhamidipaty, Fateh A. Tipu, Robert J. Baseman
In many industrial settings, users wish to ask questions whose answers may be found in structured data sources such as spreadsheets, databases, APIs, or combinations thereof. Often, the user does not know how to identify or access the right data source. This problem is compounded even further if multiple (and potentially siloed) data sources must be assembled to derive the answer. Recently, various Text-to-SQL applications that leverage Large Language Models (LLMs) have addressed some of these problems by enabling users to ask questions in natural language. However, these applications remain impractical in realistic industrial settings because they fail to cope with the data source heterogeneity that typifies such environments. In this paper, we address heterogeneity by introducing the siwarex platform, which enables seamless natural language access to both databases and APIs. To demonstrate the effectiveness of siwarex, we extend the popular Spider dataset and benchmark by replacing some of its tables with data retrieval APIs. We find that siwarex copes well with data source heterogeneity. Our modified Spider benchmark will soon be available to the research community.
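The abstract does not describe siwarex's internals, but the benchmark change it mentions (tables replaced by data retrieval APIs) can be pictured as the same logical data being reachable through two very different access paths. The sketch below is purely hypothetical: the schema, endpoint, and function names are invented to illustrate the heterogeneity a Q&A system must route across.

```python
import json
import sqlite3
from urllib.parse import urlencode
from urllib.request import urlopen

def singers_from_db(conn: sqlite3.Connection, country: str) -> list:
    # SQL access path over a relational table (schema invented for the sketch).
    cur = conn.execute("SELECT name, age FROM singer WHERE country = ?", (country,))
    return [{"name": name, "age": age} for name, age in cur.fetchall()]

def singers_from_api(base_url: str, country: str) -> list:
    # The same logical data, now only reachable through a REST-style call.
    with urlopen(f"{base_url}/singers?{urlencode({'country': country})}") as resp:
        return json.loads(resp.read())

# A heterogeneity-aware Q&A system must pick the right accessor per question;
# the modified Spider benchmark exercises exactly this kind of routing.
```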
{"title":"A System and Benchmark for LLM-based Q&A on Heterogeneous Data","authors":"Achille Fokoue, Srideepika Jayaraman, Elham Khabiri, Jeffrey O. Kephart, Yingjie Li, Dhruv Shah, Youssef Drissi, Fenno F. Heath III, Anu Bhamidipaty, Fateh A. Tipu, Robert J. Baseman","doi":"arxiv-2409.05735","DOIUrl":"https://doi.org/arxiv-2409.05735","url":null,"abstract":"In many industrial settings, users wish to ask questions whose answers may be\u0000found in structured data sources such as a spreadsheets, databases, APIs, or\u0000combinations thereof. Often, the user doesn't know how to identify or access\u0000the right data source. This problem is compounded even further if multiple (and\u0000potentially siloed) data sources must be assembled to derive the answer.\u0000Recently, various Text-to-SQL applications that leverage Large Language Models\u0000(LLMs) have addressed some of these problems by enabling users to ask questions\u0000in natural language. However, these applications remain impractical in\u0000realistic industrial settings because they fail to cope with the data source\u0000heterogeneity that typifies such environments. In this paper, we address\u0000heterogeneity by introducing the siwarex platform, which enables seamless\u0000natural language access to both databases and APIs. To demonstrate the\u0000effectiveness of siwarex, we extend the popular Spider dataset and benchmark by\u0000replacing some of its tables by data retrieval APIs. We find that siwarex does\u0000a good job of coping with data source heterogeneity. Our modified Spider\u0000benchmark will soon be available to the research community","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223232","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Efficient Rare Temporal Pattern Mining in Time Series
Van Ho Long, Nguyen Ho, Trinh Le Cong, Anh-Vu Dinh-Duc, Tu Nguyen Ngoc
Time series data from various domains are increasing continuously. Extracting and analyzing the temporal patterns in these series can reveal significant insights. Temporal pattern mining (TPM) extends traditional pattern mining by incorporating event time intervals into extracted patterns, enhancing their expressiveness but increasing time and space complexity. One valuable type of temporal pattern is the rare temporal pattern (RTP), which occurs rarely but with high confidence. Mining rare temporal patterns poses several challenges: the support threshold must be set very low, which further aggravates the combinatorial explosion and can yield many uninteresting patterns. Thus, an efficient approach to rare temporal pattern mining is needed. This paper introduces our Rare Temporal Pattern Mining from Time Series (RTPMfTS) method for discovering rare temporal patterns, featuring the following key contributions: (1) an end-to-end RTPMfTS process that takes time series data as input and yields rare temporal patterns as output; (2) an efficient Rare Temporal Pattern Mining (RTPM) algorithm that uses optimized data structures for quick event and pattern retrieval and effective pruning techniques for much faster mining; and (3) a thorough experimental evaluation of RTPM, showing that it outperforms the baseline in terms of runtime and memory usage.
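The paper's formal definitions of support and confidence for temporal patterns are not given in this abstract. As a hedged sketch of the "rare but confident" selection idea only (the thresholds and the confidence formula below are assumptions, not the paper's definitions), the filtering criterion can be pictured as:

```python
# A pattern is kept if it occurs in few sequences (rare) yet, whenever its
# prefix occurs, the full pattern tends to follow (confident).

def support(pattern_occ: int, num_sequences: int) -> float:
    return pattern_occ / num_sequences

def is_rare_temporal_pattern(pattern_occ: int, prefix_occ: int,
                             num_sequences: int,
                             max_sup: float = 0.05,
                             min_conf: float = 0.8) -> bool:
    sup = support(pattern_occ, num_sequences)
    conf = pattern_occ / prefix_occ if prefix_occ else 0.0
    return sup <= max_sup and conf >= min_conf

# Example: a pattern seen in 4 of 100 sequences, whose prefix occurs in 5.
print(is_rare_temporal_pattern(4, 5, 100))  # True: rare (4%) and confident (0.8)
```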
{"title":"Efficient Rare Temporal Pattern Mining in Time Series","authors":"Van Ho Long, Nguyen Ho, Trinh Le Cong, Anh-Vu Dinh-Duc, Tu Nguyen Ngoc","doi":"arxiv-2409.05042","DOIUrl":"https://doi.org/arxiv-2409.05042","url":null,"abstract":"Time series data from various domains are increasing continuously. Extracting\u0000and analyzing the temporal patterns in these series can reveal significant\u0000insights. Temporal pattern mining (TPM) extends traditional pattern mining by\u0000incorporating event time intervals into extracted patterns, enhancing their\u0000expressiveness but increasing time and space complexities. One valuable type of\u0000temporal pattern is known as rare temporal patterns (RTPs), which occur rarely\u0000but with high confidence. There exist several challenges when mining rare\u0000temporal patterns. The support measure is set very low, leading to a further\u0000combinatorial explosion and potentially producing too many uninteresting\u0000patterns. Thus, an efficient approach to rare temporal pattern mining is\u0000needed. This paper introduces our Rare Temporal Pattern Mining from Time Series\u0000(RTPMfTS) method for discovering rare temporal patterns, featuring the\u0000following key contributions: (1) An end-to-end RTPMfTS process that takes time\u0000series data as input and yields rare temporal patterns as output. (2) An\u0000efficient Rare Temporal Pattern Mining (RTPM) algorithm that uses optimized\u0000data structures for quick event and pattern retrieval and utilizes effective\u0000pruning techniques for much faster mining. (3) A thorough experimental\u0000evaluation of RTPM, showing that RTPM outperforms the baseline in terms of\u0000runtime and memory usage.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223234","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Graph versioning for evolving urban data
Jey Puget Gil, Emmanuel Coquery, John Samuel, Gilles Gesquiere
The continuous evolution of cities poses significant challenges in terms of managing and understanding their complex dynamics. With the increasing demand for transparency and the growing availability of open urban data, it has become important to ensure the reproducibility of scientific research and computations in urban planning. To understand past decisions and other possible scenarios, we require solutions that go beyond the management of urban knowledge graphs. In this work, we explore existing solutions and their limits and explain the need and possible approaches for querying across multiple graph versions.
{"title":"Graph versioning for evolving urban data","authors":"Jey Puget Gil, Emmanuel Coquery, John Samuel, Gilles Gesquiere","doi":"arxiv-2409.04498","DOIUrl":"https://doi.org/arxiv-2409.04498","url":null,"abstract":"The continuous evolution of cities poses significant challenges in terms of\u0000managing and understanding their complex dynamics. With the increasing demand\u0000for transparency and the growing availability of open urban data, it has become\u0000important to ensure the reproducibility of scientific research and computations\u0000in urban planning. To understand past decisions and other possible scenarios,\u0000we require solutions that go beyond the management of urban knowledge graphs.\u0000In this work, we explore existing solutions and their limits and explain the\u0000need and possible approaches for querying across multiple graph versions.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223424","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ConVer-G: Concurrent versioning of knowledge graphs
Jey Puget Gil, Emmanuel Coquery, John Samuel, Gilles Gesquiere
The multiplication of platforms offering open data has facilitated access to information that can be used for research, innovation, and decision-making. Providing transparency and availability, open data is regularly updated, allowing us to observe its evolution over time. We are particularly interested in the evolution of urban data, which allows stakeholders to better understand dynamics and propose solutions to improve the quality of life of citizens. In this context, we are interested in the management of evolving data, especially urban data, and in the ability to query these data across the available versions. To understand our urban heritage and propose new scenarios, we must be able to search for knowledge across concurrent versions of urban knowledge graphs. In this work, we present the ConVer-G (Concurrent Versioning of knowledge Graphs) system for storing and querying multiple concurrent versions of graphs.
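ConVer-G's storage model is not described in this abstract. Purely to picture what "querying across concurrent versions" involves, the toy sketch below tags each triple with the set of versions in which it holds and answers a query against several concurrent versions at once; all identifiers are invented.

```python
from collections import defaultdict

# (subject, predicate, object) -> set of version ids containing the triple.
store: dict = defaultdict(set)

def add(triple: tuple, version: str) -> None:
    store[triple].add(version)

def query(predicate: str, versions: list) -> dict:
    """Return matching triples per version, for several versions at once."""
    out = {v: [] for v in versions}
    for (s, p, o), vs in store.items():
        if p == predicate:
            for v in vs & set(versions):
                out[v].append((s, p, o))
    return out

add(("building:42", "height", "12m"), "v2023")
add(("building:42", "height", "15m"), "v2024-planA")   # concurrent scenario A
add(("building:42", "height", "12m"), "v2024-planB")   # concurrent scenario B
print(query("height", ["v2024-planA", "v2024-planB"]))
```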
{"title":"ConVer-G: Concurrent versioning of knowledge graphs","authors":"Jey Puget Gil, Emmanuel Coquery, John Samuel, Gilles Gesquiere","doi":"arxiv-2409.04499","DOIUrl":"https://doi.org/arxiv-2409.04499","url":null,"abstract":"The multiplication of platforms offering open data has facilitated access to\u0000information that can be used for research, innovation, and decision-making.\u0000Providing transparency and availability, open data is regularly updated,\u0000allowing us to observe their evolution over time. We are particularly interested in the evolution of urban data that allows\u0000stakeholders to better understand dynamics and propose solutions to improve the\u0000quality of life of citizens. In this context, we are interested in the\u0000management of evolving data, especially urban data and the ability to query\u0000these data across the available versions. In order to have the ability to\u0000understand our urban heritage and propose new scenarios, we must be able to\u0000search for knowledge through concurrent versions of urban knowledge graphs. In this work, we present the ConVer-G (Concurrent Versioning of knowledge\u0000Graphs) system for storage and querying through multiple concurrent versions of\u0000graphs.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223233","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
AnyMatch -- Efficient Zero-Shot Entity Matching with a Small Language Model
Zeyu Zhang, Paul Groth, Iacer Calixto, Sebastian Schelter
Entity matching (EM) is the problem of determining whether two records refer to the same real-world entity, which is crucial in data integration, e.g., for product catalogs or address databases. A major drawback of many EM approaches is their dependence on labelled examples. We thus focus on the challenging setting of zero-shot entity matching, where no labelled examples are available for an unseen target dataset. Recently, large language models (LLMs) have shown promising results for zero-shot EM, but their low throughput and high deployment cost limit their applicability and scalability. We revisit the zero-shot EM problem with AnyMatch, a small language model fine-tuned in a transfer learning setup. We propose several novel data selection techniques to generate fine-tuning data for our model, e.g., by selecting difficult pairs to match via an AutoML filter, by generating additional attribute-level examples, and by controlling label imbalance in the data. We conduct an extensive evaluation of the prediction quality and deployment cost of our model in comparison to thirteen baselines on nine benchmark datasets. We find that AnyMatch provides competitive prediction quality despite its small parameter count: it achieves the second-highest F1 score overall and outperforms several other approaches that employ models with hundreds of billions of parameters. Furthermore, our approach exhibits major cost benefits: the average prediction quality of AnyMatch is within 4.4% of the state-of-the-art method MatchGPT with the proprietary trillion-parameter model GPT-4, yet AnyMatch requires four orders of magnitude fewer parameters and incurs a 3,899 times lower inference cost (in dollars per 1,000 tokens).
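AnyMatch's exact input serialization and fine-tuning recipe are not spelled out in this abstract. As a hedged sketch of the general family it belongs to (entity matching cast as sequence-pair classification with a language model), a record pair can be flattened into one text string that a small fine-tuned classifier scores; the attribute markers below are illustrative assumptions, not the paper's published format.

```python
# Generic sketch of LM-based entity matching: serialize a record pair into
# text and classify match / non-match. Markers and data are invented.

def serialize(record: dict) -> str:
    return " ".join(f"COL {k} VAL {v}" for k, v in record.items())

def serialize_pair(a: dict, b: dict) -> str:
    return f"{serialize(a)} [SEP] {serialize(b)}"

a = {"title": "iPhone 13 128GB Blue", "brand": "Apple", "price": "699"}
b = {"title": "Apple iPhone 13 (128 GB) - blue", "brand": "", "price": "699.00"}

text = serialize_pair(a, b)
print(text)
# A small fine-tuned sequence classifier would then map `text` to a match
# probability; "zero-shot" means the target dataset contributed no labelled
# pairs to that fine-tuning.
```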
{"title":"AnyMatch -- Efficient Zero-Shot Entity Matching with a Small Language Model","authors":"Zeyu Zhang, Paul Groth, Iacer Calixto, Sebastian Schelter","doi":"arxiv-2409.04073","DOIUrl":"https://doi.org/arxiv-2409.04073","url":null,"abstract":"Entity matching (EM) is the problem of determining whether two records refer\u0000to same real-world entity, which is crucial in data integration, e.g., for\u0000product catalogs or address databases. A major drawback of many EM approaches\u0000is their dependence on labelled examples. We thus focus on the challenging\u0000setting of zero-shot entity matching where no labelled examples are available\u0000for an unseen target dataset. Recently, large language models (LLMs) have shown\u0000promising results for zero-shot EM, but their low throughput and high\u0000deployment cost limit their applicability and scalability. We revisit the zero-shot EM problem with AnyMatch, a small language model\u0000fine-tuned in a transfer learning setup. We propose several novel data\u0000selection techniques to generate fine-tuning data for our model, e.g., by\u0000selecting difficult pairs to match via an AutoML filter, by generating\u0000additional attribute-level examples, and by controlling label imbalance in the\u0000data. We conduct an extensive evaluation of the prediction quality and deployment\u0000cost of our model, in a comparison to thirteen baselines on nine benchmark\u0000datasets. We find that AnyMatch provides competitive prediction quality despite\u0000its small parameter size: it achieves the second-highest F1 score overall, and\u0000outperforms several other approaches that employ models with hundreds of\u0000billions of parameters. Furthermore, our approach exhibits major cost benefits:\u0000the average prediction quality of AnyMatch is within 4.4% of the\u0000state-of-the-art method MatchGPT with the proprietary trillion-parameter model\u0000GPT-4, yet AnyMatch requires four orders of magnitude less parameters and\u0000incurs a 3,899 times lower inference cost (in dollars per 1,000 tokens).","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223236","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Towards Privacy-Preserving Relational Data Synthesis via Probabilistic Relational Models
Malte Luttermann, Ralf Möller, Mattis Hartwig
Probabilistic relational models provide a well-established formalism for combining first-order logic and probabilistic models, thereby allowing relationships between objects in a relational domain to be represented. At the same time, the field of artificial intelligence requires increasingly large amounts of relational training data for various machine learning tasks. Collecting real-world data, however, is often challenging due to privacy concerns, data protection regulations, high costs, and so on. To mitigate these challenges, the generation of synthetic data is a promising approach. In this paper, we solve the problem of generating synthetic relational data via probabilistic relational models. In particular, we propose a fully fledged pipeline to go from a relational database to a probabilistic relational model, which can then be used to sample new synthetic relational data points from its underlying probability distribution. As part of our proposed pipeline, we introduce a learning algorithm to construct a probabilistic relational model from a given relational database.
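The paper's pipeline learns a probabilistic relational model from a relational database and samples synthetic data from it. The sketch below is a drastically simplified stand-in (a count-based categorical model over a single table) meant only to illustrate the learn-then-sample shape of such a pipeline; it does not capture the first-order, cross-relation structure that probabilistic relational models provide, and all data and names are invented.

```python
import random
from collections import Counter

# A tiny "relational" table: (person, department).
rows = [("alice", "cs"), ("bob", "cs"), ("carol", "math"), ("dave", "cs")]

# "Learning": estimate the distribution of the department attribute by counting.
counts = Counter(dept for _, dept in rows)
total = sum(counts.values())
dist = {dept: c / total for dept, c in counts.items()}

# "Sampling": draw synthetic records from the learned distribution, so no
# real person identifiers appear in the output.
def sample_row(i: int) -> tuple:
    dept = random.choices(list(dist), weights=list(dist.values()))[0]
    return (f"synthetic_person_{i}", dept)

synthetic = [sample_row(i) for i in range(5)]
print(synthetic)
```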
{"title":"Towards Privacy-Preserving Relational Data Synthesis via Probabilistic Relational Models","authors":"Malte Luttermann, Ralf Möller, Mattis Hartwig","doi":"arxiv-2409.04194","DOIUrl":"https://doi.org/arxiv-2409.04194","url":null,"abstract":"Probabilistic relational models provide a well-established formalism to\u0000combine first-order logic and probabilistic models, thereby allowing to\u0000represent relationships between objects in a relational domain. At the same\u0000time, the field of artificial intelligence requires increasingly large amounts\u0000of relational training data for various machine learning tasks. Collecting\u0000real-world data, however, is often challenging due to privacy concerns, data\u0000protection regulations, high costs, and so on. To mitigate these challenges,\u0000the generation of synthetic data is a promising approach. In this paper, we\u0000solve the problem of generating synthetic relational data via probabilistic\u0000relational models. In particular, we propose a fully-fledged pipeline to go\u0000from relational database to probabilistic relational model, which can then be\u0000used to sample new synthetic relational data points from its underlying\u0000probability distribution. As part of our proposed pipeline, we introduce a\u0000learning algorithm to construct a probabilistic relational model from a given\u0000relational database.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223235","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yihang Zheng, Bo Li, Zhenghao Lin, Yi Luo, Xuanhe Zhou, Chen Lin, Jinsong Su, Guoliang Li, Shifu Li
The development of Large Language Models (LLMs) has revolutionized Q&A across various industries, including the database domain. However, there is still a lack of a comprehensive benchmark for evaluating the capabilities of different LLMs and their modular components in database Q&A. To this end, we introduce DQA, the first comprehensive database Q&A benchmark. DQA features an innovative LLM-based method for automating the generation, cleaning, and rewriting of database Q&A, resulting in over 240,000 Q&A pairs in English and Chinese. These Q&A pairs cover nearly all aspects of database knowledge, including database manuals, database blogs, and database tools. This inclusion allows for additional assessment of LLMs' Retrieval-Augmented Generation (RAG) and Tool Invocation Generation (TIG) capabilities in the database Q&A task. Furthermore, we propose a comprehensive LLM-based database Q&A testbed on DQA. This testbed is highly modular and scalable, with both basic and advanced components such as Question Classification Routing (QCR), RAG, TIG, and Prompt Template Engineering (PTE). In addition, DQA provides a complete evaluation pipeline, featuring diverse metrics and a standardized evaluation process to ensure comprehensiveness, accuracy, and fairness. We use DQA to comprehensively evaluate database Q&A capabilities under the proposed testbed. The evaluation reveals findings such as (i) the strengths and limitations of nine different LLM-based Q&A bots and (ii) the performance impact and potential improvements of various service components (e.g., QCR, RAG, TIG). We hope our benchmark and findings will better guide the future development of LLM-based database Q&A systems.
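The abstract names the testbed's modules (QCR, RAG, TIG, PTE) but not their interfaces. The sketch below shows one hypothetical way such a modular pipeline could be wired together; every function, heuristic, and string in it is invented for illustration.

```python
from typing import Callable

def question_classification_routing(question: str) -> str:
    """QCR: decide which capability a question needs (invented heuristic)."""
    q = question.lower()
    if "how do i" in q or "configure" in q:
        return "rag"      # likely answered from manuals/blogs
    if "run" in q or "diagnose" in q:
        return "tig"      # likely needs a database tool invocation
    return "direct"

def rag_answer(question: str) -> str:
    return f"[RAG] answer grounded in retrieved docs for: {question}"

def tig_answer(question: str) -> str:
    return f"[TIG] answer produced by invoking a database tool for: {question}"

def direct_answer(question: str) -> str:
    return f"[LLM] direct answer for: {question}"

ROUTES: dict = {"rag": rag_answer, "tig": tig_answer, "direct": direct_answer}

def answer(question: str) -> str:
    # Prompt Template Engineering (PTE) would shape the prompts used inside
    # each branch; it is omitted from this sketch.
    return ROUTES[question_classification_routing(question)](question)

print(answer("How do I configure max_connections in PostgreSQL?"))
```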