Updateable Data-Driven Cardinality Estimator with Bounded Q-error
Yingze Li, Xianglong Liu, Hongzhi Wang, Kaixin Zhang, Zixuan Wang
arXiv:2408.17209 (2024-08-30)
Modern cardinality estimators struggle with data updates. This research tackles this challenge in the single-table setting. We introduce ICE, an Index-based Cardinality Estimator, the first data-driven estimator that enables instant, tuple-level updates. ICE draws two key lessons from multidimensional indexes and applies them to cardinality estimation in dynamic scenarios: (1) an index can be trained swiftly and updated seamlessly over vast multidimensional data, and (2) an index captures the precise data distribution, staying synchronized with the latest database version. These insights let the index serve as a highly accurate, data-driven model that rapidly adapts to data updates and is resilient to out-of-distribution challenges during query testing. To make a single index support cardinality estimation, we craft algorithms for training, updating, and estimation, and analyze the estimator's unbiasedness and variance. Extensive experiments demonstrate the superiority of ICE: it offers precise estimates and fast updates/construction across diverse workloads. Compared to state-of-the-art real-time query-driven models, ICE is 2-3 orders of magnitude more accurate, updates 4.7-6.9 times faster, and trains up to 1-3 orders of magnitude faster.
{"title":"Updateable Data-Driven Cardinality Estimator with Bounded Q-error","authors":"Yingze Li, Xianglong Liu, Hongzhi Wang, Kaixin Zhang, Zixuan Wang","doi":"arxiv-2408.17209","DOIUrl":"https://doi.org/arxiv-2408.17209","url":null,"abstract":"Modern Cardinality Estimators struggle with data updates. This research\u0000tackles this challenge within single-table. We introduce ICE, an Index-based\u0000Cardinality Estimator, the first data-driven estimator that enables instant,\u0000tuple-leveled updates. ICE has learned two key lessons from the multidimensional index and applied\u0000them to solve cardinality estimation in dynamic scenarios: (1) Index possesses\u0000the capability for swift training and seamless updating amidst vast\u0000multidimensional data. (2) Index offers precise data distribution, staying\u0000synchronized with the latest database version. These insights endow the index\u0000with the ability to be a highly accurate, data-driven model that rapidly adapts\u0000to data updates and is resilient to out-of-distribution challenges during query\u0000testing. To make a solitary index support cardinality estimation, we have\u0000crafted sophisticated algorithms for training, updating, and estimating,\u0000analyzing unbiasedness and variance. Extensive experiments demonstrate the superiority of ICE. ICE offers precise\u0000estimations and fast updates/construction across diverse workloads. Compared to\u0000state-of-the-art real-time query-driven models, ICE boasts superior accuracy\u0000(2-3 orders of magnitude more precise), faster updates (4.7-6.9 times faster),\u0000and significantly reduced training time (up to 1-3 orders of magnitude faster).","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-08-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223245","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
CollectionLocator Level 1: Metadata-Based Search for Collections in Federated Biobanks
Volodymyr A. Shekhovtsov, Bence Slajcho, Aron Sacherer, Johann Eder
arXiv:2408.16422 (2024-08-29)
Biobanks are indispensable resources for medical research, collecting biological material and associated data and making them available for research projects and medical studies. To that end, biobank data must meet certain criteria, which can be formulated as adherence to the FAIR (findable, accessible, interoperable, and reusable) principles. We developed a tool, CollectionLocator, which aims to increase the FAIR compliance of biobank data by supporting researchers in identifying which biobank and which collection are likely to contain cases (material and data) satisfying the requirements of a given research project when detailed sample data is not available due to privacy restrictions. CollectionLocator is based on an ontology-based metadata model that addresses the enormous heterogeneities and ensures the privacy of the donors of the biological samples and data. Furthermore, CollectionLocator represents the data and metadata quality of the collections so that the quality requirements of the requester can be matched against the quality of the available data. The concept is evaluated with a proof-of-concept implementation.
{"title":"CollectionLocator Level 1: Metadata-Based Search for Collections in Federated Biobanks","authors":"Volodymyr A. Shekhovtsov, Bence Slajcho, Aron Sacherer, Johann Eder","doi":"arxiv-2408.16422","DOIUrl":"https://doi.org/arxiv-2408.16422","url":null,"abstract":"Biobanks are indispensable resources for medical research collecting\u0000biological material and associated data and making them available for research\u0000projects and medical studies. For that, the biobank data has to meet certain\u0000criteria which can be formulated as adherence to the FAIR (findable,\u0000accessible, interoperable and reusable) principles. We developed a tool, CollectionLocator, which aims at increasing the FAIR\u0000compliance of biobank data by supporting researchers in identifying which\u0000biobank and which collection are likely to contain cases (material and data)\u0000satisfying the requirements of a defined research project when the detailed\u0000sample data is not available due to privacy restrictions. The CollectionLocator\u0000is based on an ontology-based metadata model to address the enormous\u0000heterogeneities and ensure the privacy of the donors of the biological samples\u0000and the data. Furthermore, the CollectionLocator represents the data and\u0000metadata quality of the collections such that the quality requirements of the\u0000requester can be matched with the quality of the available data. The concept of\u0000CollectionLocator is evaluated with a proof-of-concept implementation.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223252","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MQRLD: A Multimodal Data Retrieval Platform with Query-aware Feature Representation and Learned Index Based on Data Lake
Ming Sheng, Shuliang Wang, Yong Zhang, Kaige Wang, Jingyi Wang, Yi Luo, Rui Hao
arXiv:2408.16237 (2024-08-29)
Multimodal data has become a crucial element of big data analytics, driving advances in data exploration and data mining and empowering artificial intelligence applications. To support high-quality retrieval for these cutting-edge applications, a robust data retrieval platform should provide transparent data storage, rich hybrid queries, effective feature representation, and high query efficiency. However, the existing platforms that are the primary options for multimodal data retrieval (traditional schema-on-write systems, multi-model databases, vector databases, and data lakes) struggle to fulfill these requirements simultaneously. There is therefore an urgent need for a more versatile multimodal data retrieval platform. In this paper, we introduce MQRLD, a Multimodal Data Retrieval Platform with Query-aware Feature Representation and Learned Index based on a Data Lake. It leverages the transparent storage capabilities of data lakes, integrates a multimodal open API to provide a unified interface supporting rich hybrid queries, introduces a query-aware multimodal feature representation strategy to obtain effective features, and offers high-dimensional learned indexes to optimize data queries. We conduct a comparative analysis of the query performance of MQRLD against other methods on rich hybrid queries. Our results underscore the superior efficiency of MQRLD in multimodal data retrieval tasks, demonstrating its potential to significantly improve retrieval performance in complex environments. We also clarify some potential concerns in the discussion.
{"title":"MQRLD: A Multimodal Data Retrieval Platform with Query-aware Feature Representation and Learned Index Based on Data Lake","authors":"Ming Sheng, Shuliang Wang, Yong Zhang, Kaige Wang, Jingyi Wang, Yi Luo, Rui Hao","doi":"arxiv-2408.16237","DOIUrl":"https://doi.org/arxiv-2408.16237","url":null,"abstract":"Multimodal data has become a crucial element in the realm of big data\u0000analytics, driving advancements in data exploration, data mining, and\u0000empowering artificial intelligence applications. To support high-quality\u0000retrieval for these cutting-edge applications, a robust data retrieval platform\u0000should meet the requirements for transparent data storage, rich hybrid queries,\u0000effective feature representation, and high query efficiency. However, among the\u0000existing platforms, traditional schema-on-write systems, multi-model databases,\u0000vector databases, and data lakes, which are the primary options for multimodal\u0000data retrieval, are difficult to fulfill these requirements simultaneously.\u0000Therefore, there is an urgent need to develop a more versatile multimodal data\u0000retrieval platform to address these issues. In this paper, we introduce a Multimodal Data Retrieval Platform with\u0000Query-aware Feature Representation and Learned Index based on Data Lake\u0000(MQRLD). It leverages the transparent storage capabilities of data lakes,\u0000integrates the multimodal open API to provide a unified interface that supports\u0000rich hybrid queries, introduces a query-aware multimodal data feature\u0000representation strategy to obtain effective features, and offers\u0000high-dimensional learned indexes to optimize data query. We conduct a\u0000comparative analysis of the query performance of MQRLD against other methods\u0000for rich hybrid queries. Our results underscore the superior efficiency of\u0000MQRLD in handling multimodal data retrieval tasks, demonstrating its potential\u0000to significantly improve retrieval performance in complex environments. We also\u0000clarify some potential concerns in the discussion.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223251","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
CardBench: A Benchmark for Learned Cardinality Estimation in Relational Databases
Yannis Chronis, Yawen Wang, Yu Gan, Sami Abu-El-Haija, Chelsea Lin, Carsten Binnig, Fatma Özcan
arXiv:2408.16170 (2024-08-28)
Cardinality estimation is crucial for enabling high query performance in relational databases. Recently, learned cardinality estimation models have been proposed to improve accuracy, but there is no systematic benchmark or dataset collection that allows researchers to evaluate the progress made by new learned approaches, or to develop them systematically. In this paper, we release a benchmark containing thousands of queries over 20 distinct real-world databases for learned cardinality estimation. In contrast to other initial benchmarks, ours is much more diverse and can be used for systematically training and testing learned models. Using this benchmark, we explore whether learned cardinality estimation can transfer to an unseen dataset in a zero-shot manner. We train GNN-based and transformer-based models to study the problem in three setups: (1) instance-based, (2) zero-shot, and (3) fine-tuned. Our results show that while zero-shot cardinality estimation achieves promising results on simple single-table queries, accuracy drops as soon as joins are added. However, we show that with fine-tuning we can still utilize pre-trained models for cardinality estimation, significantly reducing training overheads compared to instance-specific models. We open-source our scripts to collect statistics, generate queries, and produce training datasets, to foster more extensive research on the important problem of cardinality estimation, including from the ML community, and in particular to improve recent directions such as pre-trained cardinality estimation.
{"title":"CardBench: A Benchmark for Learned Cardinality Estimation in Relational Databases","authors":"Yannis Chronis, Yawen Wang, Yu Gan, Sami Abu-El-Haija, Chelsea Lin, Carsten Binnig, Fatma Özcan","doi":"arxiv-2408.16170","DOIUrl":"https://doi.org/arxiv-2408.16170","url":null,"abstract":"Cardinality estimation is crucial for enabling high query performance in\u0000relational databases. Recently learned cardinality estimation models have been\u0000proposed to improve accuracy but there is no systematic benchmark or datasets\u0000which allows researchers to evaluate the progress made by new learned\u0000approaches and even systematically develop new learned approaches. In this\u0000paper, we are releasing a benchmark, containing thousands of queries over 20\u0000distinct real-world databases for learned cardinality estimation. In contrast\u0000to other initial benchmarks, our benchmark is much more diverse and can be used\u0000for training and testing learned models systematically. Using this benchmark,\u0000we explored whether learned cardinality estimation can be transferred to an\u0000unseen dataset in a zero-shot manner. We trained GNN-based and\u0000transformer-based models to study the problem in three setups: 1-)\u0000instance-based, 2-) zero-shot, and 3-) fine-tuned. Our results show that while\u0000we get promising results for zero-shot cardinality estimation on simple single\u0000table queries; as soon as we add joins, the accuracy drops. However, we show\u0000that with fine-tuning, we can still utilize pre-trained models for cardinality\u0000estimation, significantly reducing training overheads compared to instance\u0000specific models. We are open sourcing our scripts to collect statistics,\u0000generate queries and training datasets to foster more extensive research, also\u0000from the ML community on the important problem of cardinality estimation and in\u0000particular improve on recent directions such as pre-trained cardinality\u0000estimation.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223536","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
LLM-assisted Labeling Function Generation for Semantic Type Detection
Chenjie Li, Dan Zhang, Jin Wang
arXiv:2408.16173 (2024-08-28)
Detecting the semantic types of columns in data lake tables is an important application. A key bottleneck in semantic type detection is the availability of human annotation, owing to the inherent complexity of data lakes. In this paper, we propose using programmatic weak supervision to assist in annotating the training data for semantic type detection by leveraging labeling functions. One challenge in this process is the difficulty of manually writing labeling functions, due to the large volume and low quality of data lake table datasets. To address this issue, we explore employing Large Language Models (LLMs) to generate labeling functions and introduce several prompt engineering strategies for this purpose. We conduct experiments on real-world web table datasets. Based on the initial results, we perform extensive analysis and provide empirical insights and future directions for researchers in this field.
{"title":"LLM-assisted Labeling Function Generation for Semantic Type Detection","authors":"Chenjie Li, Dan Zhang, Jin Wang","doi":"arxiv-2408.16173","DOIUrl":"https://doi.org/arxiv-2408.16173","url":null,"abstract":"Detecting semantic types of columns in data lake tables is an important\u0000application. A key bottleneck in semantic type detection is the availability of\u0000human annotation due to the inherent complexity of data lakes. In this paper,\u0000we propose using programmatic weak supervision to assist in annotating the\u0000training data for semantic type detection by leveraging labeling functions. One\u0000challenge in this process is the difficulty of manually writing labeling\u0000functions due to the large volume and low quality of the data lake table\u0000datasets. To address this issue, we explore employing Large Language Models\u0000(LLMs) for labeling function generation and introduce several prompt\u0000engineering strategies for this purpose. We conduct experiments on real-world\u0000web table datasets. Based on the initial results, we perform extensive analysis\u0000and provide empirical insights and future directions for researchers in this\u0000field.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223247","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Empowering Database Learning Through Remote Educational Escape Rooms
Enrique Barra, Sonsoles López-Pernas, Aldo Gordillo, Alejandro Pozo, Andres Muñoz-Arcentales, Javier Conde
arXiv:2409.08284 (2024-08-28)
Learning about databases is indispensable for students of software engineering or computer science and for those working in the IT industry. We analyzed a remote educational escape room for teaching about databases in four different higher education courses over two consecutive academic years. We employed three evaluation instruments: a pre- and post-test to assess the escape room's effectiveness for student learning, a questionnaire to gather students' perceptions, and a Web platform that unobtrusively records students' interactions and performance. We present novel evidence that educational escape rooms conducted remotely can be both engaging and effective for teaching about databases.
{"title":"Empowering Database Learning Through Remote Educational Escape Rooms","authors":"Enrique Barra, Sonsoles López-Pernas, Aldo Gordillo, Alejandro Pozo, Andres Muñoz-Arcentales, Javier Conde","doi":"arxiv-2409.08284","DOIUrl":"https://doi.org/arxiv-2409.08284","url":null,"abstract":"Learning about databases is indispensable for individuals studying software\u0000engineering or computer science or those involved in the IT industry. We\u0000analyzed a remote educational escape room for teaching about databases in four\u0000different higher education courses in two consecutive academic years. We\u0000employed three instruments for evaluation: a pre- and post-test to assess the\u0000escape room's effectiveness for student learning, a questionnaire to gather\u0000students' perceptions, and a Web platform that unobtrusively records students'\u0000interactions and performance. We show novel evidence that educational escape\u0000rooms conducted remotely can be engaging as well as effective for teaching\u0000about databases.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142255247","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Enumeration of Minimal Hitting Sets Parameterized by Treewidth
Batya Kenig, Dan Shlomo Mizrahi
arXiv:2408.15776 (2024-08-28)
Enumerating the minimal hitting sets of a hypergraph is a problem that arises in many data management applications, including constraint mining, discovering unique column combinations, and enumerating database repairs. Previously, Eiter et al. showed that the minimal hitting sets of an $n$-vertex hypergraph with treewidth $w$ can be enumerated with delay $O^*(n^{w})$ (ignoring polynomial factors), with space requirements that scale with the output size. We improve this to fixed-parameter-linear delay, following an FPT preprocessing phase. The memory consumption of our algorithm is exponential in the treewidth of the hypergraph.
{"title":"Enumeration of Minimal Hitting Sets Parameterized by Treewidth","authors":"Batya Kenig, Dan Shlomo Mizrahi","doi":"arxiv-2408.15776","DOIUrl":"https://doi.org/arxiv-2408.15776","url":null,"abstract":"Enumerating the minimal hitting sets of a hypergraph is a problem which\u0000arises in many data management applications that include constraint mining,\u0000discovering unique column combinations, and enumerating database repairs.\u0000Previously, Eiter et al. showed that the minimal hitting sets of an $n$-vertex\u0000hypergraph, with treewidth $w$, can be enumerated with delay $O^*(n^{w})$\u0000(ignoring polynomial factors), with space requirements that scale with the\u0000output size. We improve this to fixed-parameter-linear delay, following an FPT\u0000preprocessing phase. The memory consumption of our algorithm is exponential\u0000with respect to the treewidth of the hypergraph.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223248","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Order-preserving pattern mining with forgetting mechanism
Yan Li, Chenyu Ma, Rong Gao, Youxi Wu, Jinyan Li, Wenjian Wang, Xindong Wu
arXiv:2408.15563 (2024-08-28)
Order-preserving pattern (OPP) mining is a type of sequential pattern mining in which a group of ranks of a time series is used to represent an OPP. This approach can discover frequent trends in time series. Existing OPP mining algorithms consider data points at different times to be equally important; however, newer data usually have a more significant impact, while older data have a weaker one. We therefore introduce a forgetting mechanism into OPP mining to reduce the importance of older data. This paper explores the mining of OPPs with a forgetting mechanism (OPF) and proposes an algorithm called OPF-Miner that can discover frequent OPFs. OPF-Miner performs two tasks: candidate pattern generation and support calculation. For candidate pattern generation, OPF-Miner employs a maximal support priority strategy and a group pattern fusion strategy to avoid redundant pattern fusions. For support calculation, we propose an algorithm called support calculation with forgetting mechanism, which uses prefix and suffix pattern pruning strategies to avoid redundant support calculations. Experiments are conducted on nine datasets against 12 alternative algorithms. The results verify that OPF-Miner is superior to the competing algorithms. More importantly, OPF-Miner yields good clustering performance for time series, thanks to the forgetting mechanism. All algorithms can be downloaded from https://github.com/wuc567/Pattern-Mining/tree/master/OPF-Miner.
{"title":"Order-preserving pattern mining with forgetting mechanism","authors":"Yan Li, Chenyu Ma, Rong Gao, Youxi Wu, Jinyan Li, Wenjian Wang, Xindong Wu","doi":"arxiv-2408.15563","DOIUrl":"https://doi.org/arxiv-2408.15563","url":null,"abstract":"Order-preserving pattern (OPP) mining is a type of sequential pattern mining\u0000method in which a group of ranks of time series is used to represent an OPP.\u0000This approach can discover frequent trends in time series. Existing OPP mining\u0000algorithms consider data points at different time to be equally important;\u0000however, newer data usually have a more significant impact, while older data\u0000have a weaker impact. We therefore introduce the forgetting mechanism into OPP\u0000mining to reduce the importance of older data. This paper explores the mining\u0000of OPPs with forgetting mechanism (OPF) and proposes an algorithm called\u0000OPF-Miner that can discover frequent OPFs. OPF-Miner performs two tasks,\u0000candidate pattern generation and support calculation. In candidate pattern\u0000generation, OPF-Miner employs a maximal support priority strategy and a group\u0000pattern fusion strategy to avoid redundant pattern fusions. For support\u0000calculation, we propose an algorithm called support calculation with forgetting\u0000mechanism, which uses prefix and suffix pattern pruning strategies to avoid\u0000redundant support calculations. The experiments are conducted on nine datasets\u0000and 12 alternative algorithms. The results verify that OPF-Miner is superior to\u0000other competitive algorithms. More importantly, OPF-Miner yields good\u0000clustering performance for time series, since the forgetting mechanism is\u0000employed. All algorithms can be downloaded from\u0000https://github.com/wuc567/Pattern-Mining/tree/master/OPF-Miner.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223250","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Text2SQL is Not Enough: Unifying AI and Databases with TAG
Asim Biswal, Liana Patel, Siddarth Jha, Amog Kamsetty, Shu Liu, Joseph E. Gonzalez, Carlos Guestrin, Matei Zaharia
arXiv:2408.14717 (2024-08-27)
AI systems that serve natural language questions over databases promise to unlock tremendous value. Such systems would allow users to leverage the powerful reasoning and knowledge capabilities of language models (LMs) alongside the scalable computational power of data management systems. These combined capabilities would empower users to ask arbitrary natural language questions over custom data sources. However, existing methods and benchmarks insufficiently explore this setting. Text2SQL methods focus solely on natural language questions that can be expressed in relational algebra, representing a small subset of the questions real users wish to ask. Likewise, Retrieval-Augmented Generation (RAG) considers the limited subset of queries that can be answered with point lookups to one or a few data records within the database. We propose Table-Augmented Generation (TAG), a unified and general-purpose paradigm for answering natural language questions over databases. The TAG model represents a wide range of interactions between the LM and database that have been previously unexplored and creates exciting research opportunities for leveraging the world knowledge and reasoning capabilities of LMs over data. We systematically develop benchmarks to study the TAG problem and find that standard methods answer no more than 20% of queries correctly, confirming the need for further research in this area. We release code for the benchmark at https://github.com/TAG-Research/TAG-Bench.
{"title":"Text2SQL is Not Enough: Unifying AI and Databases with TAG","authors":"Asim Biswal, Liana Patel, Siddarth Jha, Amog Kamsetty, Shu Liu, Joseph E. Gonzalez, Carlos Guestrin, Matei Zaharia","doi":"arxiv-2408.14717","DOIUrl":"https://doi.org/arxiv-2408.14717","url":null,"abstract":"AI systems that serve natural language questions over databases promise to\u0000unlock tremendous value. Such systems would allow users to leverage the\u0000powerful reasoning and knowledge capabilities of language models (LMs)\u0000alongside the scalable computational power of data management systems. These\u0000combined capabilities would empower users to ask arbitrary natural language\u0000questions over custom data sources. However, existing methods and benchmarks\u0000insufficiently explore this setting. Text2SQL methods focus solely on natural\u0000language questions that can be expressed in relational algebra, representing a\u0000small subset of the questions real users wish to ask. Likewise,\u0000Retrieval-Augmented Generation (RAG) considers the limited subset of queries\u0000that can be answered with point lookups to one or a few data records within the\u0000database. We propose Table-Augmented Generation (TAG), a unified and\u0000general-purpose paradigm for answering natural language questions over\u0000databases. The TAG model represents a wide range of interactions between the LM\u0000and database that have been previously unexplored and creates exciting research\u0000opportunities for leveraging the world knowledge and reasoning capabilities of\u0000LMs over data. We systematically develop benchmarks to study the TAG problem\u0000and find that standard methods answer no more than 20% of queries correctly,\u0000confirming the need for further research in this area. We release code for the\u0000benchmark at https://github.com/TAG-Research/TAG-Bench.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223537","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Finding Convincing Views to Endorse a Claim
Shunit Agmon (Technion - Israel Institute of Technology), Amir Gilad (Hebrew University), Brit Youngmann (Technion - Israel Institute of Technology), Shahar Zoarets (Technion - Israel Institute of Technology), Benny Kimelfeld (Technion - Israel Institute of Technology)
arXiv:2408.14974 (2024-08-27)
Recent studies investigated the challenge of assessing the strength of a given claim extracted from a dataset, particularly the claim's potential for being misleading and cherry-picked. We focus on claims that compare answers to an aggregate query posed on a view that selects tuples. The strength of a claim amounts to the question of how likely it is that the view was carefully chosen to support the claim, whereas less careful choices would lead to contradictory claims. We embark on the study of the reverse task, which offers a complementary angle in the critical assessment of data-based claims: given a claim, find useful supporting views. The goal of this task is twofold. On the one hand, we aim to assist users in finding significant evidence of phenomena of interest. On the other hand, we wish to provide them with machinery to criticize or counter given claims by extracting evidence of opposing statements. To be effective, the supporting sub-population should be significant and defined by a "natural" view. We discuss several measures of naturalness and propose ways of extracting the best views under each measure (and combinations thereof). The main challenge is the computational cost, as naive search is infeasible. We devise anytime algorithms that deploy two main steps: (1) a preliminary construction of a ranked list of attribute combinations, assessed using fast-to-compute features, and (2) an efficient search for the actual views based on each attribute combination. We present a thorough experimental study showing the effectiveness of our algorithms in terms of quality and execution cost. We also present a user study to assess the usefulness of the naturalness measures.
{"title":"Finding Convincing Views to Endorse a Claim","authors":"Shunit AgmonTechnion - Israel Institute of Technology, Amir GiladHebrew University, Brit YoungmannTechnion - Israel Institute of Technology, Shahar ZoaretsTechnion - Israel Institute of Technology, Benny KimelfeldTechnion - Israel Institute of Technology","doi":"arxiv-2408.14974","DOIUrl":"https://doi.org/arxiv-2408.14974","url":null,"abstract":"Recent studies investigated the challenge of assessing the strength of a\u0000given claim extracted from a dataset, particularly the claim's potential of\u0000being misleading and cherry-picked. We focus on claims that compare answers to\u0000an aggregate query posed on a view that selects tuples. The strength of a claim\u0000amounts to the question of how likely it is that the view is carefully chosen\u0000to support the claim, whereas less careful choices would lead to contradictory\u0000claims. We embark on the study of the reverse task that offers a complementary\u0000angle in the critical assessment of data-based claims: given a claim, find\u0000useful supporting views. The goal of this task is twofold. On the one hand, we\u0000aim to assist users in finding significant evidence of phenomena of interest.\u0000On the other hand, we wish to provide them with machinery to criticize or\u0000counter given claims by extracting evidence of opposing statements. To be effective, the supporting sub-population should be significant and\u0000defined by a ``natural'' view. We discuss several measures of naturalness and\u0000propose ways of extracting the best views under each measure (and combinations\u0000thereof). The main challenge is the computational cost, as na\"ive search is\u0000infeasible. We devise anytime algorithms that deploy two main steps: (1) a\u0000preliminary construction of a ranked list of attribute combinations that are\u0000assessed using fast-to-compute features, and (2) an efficient search for the\u0000actual views based on each attribute combination. We present a thorough\u0000experimental study that shows the effectiveness of our algorithms in terms of\u0000quality and execution cost. We also present a user study to assess the\u0000usefulness of the naturalness measures.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223253","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}