Updateable Data-Driven Cardinality Estimator with Bounded Q-error
Yingze Li, Xianglong Liu, Hongzhi Wang, Kaixin Zhang, Zixuan Wang
arXiv:2408.17209 (2024-08-30)
Modern cardinality estimators struggle with data updates. This research tackles this challenge in the single-table setting. We introduce ICE, an Index-based Cardinality Estimator, the first data-driven estimator that enables instant, tuple-level updates. ICE draws two key lessons from multidimensional indexes and applies them to cardinality estimation in dynamic scenarios: (1) an index can be trained swiftly and updated seamlessly over vast multidimensional data, and (2) an index captures the precise data distribution, staying synchronized with the latest database version. These insights let the index serve as a highly accurate, data-driven model that rapidly adapts to data updates and is resilient to out-of-distribution challenges during query testing. To make a single index support cardinality estimation, we craft algorithms for training, updating, and estimation, and analyze the estimator's unbiasedness and variance. Extensive experiments demonstrate the superiority of ICE: it offers precise estimates and fast updates/construction across diverse workloads. Compared to state-of-the-art real-time query-driven models, ICE is 2-3 orders of magnitude more accurate, updates 4.7-6.9 times faster, and trains up to 1-3 orders of magnitude faster.
{"title":"Updateable Data-Driven Cardinality Estimator with Bounded Q-error","authors":"Yingze Li, Xianglong Liu, Hongzhi Wang, Kaixin Zhang, Zixuan Wang","doi":"arxiv-2408.17209","DOIUrl":"https://doi.org/arxiv-2408.17209","url":null,"abstract":"Modern Cardinality Estimators struggle with data updates. This research\u0000tackles this challenge within single-table. We introduce ICE, an Index-based\u0000Cardinality Estimator, the first data-driven estimator that enables instant,\u0000tuple-leveled updates. ICE has learned two key lessons from the multidimensional index and applied\u0000them to solve cardinality estimation in dynamic scenarios: (1) Index possesses\u0000the capability for swift training and seamless updating amidst vast\u0000multidimensional data. (2) Index offers precise data distribution, staying\u0000synchronized with the latest database version. These insights endow the index\u0000with the ability to be a highly accurate, data-driven model that rapidly adapts\u0000to data updates and is resilient to out-of-distribution challenges during query\u0000testing. To make a solitary index support cardinality estimation, we have\u0000crafted sophisticated algorithms for training, updating, and estimating,\u0000analyzing unbiasedness and variance. Extensive experiments demonstrate the superiority of ICE. ICE offers precise\u0000estimations and fast updates/construction across diverse workloads. Compared to\u0000state-of-the-art real-time query-driven models, ICE boasts superior accuracy\u0000(2-3 orders of magnitude more precise), faster updates (4.7-6.9 times faster),\u0000and significantly reduced training time (up to 1-3 orders of magnitude faster).","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-08-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223245","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
CollectionLocator Level 1: Metadata-Based Search for Collections in Federated Biobanks
Volodymyr A. Shekhovtsov, Bence Slajcho, Aron Sacherer, Johann Eder
arXiv:2408.16422 (2024-08-29)
Biobanks are indispensable resources for medical research, collecting biological material and associated data and making them available for research projects and medical studies. To that end, biobank data must meet certain criteria, which can be formulated as adherence to the FAIR (findable, accessible, interoperable, and reusable) principles. We developed a tool, CollectionLocator, which aims to increase the FAIR compliance of biobank data by supporting researchers in identifying which biobank and which collection are likely to contain cases (material and data) satisfying the requirements of a given research project when detailed sample data is not available due to privacy restrictions. CollectionLocator is based on an ontology-based metadata model that addresses the enormous heterogeneities and ensures the privacy of the donors of the biological samples and data. Furthermore, CollectionLocator represents the data and metadata quality of the collections so that the quality requirements of the requester can be matched against the quality of the available data. The concept is evaluated with a proof-of-concept implementation.
{"title":"CollectionLocator Level 1: Metadata-Based Search for Collections in Federated Biobanks","authors":"Volodymyr A. Shekhovtsov, Bence Slajcho, Aron Sacherer, Johann Eder","doi":"arxiv-2408.16422","DOIUrl":"https://doi.org/arxiv-2408.16422","url":null,"abstract":"Biobanks are indispensable resources for medical research collecting\u0000biological material and associated data and making them available for research\u0000projects and medical studies. For that, the biobank data has to meet certain\u0000criteria which can be formulated as adherence to the FAIR (findable,\u0000accessible, interoperable and reusable) principles. We developed a tool, CollectionLocator, which aims at increasing the FAIR\u0000compliance of biobank data by supporting researchers in identifying which\u0000biobank and which collection are likely to contain cases (material and data)\u0000satisfying the requirements of a defined research project when the detailed\u0000sample data is not available due to privacy restrictions. The CollectionLocator\u0000is based on an ontology-based metadata model to address the enormous\u0000heterogeneities and ensure the privacy of the donors of the biological samples\u0000and the data. Furthermore, the CollectionLocator represents the data and\u0000metadata quality of the collections such that the quality requirements of the\u0000requester can be matched with the quality of the available data. The concept of\u0000CollectionLocator is evaluated with a proof-of-concept implementation.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223252","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MQRLD: A Multimodal Data Retrieval Platform with Query-aware Feature Representation and Learned Index Based on Data Lake
Ming Sheng, Shuliang Wang, Yong Zhang, Kaige Wang, Jingyi Wang, Yi Luo, Rui Hao
arXiv:2408.16237 (2024-08-29)
Multimodal data has become a crucial element of big data analytics, driving advances in data exploration and data mining and empowering artificial intelligence applications. To support high-quality retrieval for these cutting-edge applications, a robust data retrieval platform should provide transparent data storage, rich hybrid queries, effective feature representation, and high query efficiency. However, the existing platforms that are the primary options for multimodal data retrieval (traditional schema-on-write systems, multi-model databases, vector databases, and data lakes) struggle to fulfill these requirements simultaneously. There is therefore an urgent need for a more versatile multimodal data retrieval platform. In this paper, we introduce MQRLD, a Multimodal Data Retrieval Platform with Query-aware Feature Representation and Learned Index based on a Data Lake. It leverages the transparent storage capabilities of data lakes, integrates a multimodal open API to provide a unified interface supporting rich hybrid queries, introduces a query-aware multimodal feature representation strategy to obtain effective features, and offers high-dimensional learned indexes to optimize data queries. We conduct a comparative analysis of the query performance of MQRLD against other methods on rich hybrid queries. Our results underscore the superior efficiency of MQRLD in multimodal data retrieval tasks, demonstrating its potential to significantly improve retrieval performance in complex environments. We also clarify some potential concerns in the discussion.
{"title":"MQRLD: A Multimodal Data Retrieval Platform with Query-aware Feature Representation and Learned Index Based on Data Lake","authors":"Ming Sheng, Shuliang Wang, Yong Zhang, Kaige Wang, Jingyi Wang, Yi Luo, Rui Hao","doi":"arxiv-2408.16237","DOIUrl":"https://doi.org/arxiv-2408.16237","url":null,"abstract":"Multimodal data has become a crucial element in the realm of big data\u0000analytics, driving advancements in data exploration, data mining, and\u0000empowering artificial intelligence applications. To support high-quality\u0000retrieval for these cutting-edge applications, a robust data retrieval platform\u0000should meet the requirements for transparent data storage, rich hybrid queries,\u0000effective feature representation, and high query efficiency. However, among the\u0000existing platforms, traditional schema-on-write systems, multi-model databases,\u0000vector databases, and data lakes, which are the primary options for multimodal\u0000data retrieval, are difficult to fulfill these requirements simultaneously.\u0000Therefore, there is an urgent need to develop a more versatile multimodal data\u0000retrieval platform to address these issues. In this paper, we introduce a Multimodal Data Retrieval Platform with\u0000Query-aware Feature Representation and Learned Index based on Data Lake\u0000(MQRLD). It leverages the transparent storage capabilities of data lakes,\u0000integrates the multimodal open API to provide a unified interface that supports\u0000rich hybrid queries, introduces a query-aware multimodal data feature\u0000representation strategy to obtain effective features, and offers\u0000high-dimensional learned indexes to optimize data query. We conduct a\u0000comparative analysis of the query performance of MQRLD against other methods\u0000for rich hybrid queries. Our results underscore the superior efficiency of\u0000MQRLD in handling multimodal data retrieval tasks, demonstrating its potential\u0000to significantly improve retrieval performance in complex environments. We also\u0000clarify some potential concerns in the discussion.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223251","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
CardBench: A Benchmark for Learned Cardinality Estimation in Relational Databases
Yannis Chronis, Yawen Wang, Yu Gan, Sami Abu-El-Haija, Chelsea Lin, Carsten Binnig, Fatma Özcan
arXiv:2408.16170 (2024-08-28)
Cardinality estimation is crucial for enabling high query performance in relational databases. Recently, learned cardinality estimation models have been proposed to improve accuracy, but there is no systematic benchmark or dataset collection that allows researchers to evaluate the progress made by new learned approaches, or to develop them systematically. In this paper, we release a benchmark containing thousands of queries over 20 distinct real-world databases for learned cardinality estimation. In contrast to other initial benchmarks, ours is much more diverse and can be used for systematically training and testing learned models. Using this benchmark, we explore whether learned cardinality estimation can transfer to an unseen dataset in a zero-shot manner. We train GNN-based and transformer-based models to study the problem in three setups: (1) instance-based, (2) zero-shot, and (3) fine-tuned. Our results show that while zero-shot cardinality estimation achieves promising results on simple single-table queries, accuracy drops as soon as joins are added. However, we show that with fine-tuning we can still utilize pre-trained models for cardinality estimation, significantly reducing training overheads compared to instance-specific models. We open-source our scripts to collect statistics, generate queries, and produce training datasets, to foster more extensive research on the important problem of cardinality estimation, including from the ML community, and in particular to improve recent directions such as pre-trained cardinality estimation.
{"title":"CardBench: A Benchmark for Learned Cardinality Estimation in Relational Databases","authors":"Yannis Chronis, Yawen Wang, Yu Gan, Sami Abu-El-Haija, Chelsea Lin, Carsten Binnig, Fatma Özcan","doi":"arxiv-2408.16170","DOIUrl":"https://doi.org/arxiv-2408.16170","url":null,"abstract":"Cardinality estimation is crucial for enabling high query performance in\u0000relational databases. Recently learned cardinality estimation models have been\u0000proposed to improve accuracy but there is no systematic benchmark or datasets\u0000which allows researchers to evaluate the progress made by new learned\u0000approaches and even systematically develop new learned approaches. In this\u0000paper, we are releasing a benchmark, containing thousands of queries over 20\u0000distinct real-world databases for learned cardinality estimation. In contrast\u0000to other initial benchmarks, our benchmark is much more diverse and can be used\u0000for training and testing learned models systematically. Using this benchmark,\u0000we explored whether learned cardinality estimation can be transferred to an\u0000unseen dataset in a zero-shot manner. We trained GNN-based and\u0000transformer-based models to study the problem in three setups: 1-)\u0000instance-based, 2-) zero-shot, and 3-) fine-tuned. Our results show that while\u0000we get promising results for zero-shot cardinality estimation on simple single\u0000table queries; as soon as we add joins, the accuracy drops. However, we show\u0000that with fine-tuning, we can still utilize pre-trained models for cardinality\u0000estimation, significantly reducing training overheads compared to instance\u0000specific models. We are open sourcing our scripts to collect statistics,\u0000generate queries and training datasets to foster more extensive research, also\u0000from the ML community on the important problem of cardinality estimation and in\u0000particular improve on recent directions such as pre-trained cardinality\u0000estimation.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223536","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
LLM-assisted Labeling Function Generation for Semantic Type Detection
Chenjie Li, Dan Zhang, Jin Wang
arXiv:2408.16173 (2024-08-28)
Detecting the semantic types of columns in data lake tables is an important application. A key bottleneck in semantic type detection is the availability of human annotation, owing to the inherent complexity of data lakes. In this paper, we propose using programmatic weak supervision to assist in annotating the training data for semantic type detection by leveraging labeling functions. One challenge in this process is the difficulty of manually writing labeling functions, due to the large volume and low quality of data lake table datasets. To address this issue, we explore employing Large Language Models (LLMs) to generate labeling functions and introduce several prompt engineering strategies for this purpose. We conduct experiments on real-world web table datasets. Based on the initial results, we perform extensive analysis and provide empirical insights and future directions for researchers in this field.
{"title":"LLM-assisted Labeling Function Generation for Semantic Type Detection","authors":"Chenjie Li, Dan Zhang, Jin Wang","doi":"arxiv-2408.16173","DOIUrl":"https://doi.org/arxiv-2408.16173","url":null,"abstract":"Detecting semantic types of columns in data lake tables is an important\u0000application. A key bottleneck in semantic type detection is the availability of\u0000human annotation due to the inherent complexity of data lakes. In this paper,\u0000we propose using programmatic weak supervision to assist in annotating the\u0000training data for semantic type detection by leveraging labeling functions. One\u0000challenge in this process is the difficulty of manually writing labeling\u0000functions due to the large volume and low quality of the data lake table\u0000datasets. To address this issue, we explore employing Large Language Models\u0000(LLMs) for labeling function generation and introduce several prompt\u0000engineering strategies for this purpose. We conduct experiments on real-world\u0000web table datasets. Based on the initial results, we perform extensive analysis\u0000and provide empirical insights and future directions for researchers in this\u0000field.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223247","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Empowering Database Learning Through Remote Educational Escape Rooms
Enrique Barra, Sonsoles López-Pernas, Aldo Gordillo, Alejandro Pozo, Andres Muñoz-Arcentales, Javier Conde
arXiv:2409.08284 (2024-08-28)
Learning about databases is indispensable for students of software engineering or computer science and for those working in the IT industry. We analyzed a remote educational escape room for teaching about databases in four different higher education courses over two consecutive academic years. We employed three evaluation instruments: a pre- and post-test to assess the escape room's effectiveness for student learning, a questionnaire to gather students' perceptions, and a Web platform that unobtrusively records students' interactions and performance. We present novel evidence that educational escape rooms conducted remotely can be both engaging and effective for teaching about databases.
{"title":"Empowering Database Learning Through Remote Educational Escape Rooms","authors":"Enrique Barra, Sonsoles López-Pernas, Aldo Gordillo, Alejandro Pozo, Andres Muñoz-Arcentales, Javier Conde","doi":"arxiv-2409.08284","DOIUrl":"https://doi.org/arxiv-2409.08284","url":null,"abstract":"Learning about databases is indispensable for individuals studying software\u0000engineering or computer science or those involved in the IT industry. We\u0000analyzed a remote educational escape room for teaching about databases in four\u0000different higher education courses in two consecutive academic years. We\u0000employed three instruments for evaluation: a pre- and post-test to assess the\u0000escape room's effectiveness for student learning, a questionnaire to gather\u0000students' perceptions, and a Web platform that unobtrusively records students'\u0000interactions and performance. We show novel evidence that educational escape\u0000rooms conducted remotely can be engaging as well as effective for teaching\u0000about databases.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142255247","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Enumeration of Minimal Hitting Sets Parameterized by Treewidth
Batya Kenig, Dan Shlomo Mizrahi
arXiv:2408.15776 (2024-08-28)
Enumerating the minimal hitting sets of a hypergraph is a problem that arises in many data management applications, including constraint mining, discovering unique column combinations, and enumerating database repairs. Previously, Eiter et al. showed that the minimal hitting sets of an $n$-vertex hypergraph with treewidth $w$ can be enumerated with delay $O^*(n^{w})$ (ignoring polynomial factors), with space requirements that scale with the output size. We improve this to fixed-parameter-linear delay, following an FPT preprocessing phase. The memory consumption of our algorithm is exponential in the treewidth of the hypergraph.
{"title":"Enumeration of Minimal Hitting Sets Parameterized by Treewidth","authors":"Batya Kenig, Dan Shlomo Mizrahi","doi":"arxiv-2408.15776","DOIUrl":"https://doi.org/arxiv-2408.15776","url":null,"abstract":"Enumerating the minimal hitting sets of a hypergraph is a problem which\u0000arises in many data management applications that include constraint mining,\u0000discovering unique column combinations, and enumerating database repairs.\u0000Previously, Eiter et al. showed that the minimal hitting sets of an $n$-vertex\u0000hypergraph, with treewidth $w$, can be enumerated with delay $O^*(n^{w})$\u0000(ignoring polynomial factors), with space requirements that scale with the\u0000output size. We improve this to fixed-parameter-linear delay, following an FPT\u0000preprocessing phase. The memory consumption of our algorithm is exponential\u0000with respect to the treewidth of the hypergraph.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223248","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Order-preserving pattern mining with forgetting mechanism
Yan Li, Chenyu Ma, Rong Gao, Youxi Wu, Jinyan Li, Wenjian Wang, Xindong Wu
arXiv:2408.15563 (2024-08-28)
Order-preserving pattern (OPP) mining is a type of sequential pattern mining in which a group of ranks of a time series is used to represent an OPP. This approach can discover frequent trends in time series. Existing OPP mining algorithms consider data points at different times to be equally important; however, newer data usually have a more significant impact, while older data have a weaker one. We therefore introduce a forgetting mechanism into OPP mining to reduce the importance of older data. This paper explores the mining of OPPs with a forgetting mechanism (OPF) and proposes an algorithm called OPF-Miner that can discover frequent OPFs. OPF-Miner performs two tasks: candidate pattern generation and support calculation. For candidate pattern generation, OPF-Miner employs a maximal support priority strategy and a group pattern fusion strategy to avoid redundant pattern fusions. For support calculation, we propose an algorithm called support calculation with forgetting mechanism, which uses prefix and suffix pattern pruning strategies to avoid redundant support calculations. Experiments are conducted on nine datasets against 12 alternative algorithms. The results verify that OPF-Miner is superior to the competing algorithms. More importantly, OPF-Miner yields good clustering performance for time series, thanks to the forgetting mechanism. All algorithms can be downloaded from https://github.com/wuc567/Pattern-Mining/tree/master/OPF-Miner.
{"title":"Order-preserving pattern mining with forgetting mechanism","authors":"Yan Li, Chenyu Ma, Rong Gao, Youxi Wu, Jinyan Li, Wenjian Wang, Xindong Wu","doi":"arxiv-2408.15563","DOIUrl":"https://doi.org/arxiv-2408.15563","url":null,"abstract":"Order-preserving pattern (OPP) mining is a type of sequential pattern mining\u0000method in which a group of ranks of time series is used to represent an OPP.\u0000This approach can discover frequent trends in time series. Existing OPP mining\u0000algorithms consider data points at different time to be equally important;\u0000however, newer data usually have a more significant impact, while older data\u0000have a weaker impact. We therefore introduce the forgetting mechanism into OPP\u0000mining to reduce the importance of older data. This paper explores the mining\u0000of OPPs with forgetting mechanism (OPF) and proposes an algorithm called\u0000OPF-Miner that can discover frequent OPFs. OPF-Miner performs two tasks,\u0000candidate pattern generation and support calculation. In candidate pattern\u0000generation, OPF-Miner employs a maximal support priority strategy and a group\u0000pattern fusion strategy to avoid redundant pattern fusions. For support\u0000calculation, we propose an algorithm called support calculation with forgetting\u0000mechanism, which uses prefix and suffix pattern pruning strategies to avoid\u0000redundant support calculations. The experiments are conducted on nine datasets\u0000and 12 alternative algorithms. The results verify that OPF-Miner is superior to\u0000other competitive algorithms. More importantly, OPF-Miner yields good\u0000clustering performance for time series, since the forgetting mechanism is\u0000employed. All algorithms can be downloaded from\u0000https://github.com/wuc567/Pattern-Mining/tree/master/OPF-Miner.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223250","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Text2SQL is Not Enough: Unifying AI and Databases with TAG
Asim Biswal, Liana Patel, Siddarth Jha, Amog Kamsetty, Shu Liu, Joseph E. Gonzalez, Carlos Guestrin, Matei Zaharia
arXiv:2408.14717 (2024-08-27)
AI systems that serve natural language questions over databases promise to unlock tremendous value. Such systems would allow users to leverage the powerful reasoning and knowledge capabilities of language models (LMs) alongside the scalable computational power of data management systems. These combined capabilities would empower users to ask arbitrary natural language questions over custom data sources. However, existing methods and benchmarks insufficiently explore this setting. Text2SQL methods focus solely on natural language questions that can be expressed in relational algebra, representing a small subset of the questions real users wish to ask. Likewise, Retrieval-Augmented Generation (RAG) considers the limited subset of queries that can be answered with point lookups to one or a few data records within the database. We propose Table-Augmented Generation (TAG), a unified and general-purpose paradigm for answering natural language questions over databases. The TAG model represents a wide range of interactions between the LM and database that have been previously unexplored and creates exciting research opportunities for leveraging the world knowledge and reasoning capabilities of LMs over data. We systematically develop benchmarks to study the TAG problem and find that standard methods answer no more than 20% of queries correctly, confirming the need for further research in this area. We release code for the benchmark at https://github.com/TAG-Research/TAG-Bench.
{"title":"Text2SQL is Not Enough: Unifying AI and Databases with TAG","authors":"Asim Biswal, Liana Patel, Siddarth Jha, Amog Kamsetty, Shu Liu, Joseph E. Gonzalez, Carlos Guestrin, Matei Zaharia","doi":"arxiv-2408.14717","DOIUrl":"https://doi.org/arxiv-2408.14717","url":null,"abstract":"AI systems that serve natural language questions over databases promise to\u0000unlock tremendous value. Such systems would allow users to leverage the\u0000powerful reasoning and knowledge capabilities of language models (LMs)\u0000alongside the scalable computational power of data management systems. These\u0000combined capabilities would empower users to ask arbitrary natural language\u0000questions over custom data sources. However, existing methods and benchmarks\u0000insufficiently explore this setting. Text2SQL methods focus solely on natural\u0000language questions that can be expressed in relational algebra, representing a\u0000small subset of the questions real users wish to ask. Likewise,\u0000Retrieval-Augmented Generation (RAG) considers the limited subset of queries\u0000that can be answered with point lookups to one or a few data records within the\u0000database. We propose Table-Augmented Generation (TAG), a unified and\u0000general-purpose paradigm for answering natural language questions over\u0000databases. The TAG model represents a wide range of interactions between the LM\u0000and database that have been previously unexplored and creates exciting research\u0000opportunities for leveraging the world knowledge and reasoning capabilities of\u0000LMs over data. We systematically develop benchmarks to study the TAG problem\u0000and find that standard methods answer no more than 20% of queries correctly,\u0000confirming the need for further research in this area. We release code for the\u0000benchmark at https://github.com/TAG-Research/TAG-Bench.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223537","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Finding Convincing Views to Endorse a Claim
Shunit Agmon (Technion - Israel Institute of Technology), Amir Gilad (Hebrew University), Brit Youngmann (Technion - Israel Institute of Technology), Shahar Zoarets (Technion - Israel Institute of Technology), Benny Kimelfeld (Technion - Israel Institute of Technology)
arXiv:2408.14974 (2024-08-27)
Recent studies investigated the challenge of assessing the strength of a given claim extracted from a dataset, particularly the claim's potential for being misleading and cherry-picked. We focus on claims that compare answers to an aggregate query posed on a view that selects tuples. The strength of a claim amounts to the question of how likely it is that the view was carefully chosen to support the claim, whereas less careful choices would lead to contradictory claims. We embark on the study of the reverse task, which offers a complementary angle in the critical assessment of data-based claims: given a claim, find useful supporting views. The goal of this task is twofold. On the one hand, we aim to assist users in finding significant evidence of phenomena of interest. On the other hand, we wish to provide them with machinery to criticize or counter given claims by extracting evidence of opposing statements. To be effective, the supporting sub-population should be significant and defined by a "natural" view. We discuss several measures of naturalness and propose ways of extracting the best views under each measure (and combinations thereof). The main challenge is the computational cost, as naive search is infeasible. We devise anytime algorithms that deploy two main steps: (1) a preliminary construction of a ranked list of attribute combinations, assessed using fast-to-compute features, and (2) an efficient search for the actual views based on each attribute combination. We present a thorough experimental study showing the effectiveness of our algorithms in terms of quality and execution cost. We also present a user study to assess the usefulness of the naturalness measures.
{"title":"Finding Convincing Views to Endorse a Claim","authors":"Shunit AgmonTechnion - Israel Institute of Technology, Amir GiladHebrew University, Brit YoungmannTechnion - Israel Institute of Technology, Shahar ZoaretsTechnion - Israel Institute of Technology, Benny KimelfeldTechnion - Israel Institute of Technology","doi":"arxiv-2408.14974","DOIUrl":"https://doi.org/arxiv-2408.14974","url":null,"abstract":"Recent studies investigated the challenge of assessing the strength of a\u0000given claim extracted from a dataset, particularly the claim's potential of\u0000being misleading and cherry-picked. We focus on claims that compare answers to\u0000an aggregate query posed on a view that selects tuples. The strength of a claim\u0000amounts to the question of how likely it is that the view is carefully chosen\u0000to support the claim, whereas less careful choices would lead to contradictory\u0000claims. We embark on the study of the reverse task that offers a complementary\u0000angle in the critical assessment of data-based claims: given a claim, find\u0000useful supporting views. The goal of this task is twofold. On the one hand, we\u0000aim to assist users in finding significant evidence of phenomena of interest.\u0000On the other hand, we wish to provide them with machinery to criticize or\u0000counter given claims by extracting evidence of opposing statements. To be effective, the supporting sub-population should be significant and\u0000defined by a ``natural'' view. We discuss several measures of naturalness and\u0000propose ways of extracting the best views under each measure (and combinations\u0000thereof). The main challenge is the computational cost, as na\"ive search is\u0000infeasible. We devise anytime algorithms that deploy two main steps: (1) a\u0000preliminary construction of a ranked list of attribute combinations that are\u0000assessed using fast-to-compute features, and (2) an efficient search for the\u0000actual views based on each attribute combination. We present a thorough\u0000experimental study that shows the effectiveness of our algorithms in terms of\u0000quality and execution cost. We also present a user study to assess the\u0000usefulness of the naturalness measures.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223253","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}