
arXiv - CS - Databases: Latest Publications

Messy Code Makes Managing ML Pipelines Difficult? Just Let LLMs Rewrite the Code!
Pub Date : 2024-09-16 DOI: arxiv-2409.10081
Sebastian Schelter, Stefan Grafberger
Machine learning (ML) applications that learn from data are increasingly used to automate impactful decisions. Unfortunately, these applications often fall short of adequately managing critical data and complying with upcoming regulations. A technical reason for the persistence of these issues is that the data pipelines in common ML libraries and cloud services lack fundamental declarative, data-centric abstractions. Recent research has shown how such abstractions enable techniques like provenance tracking and automatic inspection to help manage ML pipelines. Unfortunately, these approaches lack adoption in the real world because they require clean ML pipeline code written with declarative APIs, instead of the messy imperative Python code that data scientists typically write for data preparation. We argue that it is unrealistic to expect data scientists to change their established development practices. Instead, we propose to circumvent this "code abstraction gap" by leveraging the code generation capabilities of large language models (LLMs). Our idea is to rewrite messy data science code to a custom-tailored declarative pipeline abstraction, which we implement as a proof-of-concept in our prototype Lester. We detail its application for a challenging compliance management example involving "incremental view maintenance" of deployed ML pipelines. The code rewrites for our running example show the potential of LLMs to make messy data science code declarative, e.g., by identifying hand-coded joins in Python and turning them into joins on dataframes, or by generating declarative feature encoders from NumPy code.
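To make the rewrite concrete, here is a minimal sketch of the kind of transformation the abstract describes: a hand-coded Python join turned into a dataframe join. The data and variable names are illustrative; Lester's actual rewrite rules and pipeline abstraction are defined in the paper.

```python
# Illustrative only: the kind of rewrite the abstract describes, not Lester's
# actual transformation rules.
import pandas as pd

customers = [{"id": 1, "country": "NL"}, {"id": 2, "country": "DE"}]
orders = [{"customer_id": 1, "amount": 30.0}, {"customer_id": 1, "amount": 12.5}]

# "Messy" imperative style: a hand-coded nested-loop join over dicts.
joined_imperative = []
for o in orders:
    for c in customers:
        if o["customer_id"] == c["id"]:
            joined_imperative.append({**c, **o})

# Declarative rewrite: the same join expressed on dataframes, which
# provenance-tracking and inspection tools can reason about.
customers_df = pd.DataFrame(customers)
orders_df = pd.DataFrame(orders)
joined_df = orders_df.merge(customers_df, left_on="customer_id", right_on="id")

assert len(joined_df) == len(joined_imperative)
```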
Citations: 0
Development of Data Evaluation Benchmark for Data Wrangling Recommendation System
Pub Date : 2024-09-16 DOI: arxiv-2409.10635
Yuqing Wang, Anna Fariha
CoWrangler is a data-wrangling recommender system designed to streamline data processing tasks. Recognizing that data processing is often time-consuming and complex for novice users, we aim to simplify the decision-making process regarding the most effective subsequent data operation. By analyzing over 10,000 Kaggle notebooks spanning approximately 1,000 datasets, we derive insights into common data processing strategies employed by users across various tasks. This analysis helps us understand how dataset quality influences wrangling operations, informing our ongoing efforts to possibly expand our dataset sources in the future.
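As a rough illustration of the notebook analysis described above (the abstract does not specify the implementation), the following sketch counts which dataframe operations appear most often across a directory of .ipynb files; the "notebooks" directory and the focus on method names are assumptions.

```python
# A rough sketch of notebook mining, assuming notebooks are available as
# .ipynb files on disk; CoWrangler's actual analysis pipeline may differ.
import ast
import json
from collections import Counter
from pathlib import Path

def count_method_calls(source: str, counter: Counter) -> None:
    """Count method names called in a code cell, e.g. merge, fillna, dropna."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return  # skip cells containing magics or invalid syntax
    for node in ast.walk(tree):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
            counter[node.func.attr] += 1

ops = Counter()
for nb_path in Path("notebooks").glob("*.ipynb"):
    nb = json.loads(nb_path.read_text())
    for cell in nb.get("cells", []):
        if cell.get("cell_type") == "code":
            count_method_calls("".join(cell.get("source", [])), ops)

print(ops.most_common(10))  # the most frequent wrangling operations
```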
Citations: 0
Fast and Adaptive Bulk Loading of Multidimensional Points
Pub Date : 2024-09-14 DOI: arxiv-2409.09447
Moin Hussain Moti, Dimitris Papadias
Existing methods for bulk loading disk-based multidimensional points involve multiple applications of external sorting. In this paper, we propose techniques that apply linear scan, and are therefore significantly faster. The resulting FMBI index possesses several desirable properties, including almost full and square nodes with zero overlap, and has excellent query performance. As a second contribution, we develop an adaptive version, AMBI, which utilizes the query workload to build a partial index only for parts of the data space that contain query results. Finally, we extend FMBI and AMBI to parallel bulk loading and query processing in distributed systems. An extensive experimental evaluation with real datasets confirms that FMBI and AMBI clearly outperform competitors in terms of combined index construction and query processing cost, sometimes by orders of magnitude.
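The paper's FMBI algorithm is not reproduced here, but the following toy shows the general appeal of linear-scan bulk loading: a single pass buckets points into roughly page-sized leaf groups, with no external sorting. The grid sizing and the uniform-data assumption are ours, not the paper's.

```python
# An assumption-laden toy, not the FMBI algorithm: one linear scan buckets 2D
# points into ~page-sized leaf groups, replacing repeated external sorts.
import math
import random

random.seed(0)
PAGE_CAPACITY = 8
points = [(random.random(), random.random()) for _ in range(256)]

# Grid sized so each cell receives roughly one page of points under
# near-uniform data; skewed cells would need splitting in a real index.
cells_per_axis = math.ceil(math.sqrt(len(points) / PAGE_CAPACITY))
grid: dict[tuple[int, int], list[tuple[float, float]]] = {}
for x, y in points:  # a single linear scan, no sorting
    cx = min(int(x * cells_per_axis), cells_per_axis - 1)
    cy = min(int(y * cells_per_axis), cells_per_axis - 1)
    grid.setdefault((cx, cy), []).append((x, y))

print(len(grid), "leaf cells; fill min/max:",
      min(map(len, grid.values())), max(map(len, grid.values())))
```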
Citations: 0
Matrix Profile for Anomaly Detection on Multidimensional Time Series
Pub Date : 2024-09-14 DOI: arxiv-2409.09298
Chin-Chia Michael Yeh, Audrey Der, Uday Singh Saini, Vivian Lai, Yan Zheng, Junpeng Wang, Xin Dai, Zhongfang Zhuang, Yujie Fan, Huiyuan Chen, Prince Osei Aboagye, Liang Wang, Wei Zhang, Eamonn Keogh
The Matrix Profile (MP), a versatile tool for time series data mining, has been shown effective in time series anomaly detection (TSAD). This paper delves into the problem of anomaly detection in multidimensional time series, a common occurrence in real-world applications. For instance, in a manufacturing factory, multiple sensors installed across the site collect time-varying data for analysis. The Matrix Profile, named for its role in profiling the matrix storing pairwise distances between subsequences of a univariate time series, becomes complex in multidimensional scenarios. If the input univariate time series has n subsequences, the pairwise distance matrix is an n x n matrix. In a multidimensional time series with d dimensions, the pairwise distance information must be stored in an n x n x d tensor. In this paper, we first analyze different strategies for condensing this tensor into a profile vector. We then investigate the potential of extending the MP to efficiently find k-nearest neighbors for anomaly detection. Finally, we benchmark the multidimensional MP against 19 baseline methods on 119 multidimensional TSAD datasets. The experiments cover three learning setups: unsupervised, supervised, and semi-supervised. MP is the only method that consistently delivers high performance across all setups.
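The following sketch makes the distance structures concrete: it materializes the n x n x d tensor naively for a small series and condenses it into a profile vector by averaging across dimensions. Real MP implementations avoid building the full matrix, and the paper studies several condensing strategies; averaging and the crude trivial-match exclusion here are illustrative choices.

```python
# A naive illustration of the structures described above; production MP
# algorithms (e.g., STOMP) never materialize the full distance matrix.
import numpy as np

def distance_matrix(ts: np.ndarray, m: int) -> np.ndarray:
    """n x n z-normalized Euclidean distances between length-m subsequences."""
    n = len(ts) - m + 1
    subs = np.array([ts[i:i + m] for i in range(n)])
    subs = (subs - subs.mean(axis=1, keepdims=True)) / (
        subs.std(axis=1, keepdims=True) + 1e-12)
    diff = subs[:, None, :] - subs[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

rng = np.random.default_rng(0)
ts = rng.normal(size=(200, 3))  # multidimensional series with d = 3
m = 20

# The n x n x d tensor: one pairwise distance matrix per dimension.
tensor = np.stack([distance_matrix(ts[:, k], m) for k in range(ts.shape[1])],
                  axis=-1)

# One condensing strategy: average across dimensions, crudely exclude the
# self-match, and keep each subsequence's nearest-neighbor distance.
agg = tensor.mean(axis=-1)
np.fill_diagonal(agg, np.inf)
profile = agg.min(axis=1)
print("anomaly candidate at offset", int(profile.argmax()))
```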
Citations: 0
Extending predictive process monitoring for collaborative processes
Pub Date : 2024-09-13 DOI: arxiv-2409.09212
Daniel Calegari, Andrea Delgado
Process mining on business process execution data has focused primarily on orchestration-type processes performed in a single organization (intra-organizational). Collaborative (inter-organizational) processes, unlike those of orchestration type, span several organizations (for example, in e-Government), adding complexity and various challenges both for their implementation and for the discovery, prediction, and analysis of their execution. Predictive process monitoring is based on exploiting execution data from past instances to predict the execution of current cases. It is possible to make predictions on the next activity and remaining time, among others, to anticipate possible deviations, violations, and delays in the processes and to take preventive measures (e.g., re-allocation of resources). In this work, we propose an extension of traditional process prediction to collaborative processes, considering the particularities of this type of process and adding information of interest in this context, for example, which participant performs the next activity or the next message to be exchanged between two participants.
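As a minimal illustration of next-activity prediction from past instances (not the paper's model), the sketch below learns activity-transition counts from completed traces; the paper's extension would enrich such a model with collaboration-specific information such as the acting participant or exchanged messages.

```python
# A minimal next-activity predictor over an event log, shown only to make the
# prediction task concrete; the traces and activity names are invented.
from collections import Counter, defaultdict

# Each trace is the ordered activity sequence of one completed case.
traces = [
    ["register", "review", "approve", "notify"],
    ["register", "review", "reject", "notify"],
    ["register", "review", "approve", "notify"],
]

# Learn P(next activity | current activity) as simple transition counts.
transitions = defaultdict(Counter)
for trace in traces:
    for prev, nxt in zip(trace, trace[1:]):
        transitions[prev][nxt] += 1

def predict_next(activity: str) -> str:
    return transitions[activity].most_common(1)[0][0]

print(predict_next("review"))  # -> "approve" (the majority of past cases)
```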
Citations: 0
A Systematic Review on Process Mining for Curricular Analysis
Pub Date : 2024-09-13 DOI: arxiv-2409.09204
Daniel Calegari, Andrea Delgado
Educational Process Mining (EPM) is a data analysis technique that is used to improve educational processes. It is based on Process Mining (PM), which involves gathering records (logs) of events to discover process models and analyze the data from a process-centric perspective. One specific application of EPM is curriculum mining, which focuses on understanding the learning program students follow to achieve educational goals. This is important for institutional curriculum decision-making and quality improvement. Therefore, academic institutions can benefit from organizing the existing techniques, capabilities, and limitations. We conducted a systematic literature review to identify works on applying PM to curricular analysis and provide insights for further research. From the analysis of 22 primary studies, we found that results can be classified into five categories concerning the objectives they pursue: the discovery of educational trajectories, the identification of deviations in the observed behavior of students, the analysis of bottlenecks, the analysis of stopout and dropout problems, and the generation of recommendations. Moreover, we identified some open challenges and opportunities, such as standardizing study replication to enable cross-university curricular analysis and strengthening the connection between PM and data mining to improve curricular analysis.
Citations: 0
DPconv: Super-Polynomially Faster Join Ordering
Pub Date : 2024-09-12 DOI: arxiv-2409.08013
Mihail Stoian, Andreas Kipf
We revisit the join ordering problem in query optimization. The standard exact algorithm, DPccp, has a worst-case running time of $O(3^n)$. This is prohibitively expensive for large queries, which are not that uncommon anymore. We develop a new algorithmic framework based on subset convolution. DPconv achieves a super-polynomial speedup over DPccp, breaking the $O(3^n)$ time-barrier for the first time. We show that the instantiation of our framework for the $C_{max}$ cost function is up to 30x faster than DPccp for large clique queries.
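To see where the $O(3^n)$ term comes from, the toy dynamic program below enumerates, for every relation subset, all ways to split it into two sub-plans; summing $2^{|s|}$ over all subsets $s$ gives $3^n$ split enumerations in total. The cost model and selectivities are invented for illustration, the sketch ignores join-graph connectivity (which DPccp exploits), and DPconv's subset-convolution machinery is in the paper.

```python
# A toy DP over subsets for join ordering (DPsub-style). Enumerating all
# (subset, split) pairs is the O(3^n) bottleneck that DPconv attacks with
# subset convolution. Cardinalities and selectivity are invented.
from itertools import combinations

n = 4
card = {r: 1000.0 for r in range(n)}  # base-relation cardinalities
SEL = 0.01                            # toy uniform join selectivity

def cardinality(s: frozenset) -> float:
    """Estimated size of the join over the relations in s."""
    prod = 1.0
    for r in s:
        prod *= card[r]
    return prod * SEL ** (len(s) - 1)

best = {frozenset({r}): 0.0 for r in range(n)}  # single relations cost nothing

for size in range(2, n + 1):
    for combo in combinations(range(n), size):
        s = frozenset(combo)
        members = sorted(s)
        cost = float("inf")
        # Enumerate every proper split s = left ∪ right: 2^|s| work per subset.
        for mask in range(1, 2 ** len(members) - 1):
            left = frozenset(m for i, m in enumerate(members) if mask >> i & 1)
            right = s - left
            cost = min(cost, best[left] + best[right] + cardinality(s))
        best[s] = cost

print(f"optimal plan cost: {best[frozenset(range(n))]:.0f}")
```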
Citations: 0
Ranked Enumeration for Database Queries
Pub Date : 2024-09-12 DOI: arxiv-2409.08142
Nikolaos Tziavelis, Wolfgang Gatterbauer, Mirek Riedewald
Ranked enumeration is a query-answering paradigm where the query answers are returned incrementally in order of importance (instead of returning all answers at once). Importance is defined by a ranking function that can be specific to the application, but typically involves either a lexicographic order (e.g., "ORDER BY R.A, S.B" in SQL) or a weighted sum of attributes (e.g., "ORDER BY 3*R.A + 2*S.B"). We recently introduced any-k algorithms for (multi-way) join queries, which push ranking into joins and avoid materializing intermediate results until necessary. The top-ranked answers are returned asymptotically faster than the common join-then-rank approach of database systems, resulting in orders-of-magnitude speedup in practice. In addition to their practical usefulness, our techniques complement a long line of theoretical research on unranked enumeration, where answers are also returned incrementally, but with no explicit ordering requirement. For a broad class of ranking functions with certain monotonicity properties, including lexicographic orders and sum-based rankings, the ordering requirement surprisingly does not increase the asymptotic time or space complexity, apart from logarithmic factors. A key insight of our work is the connection between ranked enumeration for database queries and the fundamental task of computing the kth-shortest path in a graph. Uncovering these connections allowed us to ground our approach in the rich literature of that problem and connect ideas that had been explored in isolation before. In this article, we adopt a pragmatic approach and present a slightly simplified version of the algorithm without the shortest-path interpretation. We believe that this will benefit practitioners looking to implement and optimize any-k approaches.
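A minimal sketch of the heap-based frontier idea behind ranked enumeration, shown for a weighted-sum ranking over two sorted inputs (a cross product, for brevity, rather than a join with a predicate): answers pop in score order without materializing and sorting all combinations. The any-k algorithms in the article generalize this to multi-way joins.

```python
# Ranked enumeration over two sorted attribute lists, ordered by the weighted
# sum 3*R.A + 2*S.B; a kth-shortest-path-style frontier expansion. The values
# are invented, and a real join would also check a join predicate.
import heapq

R = sorted([1, 4, 2])  # values of R.A, ascending
S = sorted([5, 3, 8])  # values of S.B, ascending

def score(i: int, j: int) -> int:
    return 3 * R[i] + 2 * S[j]

heap = [(score(0, 0), 0, 0)]
seen = {(0, 0)}
while heap:  # pops answers in ranked order, one at a time
    sc, i, j = heapq.heappop(heap)
    print(f"R.A={R[i]}, S.B={S[j]}, score={sc}")
    # Expand the frontier: both successors have equal or larger scores.
    for ni, nj in ((i + 1, j), (i, j + 1)):
        if ni < len(R) and nj < len(S) and (ni, nj) not in seen:
            seen.add((ni, nj))
            heapq.heappush(heap, (score(ni, nj), ni, nj))
```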
Citations: 0
meds_reader: A fast and efficient EHR processing library
Pub Date : 2024-09-12 DOI: arxiv-2409.09095
Ethan Steinberg, Michael Wornow, Suhana Bedi, Jason Alan Fries, Matthew B. A. McDermott, Nigam H. Shah
The growing demand for machine learning in healthcare requires processing increasingly large electronic health record (EHR) datasets, but existing pipelines are not computationally efficient or scalable. In this paper, we introduce meds_reader, an optimized Python package for efficient EHR data processing that is designed to take advantage of many intrinsic properties of EHR data for improved speed. We then demonstrate the benefits of meds_reader by reimplementing key components of two major EHR processing pipelines, achieving 10-100x improvements in memory, speed, and disk usage. The code for meds_reader can be found at https://github.com/som-shahlab/meds_reader.
Citations: 0
echemdb Toolkit -- a Lightweight Approach to Getting Data Ready for Data Management Solutions
Pub Date : 2024-09-11 DOI: arxiv-2409.07083
Albert K. Engstfeld, Johannes M. Hermann, Nicolas G. Hörmann, Julian Rüth
According to the FAIR (findability, accessibility, interoperability, and reusability) principles, scientific data should always be stored with machine-readable descriptive metadata. Existing solutions to store data with metadata, such as electronic lab notebooks (ELN), are often very domain-specific and not sufficiently generic for arbitrary experimental or computational results. In this work, we present the open-source echemdb toolkit for creating and handling data and metadata. The toolkit runs entirely on the file-system level using a file-based approach, which facilitates integration with other tools in a FAIR data life cycle and means that no complicated server setup is required. This also makes the toolkit more accessible to the average researcher, since no understanding of more sophisticated database technologies is required. We showcase several aspects and applications of the toolkit: automatic annotation of raw research data with human- and machine-readable metadata, data conversion into standardised frictionless Data Packages, and an API for exploring the data. We also showcase web frameworks for illustrating the data, using example data from research into energy conversion and storage.
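In the spirit of the file-based approach, here is a minimal sketch of annotating a raw data file with a machine-readable JSON descriptor stored alongside it. The field names, units, and metadata values are hypothetical, and echemdb's actual descriptor format and APIs are documented in its repository.

```python
# A file-level metadata sketch loosely modeled on a Data Package descriptor;
# the schema and metadata values below are hypothetical, not echemdb's format.
import json
from pathlib import Path

# Raw measurement data, e.g. exported by an instrument.
Path("measurement.csv").write_text("t,U\n0.0,0.01\n0.1,0.02\n")

# A machine-readable descriptor stored next to the data file.
descriptor = {
    "name": "measurement",
    "resources": [{
        "path": "measurement.csv",
        "schema": {"fields": [
            {"name": "t", "unit": "s", "type": "number"},
            {"name": "U", "unit": "V", "type": "number"},
        ]},
    }],
    "metadata": {"system": {"electrolyte": "1M H2SO4"}},  # hypothetical annotation
}
Path("measurement.json").write_text(json.dumps(descriptor, indent=2))
```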
Citations: 0