Machine learning (ML) applications that learn from data are increasingly used to automate impactful decisions. Unfortunately, these applications often fall short of adequately managing critical data and complying with upcoming regulations. A technical reason for the persistence of these issues is that the data pipelines in common ML libraries and cloud services lack fundamental declarative, data-centric abstractions. Recent research has shown how such abstractions enable techniques like provenance tracking and automatic inspection to help manage ML pipelines. Unfortunately, these approaches lack adoption in the real world because they require clean ML pipeline code written with declarative APIs, instead of the messy imperative Python code that data scientists typically write for data preparation. We argue that it is unrealistic to expect data scientists to change their established development practices. Instead, we propose to circumvent this "code abstraction gap" by leveraging the code generation capabilities of large language models (LLMs). Our idea is to rewrite messy data science code to a custom-tailored declarative pipeline abstraction, which we implement as a proof-of-concept in our prototype Lester. We detail its application for a challenging compliance management example involving "incremental view maintenance" of deployed ML pipelines. The code rewrites for our running example show the potential of LLMs to make messy data science code declarative, e.g., by identifying hand-coded joins in Python and turning them into joins on dataframes, or by generating declarative feature encoders from NumPy code.
{"title":"Messy Code Makes Managing ML Pipelines Difficult? Just Let LLMs Rewrite the Code!","authors":"Sebastian Schelter, Stefan Grafberger","doi":"arxiv-2409.10081","DOIUrl":"https://doi.org/arxiv-2409.10081","url":null,"abstract":"Machine learning (ML) applications that learn from data are increasingly used\u0000to automate impactful decisions. Unfortunately, these applications often fall\u0000short of adequately managing critical data and complying with upcoming\u0000regulations. A technical reason for the persistence of these issues is that the\u0000data pipelines in common ML libraries and cloud services lack fundamental\u0000declarative, data-centric abstractions. Recent research has shown how such\u0000abstractions enable techniques like provenance tracking and automatic\u0000inspection to help manage ML pipelines. Unfortunately, these approaches lack\u0000adoption in the real world because they require clean ML pipeline code written\u0000with declarative APIs, instead of the messy imperative Python code that data\u0000scientists typically write for data preparation. We argue that it is unrealistic to expect data scientists to change their\u0000established development practices. Instead, we propose to circumvent this \"code\u0000abstraction gap\" by leveraging the code generation capabilities of large\u0000language models (LLMs). Our idea is to rewrite messy data science code to a\u0000custom-tailored declarative pipeline abstraction, which we implement as a\u0000proof-of-concept in our prototype Lester. We detail its application for a\u0000challenging compliance management example involving \"incremental view\u0000maintenance\" of deployed ML pipelines. The code rewrites for our running\u0000example show the potential of LLMs to make messy data science code declarative,\u0000e.g., by identifying hand-coded joins in Python and turning them into joins on\u0000dataframes, or by generating declarative feature encoders from NumPy code.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":"16 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142255219","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
CoWrangler is a data-wrangling recommender system designed to streamline data processing tasks. Recognizing that data processing is often time-consuming and complex for novice users, we aim to simplify the decision-making process regarding the most effective subsequent data operation. By analyzing over 10,000 Kaggle notebooks spanning approximately 1,000 datasets, we derive insights into common data processing strategies employed by users across various tasks. This analysis helps us understand how dataset quality influences wrangling operations, informing our ongoing efforts to possibly expand our dataset sources in the future.
{"title":"Development of Data Evaluation Benchmark for Data Wrangling Recommendation System","authors":"Yuqing Wang, Anna Fariha","doi":"arxiv-2409.10635","DOIUrl":"https://doi.org/arxiv-2409.10635","url":null,"abstract":"CoWrangler is a data-wrangling recommender system designed to streamline data\u0000processing tasks. Recognizing that data processing is often time-consuming and\u0000complex for novice users, we aim to simplify the decision-making process\u0000regarding the most effective subsequent data operation. By analyzing over\u000010,000 Kaggle notebooks spanning approximately 1,000 datasets, we derive\u0000insights into common data processing strategies employed by users across\u0000various tasks. This analysis helps us understand how dataset quality influences\u0000wrangling operations, informing our ongoing efforts to possibly expand our\u0000dataset sources in the future.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":"67 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142255217","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Existing methods for bulk loading disk-based multidimensional points involve multiple applications of external sorting. In this paper, we propose techniques that apply a linear scan and are therefore significantly faster. The resulting FMBI Index possesses several desirable properties, including almost full and square nodes with zero overlap, and has excellent query performance. As a second contribution, we develop an adaptive version AMBI, which utilizes the query workload to build a partial index only for parts of the data space that contain query results. Finally, we extend FMBI and AMBI to parallel bulk loading and query processing in distributed systems. An extensive experimental evaluation with real datasets confirms that FMBI and AMBI clearly outperform competitors in terms of combined index construction and query processing cost, sometimes by orders of magnitude.
{"title":"Fast and Adaptive Bulk Loading of Multidimensional Points","authors":"Moin Hussain Moti, Dimitris Papadias","doi":"arxiv-2409.09447","DOIUrl":"https://doi.org/arxiv-2409.09447","url":null,"abstract":"Existing methods for bulk loading disk-based multidimensional points involve\u0000multiple applications of external sorting. In this paper, we propose techniques\u0000that apply linear scan, and are therefore significantly faster. The resulting\u0000FMBI Index possesses several desirable properties, including almost full and\u0000square nodes with zero overlap, and has excellent query performance. As a\u0000second contribution, we develop an adaptive version AMBI, which utilizes the\u0000query workload to build a partial index only for parts of the data space that\u0000contain query results. Finally, we extend FMBI and AMBI to parallel bulk\u0000loading and query processing in distributed systems. An extensive experimental\u0000evaluation with real datasets confirms that FMBI and AMBI clearly outperform\u0000competitors in terms of combined index construction and query processing cost,\u0000sometimes by orders of magnitude.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142255218","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The Matrix Profile (MP), a versatile tool for time series data mining, has been shown effective in time series anomaly detection (TSAD). This paper delves into the problem of anomaly detection in multidimensional time series, a common occurrence in real-world applications. For instance, in a manufacturing factory, multiple sensors installed across the site collect time-varying data for analysis. The Matrix Profile, named for its role in profiling the matrix storing pairwise distances between subsequences of a univariate time series, becomes complex in multidimensional scenarios. If the input univariate time series has n subsequences, the pairwise distance matrix is an n x n matrix. In a multidimensional time series with d dimensions, the pairwise distance information must be stored in an n x n x d tensor. In this paper, we first analyze different strategies for condensing this tensor into a profile vector. We then investigate the potential of extending the MP to efficiently find k-nearest neighbors for anomaly detection. Finally, we benchmark the multidimensional MP against 19 baseline methods on 119 multidimensional TSAD datasets. The experiments cover three learning setups: unsupervised, supervised, and semi-supervised. MP is the only method that consistently delivers high performance across all setups.
{"title":"Matrix Profile for Anomaly Detection on Multidimensional Time Series","authors":"Chin-Chia Michael Yeh, Audrey Der, Uday Singh Saini, Vivian Lai, Yan Zheng, Junpeng Wang, Xin Dai, Zhongfang Zhuang, Yujie Fan, Huiyuan Chen, Prince Osei Aboagye, Liang Wang, Wei Zhang, Eamonn Keogh","doi":"arxiv-2409.09298","DOIUrl":"https://doi.org/arxiv-2409.09298","url":null,"abstract":"The Matrix Profile (MP), a versatile tool for time series data mining, has\u0000been shown effective in time series anomaly detection (TSAD). This paper delves\u0000into the problem of anomaly detection in multidimensional time series, a common\u0000occurrence in real-world applications. For instance, in a manufacturing\u0000factory, multiple sensors installed across the site collect time-varying data\u0000for analysis. The Matrix Profile, named for its role in profiling the matrix\u0000storing pairwise distance between subsequences of univariate time series,\u0000becomes complex in multidimensional scenarios. If the input univariate time\u0000series has n subsequences, the pairwise distance matrix is a n x n matrix. In a\u0000multidimensional time series with d dimensions, the pairwise distance\u0000information must be stored in a n x n x d tensor. In this paper, we first\u0000analyze different strategies for condensing this tensor into a profile vector.\u0000We then investigate the potential of extending the MP to efficiently find\u0000k-nearest neighbors for anomaly detection. Finally, we benchmark the\u0000multidimensional MP against 19 baseline methods on 119 multidimensional TSAD\u0000datasets. The experiments covers three learning setups: unsupervised,\u0000supervised, and semi-supervised. MP is the only method that consistently\u0000delivers high performance across all setups.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":"213 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142255220","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Process mining on business process execution data has focused primarily on orchestration-type processes performed in a single organization (intra-organizational). Collaborative (inter-organizational) processes, unlike those of the orchestration type, span several organizations (for example, in e-Government), adding complexity and various challenges both for their implementation and for the discovery, prediction, and analysis of their execution. Predictive process monitoring is based on exploiting execution data from past instances to predict the execution of current cases. It is possible to make predictions on the next activity and remaining time, among others, to anticipate possible deviations, violations, and delays in the processes and take preventive measures (e.g., re-allocation of resources). In this work, we propose an extension of traditional process prediction for collaborative processes, considering the particularities of this type of process, which add information of interest in this context, for example, which participant performs the next activity or which message will be exchanged next between two participants.
{"title":"Extending predictive process monitoring for collaborative processes","authors":"Daniel Calegari, Andrea Delgado","doi":"arxiv-2409.09212","DOIUrl":"https://doi.org/arxiv-2409.09212","url":null,"abstract":"Process mining on business process execution data has focused primarily on\u0000orchestration-type processes performed in a single organization\u0000(intra-organizational). Collaborative (inter-organizational) processes, unlike\u0000those of orchestration type, expand several organizations (for example, in\u0000e-Government), adding complexity and various challenges both for their\u0000implementation and for their discovery, prediction, and analysis of their\u0000execution. Predictive process monitoring is based on exploiting execution data\u0000from past instances to predict the execution of current cases. It is possible\u0000to make predictions on the next activity and remaining time, among others, to\u0000anticipate possible deviations, violations, and delays in the processes to take\u0000preventive measures (e.g., re-allocation of resources). In this work, we\u0000propose an extension for collaborative processes of traditional process\u0000prediction, considering particularities of this type of process, which add\u0000information of interest in this context, for example, the next activity of\u0000which participant or the following message to be exchanged between two\u0000participants.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":"48 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142255244","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Educational Process Mining (EPM) is a data analysis technique that is used to improve educational processes. It is based on Process Mining (PM), which involves gathering records (logs) of events to discover process models and analyze the data from a process-centric perspective. One specific application of EPM is curriculum mining, which focuses on understanding the learning program students follow to achieve educational goals. This is important for institutional curriculum decision-making and quality improvement. Therefore, academic institutions can benefit from organizing the existing techniques, capabilities, and limitations. We conducted a systematic literature review to identify works on applying PM to curricular analysis and provide insights for further research. From the analysis of 22 primary studies, we found that results can be classified into five categories concerning the objectives they pursue: the discovery of educational trajectories, the identification of deviations in the observed behavior of students, the analysis of bottlenecks, the analysis of stopout and dropout problems, and the generation of recommendations. Moreover, we identified some open challenges and opportunities, such as standardization to enable the replication of studies for cross-university curricular analysis, and strengthening the connection between PM and data mining for improving curricular analysis.
{"title":"A Systematic Review on Process Mining for Curricular Analysis","authors":"Daniel Calegari, Andrea Delgado","doi":"arxiv-2409.09204","DOIUrl":"https://doi.org/arxiv-2409.09204","url":null,"abstract":"Educational Process Mining (EPM) is a data analysis technique that is used to\u0000improve educational processes. It is based on Process Mining (PM), which\u0000involves gathering records (logs) of events to discover process models and\u0000analyze the data from a process-centric perspective. One specific application\u0000of EPM is curriculum mining, which focuses on understanding the learning\u0000program students follow to achieve educational goals. This is important for\u0000institutional curriculum decision-making and quality improvement. Therefore,\u0000academic institutions can benefit from organizing the existing techniques,\u0000capabilities, and limitations. We conducted a systematic literature review to\u0000identify works on applying PM to curricular analysis and provide insights for\u0000further research. From the analysis of 22 primary studies, we found that\u0000results can be classified into five categories concerning the objectives they\u0000pursue: the discovery of educational trajectories, the identification of\u0000deviations in the observed behavior of students, the analysis of bottlenecks,\u0000the analysis of stopout and dropout problems, and the generation of\u0000recommendation. Moreover, we identified some open challenges and opportunities,\u0000such as standardizing for replicating studies to perform cross-university\u0000curricular analysis and strengthening the connection between PM and data mining\u0000for improving curricular analysis.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":"15 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142255246","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We revisit the join ordering problem in query optimization. The standard exact algorithm, DPccp, has a worst-case running time of $O(3^n)$. This is prohibitively expensive for large queries, which are not that uncommon anymore. We develop a new algorithmic framework based on subset convolution. DPconv achieves a super-polynomial speedup over DPccp, breaking the $O(3^n)$ time-barrier for the first time. We show that the instantiation of our framework for the $C_{\max}$ cost function is up to 30x faster than DPccp for large clique queries.
{"title":"DPconv: Super-Polynomially Faster Join Ordering","authors":"Mihail Stoian, Andreas Kipf","doi":"arxiv-2409.08013","DOIUrl":"https://doi.org/arxiv-2409.08013","url":null,"abstract":"We revisit the join ordering problem in query optimization. The standard\u0000exact algorithm, DPccp, has a worst-case running time of $O(3^n)$. This is\u0000prohibitively expensive for large queries, which are not that uncommon anymore.\u0000We develop a new algorithmic framework based on subset convolution. DPconv\u0000achieves a super-polynomial speedup over DPccp, breaking the $O(3^n)$\u0000time-barrier for the first time. We show that the instantiation of our\u0000framework for the $C_max$ cost function is up to 30x faster than DPccp for\u0000large clique queries.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":"4 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223229","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Nikolaos Tziavelis, Wolfgang Gatterbauer, Mirek Riedewald
Ranked enumeration is a query-answering paradigm where the query answers are returned incrementally in order of importance (instead of returning all answers at once). Importance is defined by a ranking function that can be specific to the application, but typically involves either a lexicographic order (e.g., "ORDER BY R.A, S.B" in SQL) or a weighted sum of attributes (e.g., "ORDER BY 3*R.A + 2*S.B"). We recently introduced any-k algorithms for (multi-way) join queries, which push ranking into joins and avoid materializing intermediate results until necessary. The top-ranked answers are returned asymptotically faster than the common join-then-rank approach of database systems, resulting in orders-of-magnitude speedup in practice. In addition to their practical usefulness, our techniques complement a long line of theoretical research on unranked enumeration, where answers are also returned incrementally, but with no explicit ordering requirement. For a broad class of ranking functions with certain monotonicity properties, including lexicographic orders and sum-based rankings, the ordering requirement surprisingly does not increase the asymptotic time or space complexity, apart from logarithmic factors. A key insight of our work is the connection between ranked enumeration for database queries and the fundamental task of computing the kth-shortest path in a graph. Uncovering these connections allowed us to ground our approach in the rich literature of that problem and connect ideas that had been explored in isolation before. In this article, we adopt a pragmatic approach and present a slightly simplified version of the algorithm without the shortest-path interpretation. We believe that this will benefit practitioners looking to implement and optimize any-k approaches.
{"title":"Ranked Enumeration for Database Queries","authors":"Nikolaos Tziavelis, Wolfgang Gatterbauer, Mirek Riedewald","doi":"arxiv-2409.08142","DOIUrl":"https://doi.org/arxiv-2409.08142","url":null,"abstract":"Ranked enumeration is a query-answering paradigm where the query answers are\u0000returned incrementally in order of importance (instead of returning all answers\u0000at once). Importance is defined by a ranking function that can be specific to\u0000the application, but typically involves either a lexicographic order (e.g.,\u0000\"ORDER BY R.A, S.B\" in SQL) or a weighted sum of attributes (e.g., \"ORDER BY\u00003*R.A + 2*S.B\"). We recently introduced any-k algorithms for (multi-way) join\u0000queries, which push ranking into joins and avoid materializing intermediate\u0000results until necessary. The top-ranked answers are returned asymptotically\u0000faster than the common join-then-rank approach of database systems, resulting\u0000in orders-of-magnitude speedup in practice. In addition to their practical usefulness, our techniques complement a long\u0000line of theoretical research on unranked enumeration, where answers are also\u0000returned incrementally, but with no explicit ordering requirement. For a broad\u0000class of ranking functions with certain monotonicity properties, including\u0000lexicographic orders and sum-based rankings, the ordering requirement\u0000surprisingly does not increase the asymptotic time or space complexity, apart\u0000from logarithmic factors. A key insight of our work is the connection between ranked enumeration for\u0000database queries and the fundamental task of computing the kth-shortest path in\u0000a graph. Uncovering these connections allowed us to ground our approach in the\u0000rich literature of that problem and connect ideas that had been explored in\u0000isolation before. In this article, we adopt a pragmatic approach and present a\u0000slightly simplified version of the algorithm without the shortest-path\u0000interpretation. We believe that this will benefit practitioners looking to\u0000implement and optimize any-k approaches.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":"61 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223228","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ethan Steinberg, Michael Wornow, Suhana Bedi, Jason Alan Fries, Matthew B. A. McDermott, Nigam H. Shah
The growing demand for machine learning in healthcare requires processing increasingly large electronic health record (EHR) datasets, but existing pipelines are not computationally efficient or scalable. In this paper, we introduce meds_reader, an optimized Python package for efficient EHR data processing that is designed to take advantage of many intrinsic properties of EHR data for improved speed. We then demonstrate the benefits of meds_reader by reimplementing key components of two major EHR processing pipelines, achieving 10-100x improvements in memory, speed, and disk usage. The code for meds_reader can be found at https://github.com/som-shahlab/meds_reader.
{"title":"meds_reader: A fast and efficient EHR processing library","authors":"Ethan Steinberg, Michael Wornow, Suhana Bedi, Jason Alan Fries, Matthew B. A. McDermott, Nigam H. Shah","doi":"arxiv-2409.09095","DOIUrl":"https://doi.org/arxiv-2409.09095","url":null,"abstract":"The growing demand for machine learning in healthcare requires processing\u0000increasingly large electronic health record (EHR) datasets, but existing\u0000pipelines are not computationally efficient or scalable. In this paper, we\u0000introduce meds_reader, an optimized Python package for efficient EHR data\u0000processing that is designed to take advantage of many intrinsic properties of\u0000EHR data for improved speed. We then demonstrate the benefits of meds_reader by\u0000reimplementing key components of two major EHR processing pipelines, achieving\u000010-100x improvements in memory, speed, and disk usage. The code for meds_reader\u0000can be found at https://github.com/som-shahlab/meds_reader.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":"24 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142255243","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Albert K. Engstfeld, Johannes M. Hermann, Nicolas G. Hörmann, Julian Rüth
According to the FAIR (findability, accessibility, interoperability, and reusability) principles, scientific data should always be stored with machine-readable descriptive metadata. Existing solutions to store data with metadata, such as electronic lab notebooks (ELN), are often very domain-specific and not sufficiently generic for arbitrary experimental or computational results. In this work, we present the open-source echemdb toolkit for creating and handling data and metadata. The toolkit runs entirely at the file-system level using a file-based approach, which facilitates integration with other tools in a FAIR data life cycle and means that no complicated server setup is required. This also makes the toolkit more accessible to the average researcher, since no understanding of more sophisticated database technologies is required. We showcase several aspects and applications of the toolkit: automatic annotation of raw research data with human- and machine-readable metadata, data conversion into standardised frictionless Data Packages, and an API for exploring the data. We also present web frameworks for illustrating the data, using example data from research into energy conversion and storage.
{"title":"echemdb Toolkit -- a Lightweight Approach to Getting Data Ready for Data Management Solutions","authors":"Albert K. Engstfeld, Johannes M. Hermann, Nicolas G. Hörmann, Julian Rüth","doi":"arxiv-2409.07083","DOIUrl":"https://doi.org/arxiv-2409.07083","url":null,"abstract":"According to the FAIR (findability, accessibility, interoperability, and\u0000reusability) principles, scientific data should always be stored with\u0000machine-readable descriptive metadata. Existing solutions to store data with\u0000metadata, such as electronic lab notebooks (ELN), are often very\u0000domain-specific and not sufficiently generic for arbitrary experimental or\u0000computational results. In this work, we present open-source echemdb toolkit for creating and\u0000handling data and metadata. The toolkit is running entirely on the file system\u0000level using a file-based approach, which facilitates integration with other\u0000tools in a FAIR data life cycle and means that no complicated server setup is\u0000required. This also makes the toolkit more accessible to the average researcher\u0000since no understanding of more sophisticated database technologies is required. We showcase several aspects and applications of the toolkit: automatic\u0000annotation of raw research data with human- and machine-readable metadata, data\u0000conversion into standardised frictionless Data Packages, and an API for\u0000exploring the data. We also illustrate the web frameworks to illustrate the\u0000data using example data from research into energy conversion and storage.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":"60 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223230","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}