Auditable and reusable crosswalks for fast, scaled integration of scattered tabular data

arXiv - CS - Databases | Pub Date: 2024-09-03 | DOI: arxiv-2409.01517

Gavin Chait


Abstract

This paper presents an open-source curatorial toolkit intended to produce well-structured and interoperable data. Curation is divided into discrete components, with a schema-centric focus for auditable restructuring of complex and scattered tabular data to conform to a destination schema. Task separation allows development of software and analysis without source data being present. Transformations are captured as high-level sequential scripts describing schema-to-schema mappings, reducing complexity and resource requirements. Ultimately, data are transformed, but the objective is that any data meeting a schema definition can be restructured using a crosswalk. The toolkit is available both as a Python package and as a 'no-code' visual web application. A visual example is presented, derived from a longitudinal study where scattered source data from hundreds of local councils are integrated into a single database.
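To make the idea of a crosswalk concrete, here is a minimal sketch in plain Python. It is not the toolkit's API: the action names (`RENAME`, `TO_INT`, `SUM`), the field names, the sample row, and the `apply_crosswalk` helper are all hypothetical, chosen only to illustrate a sequential schema-to-schema script that can be written and audited without the source data being present.

```python
# Hypothetical destination schema: field name -> required Python type.
DESTINATION_SCHEMA = {"council": str, "year": int, "rate_gbp": float}

# A crosswalk as an ordered list of steps:
# (action, destination field, source fields).
# Action names are illustrative, not the toolkit's own vocabulary.
CROSSWALK = [
    ("RENAME", "council", ["Local Authority"]),
    ("TO_INT", "year", ["Yr"]),
    ("SUM", "rate_gbp", ["Base rate", "Supplement"]),
]

# Each action receives the list of source values named in its step.
ACTIONS = {
    "RENAME": lambda values: values[0],
    "TO_INT": lambda values: int(values[0]),
    "SUM": lambda values: sum(float(v) for v in values),
}


def apply_crosswalk(rows, crosswalk, schema):
    """Restructure source rows into rows that conform to the destination schema."""
    restructured = []
    for row in rows:
        new_row = {}
        # Steps run in order, so the script doubles as an audit trail.
        for action, destination, sources in crosswalk:
            new_row[destination] = ACTIONS[action]([row[s] for s in sources])
        # Check that every destination field is present with the expected type.
        for field, expected_type in schema.items():
            if not isinstance(new_row.get(field), expected_type):
                raise TypeError(f"{field!r} is not of type {expected_type.__name__}")
        restructured.append(new_row)
    return restructured


# Invented sample data standing in for one council's source table.
source_rows = [
    {"Local Authority": "Leeds", "Yr": "2019", "Base rate": "10.5", "Supplement": "1.5"},
]

result = apply_crosswalk(source_rows, CROSSWALK, DESTINATION_SCHEMA)
print(result)
```

Because the crosswalk refers only to field names in the source and destination schemas, the same script could in principle be reused on any table that meets the source schema, which is the reuse property the abstract describes.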


