Auditable and reusable crosswalks for fast, scaled integration of scattered tabular data
Gavin Chait
arXiv - CS - Databases, published 2024-09-03
DOI: arxiv-2409.01517 (https://doi.org/arxiv-2409.01517)
Abstract
This paper presents an open-source curatorial toolkit intended to produce
well-structured and interoperable data. Curation is divided into discrete
components, with a schema-centric focus for auditable restructuring of complex
and scattered tabular data to conform to a destination schema. Task separation
allows development of software and analysis without source data being present.
Transformations are captured as high-level sequential scripts describing
schema-to-schema mappings, reducing complexity and resource requirements.
Ultimately, data are transformed, but the objective is that any data meeting a
schema definition can be restructured using a crosswalk. The toolkit is
available both as a Python package and as a 'no-code' visual web application.
A visual example is presented, derived from a longitudinal study where
scattered source data from hundreds of local councils are integrated into a
single database.
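
To make the idea of a crosswalk as a sequential, schema-to-schema script concrete, here is a minimal sketch in plain Python. It does not use the toolkit's own API. The schemas, field names, action names (`RENAME`, `TO_FLOAT`) and the helper functions `conforms` and `apply_crosswalk` are all hypothetical, chosen only for illustration. The point it shows is that the crosswalk is written against schemas, so it can be authored without source data present and then applied to any rows that meet the source schema.

```python
# Illustrative sketch only: every name below is an assumption, not the toolkit's API.

# Hypothetical schemas: field name -> expected Python type.
SOURCE_SCHEMA = {"Council": str, "Rateable value (GBP)": str}
DESTINATION_SCHEMA = {"council_name": str, "rateable_value": float}

# The crosswalk: an ordered, human-readable list of
# (action, destination field, source field) steps. It refers to schemas only.
CROSSWALK = [
    ("RENAME", "council_name", "Council"),
    ("TO_FLOAT", "rateable_value", "Rateable value (GBP)"),
]

# Hypothetical actions available to a crosswalk step.
ACTIONS = {
    "RENAME": lambda value: value,
    "TO_FLOAT": lambda value: float(str(value).replace(",", "")),
}


def conforms(row, schema):
    """Return True if the row has exactly the schema's fields with matching types."""
    return set(row) == set(schema) and all(
        isinstance(row[name], kind) for name, kind in schema.items()
    )


def apply_crosswalk(rows, crosswalk, source_schema, destination_schema):
    """Apply each crosswalk step in order to every row meeting the source schema."""
    restructured = []
    for row in rows:
        if not conforms(row, source_schema):
            raise ValueError(f"row does not meet the source schema: {row!r}")
        new_row = {
            destination: ACTIONS[action](row[source])
            for action, destination, source in crosswalk
        }
        if not conforms(new_row, destination_schema):
            raise ValueError(f"row does not meet the destination schema: {new_row!r}")
        restructured.append(new_row)
    return restructured


# Made-up example data, not taken from the study described above.
rows = [{"Council": "Example Council", "Rateable value (GBP)": "12,500"}]
result = apply_crosswalk(rows, CROSSWALK, SOURCE_SCHEMA, DESTINATION_SCHEMA)
```

Because the steps are stored as data rather than ad hoc code, the same list can be reviewed as an audit trail and reused on another source that conforms to the same schema.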