SDTA: An Algebra for Statistical Data Transformation

33rd International Conference on Scientific and Statistical Database Management. Pub Date: 2021-07-06. DOI: 10.1145/3468791.3468811

Jie Song, H. Jagadish, George Alter


Abstract

Statistical data manipulation is a crucial component of many data science analytic pipelines, particularly as part of data ingestion. This task is generally accomplished by writing transformation scripts in languages such as SPSS, Stata, SAS, R, and Python (Pandas). The disparate data models, language representations, and transformation operations supported by these tools make it hard for end users to understand and document the transformations performed, and for developers to port transformation code across languages. To tackle these challenges, we present a formal paradigm for statistical data transformation. It consists of three parts: a data model, called Structured Data Transformation Data Model (SDTDM), inspired by the data models of multiple statistical transformation frameworks; an algebra, Structural Data Transformation Algebra (SDTA), that can transform not only data within SDTDM but also metadata at multiple structural levels; and an equivalent descriptive counterpart, called Structured Data Transformation Language (SDTL), recently adopted by the DDI Alliance, which maintains international standards for metadata, as part of its suite of products. Experiments with real statistical transformations on socio-economic data show that SDTL can successfully represent 86.1% of 4,185 SAS commands and 91.6% of 9,087 SPSS commands obtained from a repository. We illustrate with examples how SDTA/SDTL could assist with the documentation of statistical data transformation, an important aspect often neglected in the metadata of datasets. We propose a system called C2Metadata that automatically captures the transformation and provenance information in SDTL as part of the metadata. Moreover, given the conversion mechanism from a source statistical language to SDTA/SDTL, we show how transformation programs could be converted to functionally equivalent programs in the same or a different language, permitting code reuse and result reproducibility. We also illustrate the possibility of using SDTA to optimize SDTL transformations through rule-based rewrites similar to SQL optimizations.
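As a rough illustration of the two ideas in the abstract (a language-neutral representation that can be rendered into several statistical languages, and rule-based rewriting of a transformation program), the following Python sketch may help. It is not the SDTL schema, the SDTA operator set, or the C2Metadata implementation: the `Compute` class, the renderer functions, and the `drop_dead_computes` rule are all hypothetical names invented for this example, and the rewrite shown is ordinary dead-assignment elimination standing in for the paper's rewrites.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Compute:
    """Hypothetical neutral command: target = left <op> right.

    Only variable-to-variable binary expressions are modeled here.
    """
    target: str
    left: str
    op: str
    right: str


def to_spss(cmd: Compute) -> str:
    # SPSS assigns with: COMPUTE target = expression.
    return f"COMPUTE {cmd.target} = {cmd.left} {cmd.op} {cmd.right}."


def to_pandas(cmd: Compute, frame: str = "df") -> str:
    # Pandas assigns a new or existing column from two other columns.
    return (
        f'{frame}["{cmd.target}"] = '
        f'{frame}["{cmd.left}"] {cmd.op} {frame}["{cmd.right}"]'
    )


def drop_dead_computes(program: list[Compute]) -> list[Compute]:
    """Illustrative rewrite rule (an assumption, not taken from the paper).

    A Compute is removed when a later command overwrites its target
    before any command reads it, so the first result is never observed.
    """
    kept = []
    for i, cmd in enumerate(program):
        dead = False
        for later in program[i + 1:]:
            if cmd.target in (later.left, later.right):
                break  # the value is read, so the command must stay
            if later.target == cmd.target:
                dead = True  # overwritten without being read
                break
        if not dead:
            kept.append(cmd)
    return kept


# Hypothetical three-command program; variable names are made up.
program = [
    Compute("ratio", "income", "/", "hhsize"),
    Compute("ratio", "income", "*", "weight"),
    Compute("adj", "ratio", "-", "tax"),
]

optimized = drop_dead_computes(program)
for cmd in optimized:
    print(to_spss(cmd))
    print(to_pandas(cmd))
```

Keeping the command as plain data, separate from any one language's syntax, is what makes both steps possible in this sketch: the same `Compute` object feeds every renderer, and the rewrite rule inspects the program without parsing SPSS or Python text.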


