Enabling data-centric AI through data quality management and data literacy

Journal: IT-Information Technology (IF 1, Q4, Computer Science, Information Systems) · Pub Date: 2022-02-18 · DOI: 10.1515/itit-2021-0048
Ziawasch Abedjan
Citations: 2

Abstract

Data is being produced at an intractable pace. At the same time, there is an insatiable interest in using such data for use cases that span all imaginable domains, including health, climate, business, and gaming. Beyond the novel socio-technical challenges that surround data-driven innovations, there are still open data processing challenges that impede the usability of data-driven techniques. It is commonly acknowledged that overcoming the syntactic and semantic heterogeneity of data in order to combine various sources for a common goal is a major bottleneck. Furthermore, the quality of such data is always in question, as today's data science pipelines are highly ad hoc and lack the necessary care for provenance. Finally, quality criteria that go beyond the syntactic and semantic correctness of individual values and also incorporate population-level constraints, such as equal parity and opportunity with regard to protected groups, play an increasingly important role in this process. Traditional research on data integration focused on the post-merger integration of companies, where customer or product databases had to be integrated. While this is often hard enough, today the challenges are aggravated because more stakeholders are using data analytics tools to derive domain-specific insights. I call this phenomenon the democratization of data science, a process that is both challenging and necessary. Novel systems need to be user-friendly enough that not only trained database administrators but also stakeholders less versed in computer science can handle them. Thus, our research focuses on scalable example-driven techniques for data preparation and curation. Furthermore, we believe that it is important to educate the breadth of society on the implications of a data-driven world and to actively promote the concept of data literacy as a fundamental competence.
Source journal: IT-Information Technology (Computer Science, Information Systems)
CiteScore: 3.80
Self-citation rate: 0.00%
Articles published: 29