Data Quality Management for Big Data Applications

Majida Yaseen Khaleel, Murtadha M. Hamad
Published in: 2019 12th International Conference on Developments in eSystems Engineering (DeSE), pp. 357-362, 2019-10-01. DOI: 10.1109/DeSE.2019.00072 (https://doi.org/10.1109/DeSE.2019.00072)
Citations: 1

Abstract

As data volumes continue to grow, a key challenge is developing systems and applications that can store, manage, and process large amounts of data, much of which is unstructured. Traditional data management approaches are ill-suited to data of this size and complexity. Hadoop is a suitable solution for continuously growing data volumes: its main characteristics are distributed processing, large storage capacity, and easy administration, and it is best known for its distributed file system. In this paper, we propose techniques and algorithms for handling big data, covering data collection, data preprocessing, data cleaning algorithms, a technique for converting unstructured data to structured data using metadata, a distributed data file system (fragmentation algorithm), and quality assurance algorithms based on a statistical model for evaluating higher education institutions. We conclude that metadata speeds up query response and simplifies query execution, and that it supplies the content for reports, fields, and descriptions. The total access time for three complex queries is reported as 00:03:00 with distributed processing versus 00:15:77 with non-distributed processing, with a reported average of approximately five minutes. For quality assurance, comparing the scientific and humanities sets gives a t-test value of 0.239 against a t-distribution value (T-dis) of 1.96. By the comparison rule, because the t-test value is smaller than the T-dis value, there is no difference between the means of the scientific and humanities samples. The coefficient of variation (C.V) is 8.585 for the scientific set and 7.427 for the humanities set. By the homogeneity rule, the set with the smaller C.V is the more homogeneous, so the humanities set is the more homogeneous of the two.
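The two decision rules described in the abstract (compare the t-test value with T-dis, then compare C.V values) can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names are hypothetical, the abstract does not say which form of the t statistic was used (the unpooled two-sample form is assumed here), and C.V is assumed to be the sample standard deviation divided by the mean, expressed as a percentage.

```python
import math
import statistics


def coefficient_of_variation(values):
    """C.V as a percentage: sample standard deviation / mean * 100 (assumed definition)."""
    return statistics.stdev(values) / statistics.mean(values) * 100.0


def t_statistic(a, b):
    """Two-sample t statistic with unpooled variances (an assumed variant)."""
    standard_error = math.sqrt(
        statistics.variance(a) / len(a) + statistics.variance(b) / len(b)
    )
    return (statistics.mean(a) - statistics.mean(b)) / standard_error


def means_differ(t_value, t_critical=1.96):
    """Comparison rule: the means differ only if |t| exceeds the tabulated value."""
    return abs(t_value) > t_critical


def more_homogeneous(cv_by_set):
    """Homogeneity rule: the set with the smallest C.V is the most homogeneous."""
    return min(cv_by_set, key=cv_by_set.get)


# Applying the rules to the values quoted in the abstract.
print(means_differ(0.239))  # False: no difference between the means
print(more_homogeneous({"scientific": 8.585, "humanities": 7.427}))  # humanities
```

Given raw scores for each set, `t_statistic` and `coefficient_of_variation` would produce the inputs to the two rules; the abstract reports only the resulting values, so no raw data is assumed here.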