{"title":"Data Quality Management for Big Data Applications","authors":"Majida Yaseen Khaleel, Murtadha M. Hamad","doi":"10.1109/DeSE.2019.00072","DOIUrl":null,"url":null,"abstract":"Currently, as a result of the continuous increase of data, one of the key issues is the development of systems and applications to deal with storage, management and processing of big numbers of data. These data are found in unstructured ways. Data management with traditional approaches is inappropriate because of the large and complex data sizes. Hadoop is a suitable solution for the continuous increase in data sizes. The important characteristics of the Hadoop are distributed processing, high storage space, and easy administration. Hadoop is better known for distributed file systems. In this paper, we have proposed techniques and algorithms that deal with big data including data collecting, data preprocessing, algorithms for data cleaning, A Technique for Converting Unstructured Data to Structured Data using metadata, distributed data file system (fragmentation algorithm) and Quality assurance algorithms by using the model is the statistical model to evaluate the highest educational institutions. We concluded that Metadata accelerates query response required and facilitates query execution, metadata will be content for reports, fields and descriptions. Total time access for three complex queries in distributed processing it is 00: 03: 00 per second while in nondistributed processing it is at 00: 15: 77 per second, average is approximately five minutes per second. Quality assurance note values (T-test) is 0.239 and values (T-dis) is 1.96, as a result of dealing with scientific sets and humanities sets. In the comparison law, it can be deduced that if the t-test is smaller than the t-dis; so there is no difference between the mean of the scientific and humanities samples, the values of C.V for both scientific is (8.585) and humanities sets is (7.427), using the law of homogeneity know whether any sets are more homogeneous whenever the value of a small C.V was more homogeneous however the humanity set is more homogeneity.","PeriodicalId":6632,"journal":{"name":"2019 12th International Conference on Developments in eSystems Engineering (DeSE)","volume":"60 1","pages":"357-362"},"PeriodicalIF":0.0000,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 12th International Conference on Developments in eSystems Engineering (DeSE)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/DeSE.2019.00072","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1
Abstract
Currently, as a result of the continuous increase of data, one of the key issues is the development of systems and applications to deal with storage, management and processing of big numbers of data. These data are found in unstructured ways. Data management with traditional approaches is inappropriate because of the large and complex data sizes. Hadoop is a suitable solution for the continuous increase in data sizes. The important characteristics of the Hadoop are distributed processing, high storage space, and easy administration. Hadoop is better known for distributed file systems. In this paper, we have proposed techniques and algorithms that deal with big data including data collecting, data preprocessing, algorithms for data cleaning, A Technique for Converting Unstructured Data to Structured Data using metadata, distributed data file system (fragmentation algorithm) and Quality assurance algorithms by using the model is the statistical model to evaluate the highest educational institutions. We concluded that Metadata accelerates query response required and facilitates query execution, metadata will be content for reports, fields and descriptions. Total time access for three complex queries in distributed processing it is 00: 03: 00 per second while in nondistributed processing it is at 00: 15: 77 per second, average is approximately five minutes per second. Quality assurance note values (T-test) is 0.239 and values (T-dis) is 1.96, as a result of dealing with scientific sets and humanities sets. In the comparison law, it can be deduced that if the t-test is smaller than the t-dis; so there is no difference between the mean of the scientific and humanities samples, the values of C.V for both scientific is (8.585) and humanities sets is (7.427), using the law of homogeneity know whether any sets are more homogeneous whenever the value of a small C.V was more homogeneous however the humanity set is more homogeneity.