Variety of data in the ETL processes in the cloud: State of the art

Papa Senghane Diouf, Aliou Boly, S. Ndiaye
Published in: 2018 IEEE International Conference on Innovative Research and Development (ICIRD)
Publication date: 2018-05-11
DOI: 10.1109/ICIRD.2018.8376308 (https://doi.org/10.1109/ICIRD.2018.8376308)
Citations: 10

Abstract

ETL (Extract-Transform-Load) processes are responsible for integrating data into a repository called a data warehouse. In the ETL phase, data are extracted from various sources and transformed before being loaded into the data warehouse. ETL is therefore a mandatory step in the decision-making process, but it is also a long step that is costly in human and IT resources. Moreover, in the context of big data, characterized by the 3Vs (Volume, Variety, Velocity), processing speed has become a decisive factor in the search for competitiveness. To facilitate the implementation of ETL, one solution is to use cloud computing infrastructures, whose computation and storage resources are "unlimited". This has brought considerable progress in availability and scalability, which helps projects succeed. A major problem remains, however: with the cloud's "pay-per-use" model, the cost can quickly become prohibitive. It is in this context that we carried out a state-of-the-art review of the performance of ETL processes in the cloud in terms of volume and velocity. In that review, the approaches differ by ETL strategy: some authors propose solutions that rely on the classical ETL approach and use parallelization techniques such as MapReduce, while others argue that a big data environment requires new ETL strategies to meet big data challenges. That study showed that, despite the many solutions proposed in the literature, the issue of data integration in a big data environment remains open. In addition, ETL tools must also deal with the heterogeneity of data formats and structures. Because our previous work in this area was limited to the volume and velocity of data, in this paper we review studies that address variety in big data integration in the cloud.
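The variety problem described in the abstract can be illustrated with a minimal sketch, which is not taken from the paper. It assumes two hypothetical sources, one CSV and one nested JSON, and a hypothetical target schema (`id`, `name`, `amount`). All sample data, field names, and function names are illustrative assumptions, and a plain Python list stands in for the data warehouse.

```python
import csv
import io
import json

# Hypothetical sample inputs standing in for two heterogeneous sources.
CSV_SOURCE = "id,name,amount\n1,alice,10.5\n2,bob,3\n"
JSON_SOURCE = '[{"id": 3, "customer": {"name": "carol"}, "total": "7.25"}]'


def extract_csv(text):
    """Extract: read rows from a CSV source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def extract_json(text):
    """Extract: read records from a JSON source."""
    return json.loads(text)


def transform_csv(row):
    """Transform: map a flat CSV row onto the common target schema."""
    return {
        "id": int(row["id"]),
        "name": row["name"],
        "amount": float(row["amount"]),
    }


def transform_json(rec):
    """Transform: flatten a nested JSON record onto the same schema."""
    return {
        "id": int(rec["id"]),
        "name": rec["customer"]["name"],
        "amount": float(rec["total"]),
    }


def load(rows, warehouse):
    """Load: append unified rows to the target store (a list in this sketch)."""
    warehouse.extend(rows)
    return warehouse


def run_etl():
    """Run extract, transform, and load for both sources in turn."""
    warehouse = []
    load([transform_csv(r) for r in extract_csv(CSV_SOURCE)], warehouse)
    load([transform_json(r) for r in extract_json(JSON_SOURCE)], warehouse)
    return warehouse


if __name__ == "__main__":
    for row in run_etl():
        print(row)
```

Each source needs its own transform function because field names, nesting, and value types differ between them. That per-source mapping effort is the heterogeneity cost the abstract refers to.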