Impact of Small Files on Hadoop Performance: Literature Survey and Open Points

T. El-Sayed, M. Badawy, A. El-Sayed
Journal: Menoufia Journal of Electronic Engineering Research
Volume: 45 1
Publication date: 2019-01-01
DOI: https://doi.org/10.21608/mjeer.2019.62728
Citations: 4

Abstract

Hadoop is an open-source framework written in Java and used for big data processing. It consists of two main components: the Hadoop Distributed File System (HDFS) and MapReduce. HDFS stores the data, while MapReduce distributes an application's tasks and processes them in a distributed fashion. Recently, several researchers have employed Hadoop to process big data. Their results indicate that Hadoop performs well with large files (files larger than the DataNode block size). Nevertheless, Hadoop performance decreases with small files, that is, files smaller than its block size. This is because small files consume memory on both the DataNode and the NameNode and increase the execution time of applications (i.e., they decrease MapReduce performance). In this paper, the small files problem in Hadoop is defined, and the existing approaches to solving it are classified and discussed. In addition, some open points are presented that must be considered when designing a better approach to improving Hadoop performance on small files.
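The memory and execution-time argument above can be illustrated with a rough back-of-the-envelope estimate. The sketch below is not taken from the paper. It assumes a 128 MiB HDFS block size and the commonly quoted rule of thumb that each file and each block costs roughly 150 bytes of NameNode heap, and it assumes one map task per block-sized input split. The names `BYTES_PER_OBJECT`, `BLOCK_SIZE`, `namenode_heap_bytes`, and `map_tasks` are hypothetical and introduced only for illustration.

```python
import math

# Assumptions (not from the paper): rule-of-thumb NameNode heap cost per
# namespace object, and a 128 MiB HDFS block size.
BYTES_PER_OBJECT = 150
BLOCK_SIZE = 128 * 1024**2


def blocks_per_file(file_size, block_size=BLOCK_SIZE):
    """Number of HDFS blocks needed to hold one file of file_size bytes."""
    return math.ceil(file_size / block_size)


def namenode_heap_bytes(num_files, file_size, block_size=BLOCK_SIZE):
    """Estimated heap: one object per file plus one object per block."""
    objects = num_files * (1 + blocks_per_file(file_size, block_size))
    return objects * BYTES_PER_OBJECT


def map_tasks(num_files, file_size, block_size=BLOCK_SIZE):
    """Estimated map tasks, assuming one input split per block of each file."""
    return num_files * blocks_per_file(file_size, block_size)


if __name__ == "__main__":
    total = 1024**4  # the same 1 TiB of data, stored two different ways
    for label, size in (("1 MiB files", 1024**2), ("1 GiB files", 1024**3)):
        n = total // size
        heap_mib = namenode_heap_bytes(n, size) / 1024**2
        print(f"{label}: {n} files, ~{heap_mib:.1f} MiB heap, "
              f"{map_tasks(n, size)} map tasks")
```

Under these assumptions, storing 1 TiB as 1 MiB files needs about 300 MiB of NameNode heap and over a million map tasks, whereas storing it as 1 GiB files needs about 1.3 MiB and 8192 map tasks. The exact constants vary by Hadoop version and configuration, so the figures should be read as orders of magnitude only.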