{"title":"Impact of Small Files on Hadoop Performance: Literature Survey and Open Points","authors":"T. El-Sayed, M. Badawy, A. El-Sayed","doi":"10.21608/mjeer.2019.62728","DOIUrl":null,"url":null,"abstract":"Hadoop is an open-source framework written by java and used for bigdata processing. It consists of two main components: HadoopDistributed File System (HDFS) and MapReduce. HDFS is used tostore data while MapReduce is used to distribute and process anapplication tasks in a distributed processing form. Recently, severalresearchers employ Hadoop for processing big data. The resultsindicate that Hadoop performs well with Large Files (files larger thanData Node block size). Nevertheless, Hadoop performance decreaseswith small files that are less than its block size. This is because, smallfiles consume the memory of both the DataNode and the NameNode,and increases the execution time of the applications (i.e. decreasesMapReduce performance). In this paper, the problem of the small filesin Hadoop is defined and the existing approaches to solve this problemare classified and discussed. In addition, some open points that mustbe considered when thinking of a better approach to improve theHadoop performance when processing the small files.","PeriodicalId":218019,"journal":{"name":"Menoufia Journal of Electronic Engineering Research","volume":"45 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Menoufia Journal of Electronic Engineering Research","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.21608/mjeer.2019.62728","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 4
Abstract
Hadoop is an open-source framework written in Java and used for big data processing. It consists of two main components: the Hadoop Distributed File System (HDFS) and MapReduce. HDFS is used to store data, while MapReduce is used to distribute and process an application's tasks in a distributed processing form. Recently, several researchers have employed Hadoop for processing big data. The results indicate that Hadoop performs well with large files (files larger than the DataNode block size). Nevertheless, Hadoop's performance decreases with small files, i.e. files smaller than its block size. This is because small files consume the memory of both the DataNode and the NameNode, and they increase the execution time of applications (i.e. they decrease MapReduce performance). In this paper, the small-files problem in Hadoop is defined, and the existing approaches to solving it are classified and discussed. In addition, some open points that must be considered when devising a better approach to improving Hadoop's performance on small files are highlighted.
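To make the NameNode memory pressure concrete, the following is a minimal back-of-the-envelope sketch in Java (not from the paper). It assumes the commonly cited figure of roughly 150 bytes of NameNode heap per namespace object (a file, directory, or block) and the default 128 MB HDFS block size; the class and method names are illustrative only, and actual heap usage varies by Hadoop version.

```java
/**
 * Back-of-the-envelope estimate of NameNode heap consumed by file metadata.
 * ASSUMPTION: ~150 bytes of heap per namespace object (file or block) and a
 * 128 MB block size; these are commonly cited defaults, not exact figures.
 */
public class NameNodeHeapEstimate {
    static final long BYTES_PER_OBJECT = 150;          // assumed heap cost per file/block object
    static final long BLOCK_SIZE = 128L * 1024 * 1024; // default HDFS block size (128 MB)

    // Each file costs one file object plus one object per occupied block.
    static long heapBytes(long fileCount, long fileSizeBytes) {
        long blocksPerFile = Math.max(1, (fileSizeBytes + BLOCK_SIZE - 1) / BLOCK_SIZE);
        return fileCount * (1 + blocksPerFile) * BYTES_PER_OBJECT;
    }

    public static void main(String[] args) {
        long totalData = 1024L * 1024 * 1024 * 1024;   // 1 TB of data in both scenarios
        long small = 1024L * 1024;                     // 1 MB "small" files
        long large = 1024L * 1024 * 1024;              // 1 GB "large" files

        System.out.printf("1 TB as 1 MB files: %,d MB of NameNode heap%n",
                heapBytes(totalData / small, small) / (1024 * 1024));
        System.out.printf("1 TB as 1 GB files: %,d MB of NameNode heap%n",
                heapBytes(totalData / large, large) / (1024 * 1024));
    }
}
```

Under these assumptions, storing 1 TB as one-megabyte files consumes roughly 300 MB of NameNode heap, versus about 1 MB when the same data is stored as one-gigabyte files, which is why the approaches surveyed in the paper typically pack small files into larger containers.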