{"title":"Impact of Small Files on Hadoop Performance: Literature Survey and Open Points","authors":"T. El-Sayed, M. Badawy, A. El-Sayed","doi":"10.21608/mjeer.2019.62728","DOIUrl":null,"url":null,"abstract":"Hadoop is an open-source framework written by java and used for bigdata processing. It consists of two main components: HadoopDistributed File System (HDFS) and MapReduce. HDFS is used tostore data while MapReduce is used to distribute and process anapplication tasks in a distributed processing form. Recently, severalresearchers employ Hadoop for processing big data. The resultsindicate that Hadoop performs well with Large Files (files larger thanData Node block size). Nevertheless, Hadoop performance decreaseswith small files that are less than its block size. This is because, smallfiles consume the memory of both the DataNode and the NameNode,and increases the execution time of the applications (i.e. decreasesMapReduce performance). In this paper, the problem of the small filesin Hadoop is defined and the existing approaches to solve this problemare classified and discussed. In addition, some open points that mustbe considered when thinking of a better approach to improve theHadoop performance when processing the small files.","PeriodicalId":218019,"journal":{"name":"Menoufia Journal of Electronic Engineering Research","volume":"45 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Menoufia Journal of Electronic Engineering Research","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.21608/mjeer.2019.62728","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 4
Abstract
Hadoop is an open-source framework written in Java and used for big data processing. It consists of two main components: the Hadoop Distributed File System (HDFS) and MapReduce. HDFS is used to store data, while MapReduce is used to distribute and process an application's tasks in a distributed processing form. Recently, several researchers have employed Hadoop for processing big data. The results indicate that Hadoop performs well with large files (files larger than the DataNode block size). Nevertheless, Hadoop's performance decreases with small files, i.e. files smaller than its block size. This is because small files consume the memory of both the DataNode and the NameNode, and they increase the execution time of applications (i.e. they decrease MapReduce performance). In this paper, the small-files problem in Hadoop is defined, and the existing approaches to solving it are classified and discussed. In addition, some open points that must be considered when devising a better approach to improving Hadoop's performance on small files are highlighted.
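To make the NameNode memory pressure concrete, the following is a minimal back-of-the-envelope sketch in Java (not from the paper). It assumes the commonly cited figure of roughly 150 bytes of NameNode heap per namespace object (a file, directory, or block) and the default 128 MB HDFS block size; the class and method names are illustrative only, and actual heap usage varies by Hadoop version.

```java
/**
 * Back-of-the-envelope estimate of NameNode heap consumed by file metadata.
 * ASSUMPTION: ~150 bytes of heap per namespace object (file or block) and a
 * 128 MB block size; these are commonly cited defaults, not exact figures.
 */
public class NameNodeHeapEstimate {
    static final long BYTES_PER_OBJECT = 150;          // assumed heap cost per file/block object
    static final long BLOCK_SIZE = 128L * 1024 * 1024; // default HDFS block size (128 MB)

    // Each file costs one file object plus one object per occupied block.
    static long heapBytes(long fileCount, long fileSizeBytes) {
        long blocksPerFile = Math.max(1, (fileSizeBytes + BLOCK_SIZE - 1) / BLOCK_SIZE);
        return fileCount * (1 + blocksPerFile) * BYTES_PER_OBJECT;
    }

    public static void main(String[] args) {
        long totalData = 1024L * 1024 * 1024 * 1024;   // 1 TB of data in both scenarios
        long small = 1024L * 1024;                     // 1 MB "small" files
        long large = 1024L * 1024 * 1024;              // 1 GB "large" files

        System.out.printf("1 TB as 1 MB files: %,d MB of NameNode heap%n",
                heapBytes(totalData / small, small) / (1024 * 1024));
        System.out.printf("1 TB as 1 GB files: %,d MB of NameNode heap%n",
                heapBytes(totalData / large, large) / (1024 * 1024));
    }
}
```

Under these assumptions, storing 1 TB as one-megabyte files consumes roughly 300 MB of NameNode heap, versus about 1 MB when the same data is stored as one-gigabyte files, which is why the approaches surveyed in the paper typically pack small files into larger containers.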