Performance Challenges and Solutions in Big Data Platform Hadoop

Recent Advances in Computer Science and Communications (JCR Q3, Computer Science); Pub Date: 2023-06-08; DOI: 10.2174/2666255816666230608165146

Balraj Singh, H. Verma, Vishu Madaan


Citations: 0

Performance Challenges and Solutions in Big Data Platform Hadoop

The present era demands continuous improvement in executing complex analytics on large-scale data and in working beyond traditional systems. The need to process diverse data types and to deliver solutions for different industry domains is rising. Such needs increase the demand for sophisticated techniques and methods that further enhance existing platforms and mechanisms. This gives the research community an opportunity to investigate existing systems, find potential issues, and propose new ways to improve them. Hadoop is a popular choice for managing and processing Big Data. It is an open-source platform and a front-runner in the batch processing of large-scale jobs. The cost associated with scaling a cluster is low compared to other platforms. However, this popularity by no means guarantees high performance in all scenarios. With the continuous evolution of data and industrial requirements, it is imperative to investigate new methods and techniques that advance the existing system.

This paper presents a systematic review to give insight into current progress in this field. Research publications from various sources are collected and analyzed. The performance of a cluster largely depends on the job processing mechanisms and policies associated with it.

Although extensive studies and solutions have been proposed, performance bottlenecks in load balancing, resource utilization, content management, and efficient processing persist. Few scheduling solutions address the trade-offs between different parameters; the process of content splitting and merging has not been explored to a large extent; and skew mitigation solutions focus mainly on the Reduce side of MapReduce, while the Map side is little used for load balancing.
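To make the skew problem named in the abstract concrete, the following is a minimal sketch of how hash partitioning can overload a single reducer when one key dominates the map output. It is not taken from the paper. The function names, the sample keys, and the use of CRC32 as the hash are all assumptions chosen for illustration; Hadoop's own partitioner works on Java hash codes.

```python
import zlib


def partition(key: str, num_reducers: int) -> int:
    """Assign a key to a reducer by hashing (CRC32 is an illustrative stand-in)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def reducer_loads(keys, num_reducers):
    """Count how many records each reducer would receive."""
    loads = [0] * num_reducers
    for key in keys:
        loads[partition(key, num_reducers)] += 1
    return loads


def imbalance(loads):
    """Ratio of the busiest reducer's load to the mean load (1.0 is perfectly even)."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean if mean else 0.0


# Hypothetical map output: one "hot" key accounts for 90% of the records.
skewed_keys = ["hot"] * 900 + [f"user{i}" for i in range(100)]
skewed_loads = reducer_loads(skewed_keys, num_reducers=4)

print("records per reducer:", skewed_loads)
print("imbalance factor:", imbalance(skewed_loads))
```

Because every record with the same key must reach the same reducer, the reducer that owns the hot key receives at least 900 of the 1,000 records no matter which hash function is used. This is the kind of imbalance that Reduce-side skew mitigation targets; balancing work on the Map side, which the abstract notes is less explored, would have to act before this partitioning step.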


Source journal

Recent Advances in Computer Science and Communications (Computer Science: Computer Science (all))

CiteScore: 2.50

Self-citation rate: 0.00%

Articles published: 142