Diagnosing Performance Bottlenecks in Massive Data Parallel Programs

2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid). Pub Date: 2016-05-16. DOI: 10.1109/CCGrid.2016.81

Vinícius Dias, R. Moreira, Wagner Meira Jr, D. Guedes


Citations: 7

Abstract

The growing volume of stored data, and the variety of applications recently proposed to exploit it, have enabled a new generation of parallel programming environments and paradigms. Although most of these environments provide abstract programming interfaces and embed run-time strategies that simplify typical tasks in parallel and distributed systems, achieving good performance remains a challenge. In this paper we identify some common sources of performance degradation in the Spark programming environment and discuss diagnosis dimensions that can be used to better understand such degradation. We then describe our experience in using those dimensions to drive the identification of performance problems, and suggest how their impact may be minimized in real applications.

