Assisting developers of Big Data Analytics Applications when deploying on Hadoop clouds

2013 35th International Conference on Software Engineering (ICSE). Pub Date: 2013-05-18. DOI: 10.1109/ICSE.2013.6606586

Weiyi Shang, Z. Jiang, H. Hemmati, Bram Adams, A. Hassan, Patrick Martin


Citations: 165

Abstract

Big data analytics is the process of examining large amounts of data (big data) in an effort to uncover hidden patterns or unknown correlations. Big Data Analytics Applications (BDA Apps) are a new type of software application that analyzes big data using massively parallel processing frameworks (e.g., Hadoop). Developers of such applications typically develop them using a small sample of data in a pseudo-cloud environment. Afterwards, they deploy the applications in a large-scale cloud environment with considerably more processing power and larger input data (reminiscent of the mainframe days). Working with BDA App developers in industry over the past three years, we noticed that the runtime analysis and debugging of such applications in the deployment phase cannot be easily addressed by traditional monitoring and debugging approaches. In this paper, as a first step in assisting developers of BDA Apps for cloud deployments, we propose a lightweight approach for uncovering differences between pseudo and large-scale cloud deployments. Our approach makes use of the readily available yet rarely used execution logs from these platforms: it abstracts the execution logs, recovers the execution sequences, and compares the sequences between the pseudo and cloud deployments. Through a case study on three representative Hadoop-based BDA Apps, we show that our approach can rapidly direct the attention of BDA App developers to the major differences between the two deployments. Knowledge of such differences is essential in verifying BDA Apps when analyzing big data in the cloud. Using injected deployment faults, we show that our approach not only significantly reduces the deployment verification effort, but also produces very few false positives when identifying deployment failures.
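The three steps named in the abstract (abstract the logs, recover execution sequences, compare the sequences across deployments) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not describe the log format or the abstraction rules, so the line layout (`<task id> <message>`), the regular expressions, the placeholders `<ID>` and `<NUM>`, and all function names below are assumptions made purely for illustration.

```python
import re
from collections import defaultdict

# Assumed patterns for dynamic values; real Hadoop logs may need different rules.
ID_PATTERN = re.compile(r"\b(?:job|task|attempt)_\w+")
NUM_PATTERN = re.compile(r"\b\d+\b")


def abstract_event(message: str) -> str:
    """Step 1: replace dynamic values with placeholders to obtain an event template."""
    message = ID_PATTERN.sub("<ID>", message)
    return NUM_PATTERN.sub("<NUM>", message).strip()


def recover_sequences(log_lines):
    """Step 2: group abstracted events by task id (assumed to be the first token)
    and return the set of distinct event sequences."""
    per_task = defaultdict(list)
    for line in log_lines:
        task_id, _, message = line.partition(" ")
        per_task[task_id].append(abstract_event(message))
    return {tuple(events) for events in per_task.values()}


def compare_deployments(pseudo_lines, cloud_lines):
    """Step 3: report event sequences that appear in only one deployment."""
    pseudo = recover_sequences(pseudo_lines)
    cloud = recover_sequences(cloud_lines)
    return {"only_in_pseudo": pseudo - cloud, "only_in_cloud": cloud - pseudo}


# Hypothetical log excerpts, invented for this example.
pseudo_log = [
    "task_01 Reading split 0 of 128 bytes",
    "task_01 Map finished in 12 ms",
]
cloud_log = [
    "task_07 Reading split 3 of 4096 bytes",
    "task_07 Map finished in 340 ms",
    "task_08 Reading split 4 of 4096 bytes",
    "task_08 Retrying after timeout 30 s",
    "task_08 Map finished in 900 ms",
]

diff = compare_deployments(pseudo_log, cloud_log)
for sequence in diff["only_in_cloud"]:
    print(" -> ".join(sequence))
```

In this toy input, the sequence containing the retry event occurs only in the cloud log, so it is the one a developer would be pointed to. Sequences that differ only in sizes, timings, or identifiers collapse to the same template and are not reported.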
