Characterizing X86 and ARM Serverless Performance Variation: A Natural Language Processing Case Study
Danielle Lambion, Robert Schmitz, R. Cordingly, Navid Heydari, W. Lloyd
In this paper, we leverage a Natural Language Processing (NLP) pipeline for topic modeling, consisting of three functions for data preprocessing, model training, and inferencing, to analyze serverless platform performance variation. Specifically, we investigated performance on x86_64 and ARM64 processors over a 24-hour day, starting at midnight local time, in four AWS Lambda cloud regions across three continents. We identified public cloud resource contention by leveraging the CPU steal metric and examined its relationship to NLP pipeline runtime. Intel x86_64 Xeon processors at the same clock rate as ARM64 (Graviton 2) processors were more than 23% faster for model training, but the ARM64 processors were faster for data preprocessing and inferencing. Using the Intel x86_64 architecture for the NLP pipeline was up to 33.4% more expensive than ARM64, a consequence of the cloud provider's incentivized ARM64 pricing and of slower pipeline runtime caused by greater resource contention on the Intel processors.
{"title":"Characterizing X86 and ARM Serverless Performance Variation: A Natural Language Processing Case Study","authors":"Danielle Lambion, Robert Schmitz, R. Cordingly, Navid Heydari, W. Lloyd","doi":"10.1145/3491204.3543506","DOIUrl":"https://doi.org/10.1145/3491204.3543506","url":null,"abstract":"In this paper, we leverage a Natural Language Processing (NLP) pipeline for topic modeling consisting of three functions for data preprocessing, model training, and inferencing to analyze serverless platform performance variation. Specifically, we investigated performance using x86_64 and ARM64 processors over a 24-hour day starting at midnight local time on four cloud regions across three continents on AWS Lambda. We identified public cloud resource contention by leveraging the CPU steal metric, and examined relationships to NLP pipeline runtime. Intel x86_64 Xeon processors at the same clock rate as ARM64 processors (Graviton 2) were more than 23% faster for model training, but ARM64 processors were faster for data preprocessing and inferencing. Use of the Intel x86_64 architecture for the NLP pipeline was up to 33.4% more expensive than ARM64 as a result of incentivized pricing from the cloud provider and slower pipeline runtime due to greater resource contention for Intel processors.","PeriodicalId":129216,"journal":{"name":"Companion of the 2022 ACM/SPEC International Conference on Performance Engineering","volume":"271 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134011908","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Change Point Detection for MongoDB Time Series Performance Regression
Mark Leznik, Md Shahriar Iqbal, Igor A. Trubin, Arne Lochner, Pooyan Jamshidi, A. Bauer
Commits to the MongoDB software repository trigger a collection of automatically run tests, where identifying the commits responsible for performance regressions is paramount. Previously, the process relied on manual inspection of time series graphs to identify significant changes, later replaced by a threshold-based detection system; however, neither approach found performance changes in a timely manner. This work describes our recent implementation of a change point detection system built upon time series features, a voting system, the Perfomalist approach, and XGBoost. The algorithm produces a list of change points representing significant changes in a given history of performance results. We are able to detect change points automatically with 83% accuracy, all while reducing the human effort in the process.
{"title":"Change Point Detection for MongoDB Time Series Performance Regression","authors":"Mark Leznik, Md Shahriar Iqbal, Igor A. Trubin, Arne Lochner, Pooyan Jamshidi, A. Bauer","doi":"10.1145/3491204.3527488","DOIUrl":"https://doi.org/10.1145/3491204.3527488","url":null,"abstract":"Commits to the MongoDB software repository trigger a collection of automatically run tests. Here, the identification of commits responsible for performance regressions is paramount. Previously, the process relied on manual inspection of time series graphs to identify significant changes, later replaced with a threshold-based detection system. However, neither system was sufficient for finding changes in performance in a timely manner. This work describes our recent implementation of a change point detection system built upon time series features, a voting system, the Perfomalist approach, and XGBoost. The algorithm produces a list of change points representing significant changes from a given history of performance results. We are able to automatically detect change points and achieve an 83% accuracy, all while reducing the human effort in the process.","PeriodicalId":129216,"journal":{"name":"Companion of the 2022 ACM/SPEC International Conference on Performance Engineering","volume":"60 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115510747","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Performance Evaluation of GraphCore IPU-M2000 Accelerator for Text Detection Application
Nupur Sumeet, Karan Rawat, M. Nambiar
The large compute load and memory footprint of modern deep neural networks motivate the use of accelerators for high-throughput deployments in applications spanning multiple domains. In this paper, we evaluate the throughput capabilities of comparatively new hardware from Graphcore, the IPU-M2000, which supports massive parallelism and in-memory compute. For a text detection model, we measured the variation of throughput and power with batch size. We also evaluate compressed versions of this model and analyze performance variation with model precision. Additionally, we compare IPU (Intelligence Processing Unit) results with state-of-the-art GPU and FPGA deployments of a compute-intensive text region detection application. Our experiments suggest the IPU delivers superior throughput: 27×, 1.89×, and 1.56× that of a CPU, an FPGA DPU, and an A100 GPU, respectively, for the text detection application.
{"title":"Performance Evaluation of GraphCore IPU-M2000 Accelerator for Text Detection Application","authors":"Nupur Sumeet, Karan Rawat, M. Nambiar","doi":"10.1145/3491204.3527469","DOIUrl":"https://doi.org/10.1145/3491204.3527469","url":null,"abstract":"The large compute load and memory footprint of modern deep neural networks motivates the use of accelerators for high through- put deployments in application spanning multiple domains. In this paper, we evaluate throughput capabilities of a comparatively new hardware from Graphcore, IPU-M2000 that supports massive par- allelism and in-memory compute. For a text detection model, we measured the throughput and power variations with batch size. We also evaluate compressed versions of this model and analyze perfor- mance variation with model precision. Additionally, we compare IPU (Intelligence Processing Unit) results with state-of-the-art GPU and FPGA deployments of a compute intensive text region detec- tion application. Our experiments suggest, IPU supports superior throughput, 27×, 1.89×, and 1.56× as compared to CPU, FPGA DPU and A100 GPU, respectively for text detection application.","PeriodicalId":129216,"journal":{"name":"Companion of the 2022 ACM/SPEC International Conference on Performance Engineering","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116311265","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
HLS_Profiler: Non-Intrusive Profiling tool for HLS based Applications
Nupur Sumeet, D. Deeksha, M. Nambiar
High-Level Synthesis (HLS) tools aid simplified and faster development of designs that can be implemented on an FPGA (Field Programmable Gate Array), without requiring familiarity with Hardware Description Languages (HDL) and the Register Transfer Level (RTL) design flow. However, it is not straightforward to trace and link source code to the synthesized hardware design. The traditional RTL-based development flow, on the other hand, provides a fine-grained performance profile through waveforms. With the same level of visibility into HLS designs, designers could identify performance bottlenecks and reach their performance targets by iteratively fine-tuning the source code. Although the HLS development tools do provide low-level waveforms, interpreting them in terms of source code variables is a challenging and tedious task. Addressing this gap, we propose to demonstrate an automated profiler tool, HLS_Profiler, that provides a cycle-accurate performance profile of the source code.
{"title":"HLS_Profiler: Non-Intrusive Profiling tool for HLS based Applications","authors":"Nupur Sumeet, D. Deeksha, M. Nambiar","doi":"10.1145/3491204.3527496","DOIUrl":"https://doi.org/10.1145/3491204.3527496","url":null,"abstract":"The High-Level Synthesis (HLS) tools aid in simplified and faster design development without familiarity with Hardware Description Language (HDL) and Register Transfer Logic (RTL) design flow that can be implemented on an FPGA (Field Programmable Gate Array). However, it is not straight forward to trace and link source code to synthesized hardware design. On the other hand, the traditional RTL-based design development flow provides the fine-grained performance profile through waveforms. With the same level of visibility in HLS designs, the designers can identify the performance-bottlenecks and obtain the target performance by iteratively fine-tuning the source code. Although, the HLS development tools provide the low-level waveforms, interpreting them in terms of source code variables is a challenging and tedious task. Addressing this gap, we propose to demonstrate an automated profiler tool, HLS_Profiler, that provides a performance profile of source code in a cycle-accurate manner.","PeriodicalId":129216,"journal":{"name":"Companion of the 2022 ACM/SPEC International Conference on Performance Engineering","volume":"77 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122848528","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MiSeRTrace: Kernel-level Request Tracing for Microservice Visibility
Thrivikraman V, Vishnu R. Dixit, Nikhil Ram S, Vikas K. Gowda, Santhosh Kumar Vasudevan, Subramaniam Kalambur
With the evolution of microservice applications, the underlying architectures have become increasingly complex compared to their monolithic counterparts. This brings the challenge of observability: by providing a deeper understanding of how distributed applications function, observability enables improving system performance by exposing the bottlenecks in the implementation. The observability provided by existing tools that perform dynamic tracing of distributed applications is limited to user space and requires the application to be instrumented to track request flows. In this paper, we present MiSeRTrace, a new open-source framework that traces the end-to-end path of requests entering a microservice application at the kernel level, without requiring instrumentation or modification of the application. The comprehensiveness of kernel-space observability allows breaking down activities such as network transfers and I/O tasks into their constituent steps, enabling root-cause-based performance analysis and accurate identification of hotspots. MiSeRTrace supports tracing user-enabled kernel events provided by frameworks such as bpftrace or ftrace and isolates the kernel activity associated with each application request with minimal overhead. We demonstrate the solution with results on a benchmark microservice application.
{"title":"MiSeRTrace: Kernel-level Request Tracing for Microservice Visibility","authors":"Thrivikraman V, Vishnu R. Dixit, Nikhil Ram S, Vikas K. Gowda, Santhosh Kumar Vasudevan, Subramaniam Kalambur","doi":"10.1145/3491204.3527462","DOIUrl":"https://doi.org/10.1145/3491204.3527462","url":null,"abstract":"With the evolution of microservice applications, the underlying architectures have become increasingly complex compared to their monolith counterparts. This mainly brings in the challenge of observability. By providing a deeper understanding into the functioning of distributed applications, observability enables improving the performance of the system by obtaining a view of the bottlenecks in the implementation. The observability provided by currently existing tools that perform dynamic tracing on distributed applications is limited to the user-space and requires the application to be instrumented to track request flows. In this paper, we present a new open-source framework MiSeRTrace that can trace the end-to-end path of requests entering a microservice application at the kernel space without requiring instrumentation or modification of the application. Observability at the comprehensiveness of the kernel space allows breaking down of various steps in activities such as network transfers and IO tasks, thus enabling root cause based performance analysis and accurate identification of hotspots. MiSeRTrace supports tracing user-enabled kernel events provided by frameworks such as bpftrace or ftrace and isolates kernel activity associated with each application request with minimal overheads. We then demonstrate the working of the solution with results on a benchmark microservice application.","PeriodicalId":129216,"journal":{"name":"Companion of the 2022 ACM/SPEC International Conference on Performance Engineering","volume":"109 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-03-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133267727","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Beauty and the Beast: A Case Study on Performance Prototyping of Data-Intensive Containerized Cloud Applications
Floriment Klinaku, Martina Rapp, Jörg Henß, Stephan Rhode
Data-intensive container-based cloud applications have become popular with the growth of use cases in the Internet of Things domain. Challenges arise when engineering such applications to meet quality requirements, both classical ones like performance and emerging ones like resilience. The research community lacks reference use cases, applications, and prototyping experiences for such applications. Moreover, it is hard to generate realistic and reliable workloads that exercise resources according to a specification, which makes it hard to design reference applications that exhibit similar performance behavior in such environments. In this paper, we present work in progress towards an industrially motivated reference use case and application for data-intensive containerized cloud applications. To generate reliable CPU workloads, we use ProtoCom, a well-known library for the generation of resource demands, and report performance under various quality requirements in a Kubernetes cluster of moderate size. Finally, we present the scalability of the current solution under a particular autoscaling policy. Calibration results show high variability of the ProtoCom library when executed in a cloud environment, and we observe a moderate association between node occupancy and the relative variability of execution time.
{"title":"Beauty and the Beast: A Case Study on Performance Prototyping of Data-Intensive Containerized Cloud Applications","authors":"Floriment Klinaku, Martina Rapp, Jörg Henß, Stephan Rhode","doi":"10.1145/3491204.3527482","DOIUrl":"https://doi.org/10.1145/3491204.3527482","url":null,"abstract":"Data-intensive container-based cloud applications have become popular with the increased use cases in the Internet of Things domain. Challenges arise when engineering such applications to meet quality requirements, both classical ones like performance and emerging ones like resilience. There is a lack of reference use cases, applications, and experiences when prototyping such applications that could benefit the research community. Moreover, it is hard to generate realistic and reliable workloads that exercise the resources according to a specification. Hence, designing reference applications that would exhibit similar performance behavior in such environments is hard. In this paper, we present a work in progress towards a reference use case and application for data-intensive containerized cloud applications having an industrial motivation. Moreover, to generate reliable CPU workloads we make use of ProtoCom, a well-known library for the generation of resource demands, and report the performance under various quality requirements in a Kubernetes cluster of moderate size. Finally, we present the scalability of the current solution assuming a particular autoscaling policy. Results of the calibration show high variability of the ProtoCom library when executed in a cloud environment. We observe a moderate association between the occupancy of node and the relative variability of execution time.","PeriodicalId":129216,"journal":{"name":"Companion of the 2022 ACM/SPEC International Conference on Performance Engineering","volume":" 17","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-03-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120933934","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Companion of the 2022 ACM/SPEC International Conference on Performance Engineering","authors":"","doi":"10.1145/3491204","DOIUrl":"https://doi.org/10.1145/3491204","url":null,"abstract":"","PeriodicalId":129216,"journal":{"name":"Companion of the 2022 ACM/SPEC International Conference on Performance Engineering","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130115896","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}