Workflow Critical Path: A data-oriented critical path metric for Holistic HPC Workflows
Pub Date: 2021-10-01 | DOI: 10.1016/j.tbench.2021.100001
Daniel D. Nguyen, Karen L. Karavanic
Current trends in HPC, such as the push to exascale, convergence with Big Data, and growing complexity of HPC applications, have created gaps that traditional performance tools do not cover. One example is Holistic HPC Workflows — HPC workflows comprising multiple codes, paradigms, or platforms that are not developed using a workflow management system. To diagnose the performance of these applications, we define a new metric called Workflow Critical Path (WCP), a data-oriented metric for Holistic HPC Workflows. WCP constructs graphs that span across the workflow codes and platforms, using data states as vertices and data mutations as edges. Using cloud-based technologies, we implement a prototype called Crux, a distributed analysis tool for calculating and visualizing WCP. Our experiments with a workflow simulator on Amazon Web Services show Crux is scalable and capable of correctly calculating WCP for common Holistic HPC workflow patterns. We explore the use of WCP and discuss how Crux could be used in a production HPC environment.
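The abstract describes WCP only at a high level. As a rough, hedged illustration of the underlying idea (not the Crux implementation), the Python sketch below models data states as vertices and timed data mutations as weighted edges, then reports the longest weighted path through the resulting DAG; all state names and durations are invented.

```python
# Hypothetical sketch of a Workflow Critical Path style computation:
# vertices are data states, edges are data mutations weighted by duration.
# Illustration of the idea only, not the Crux implementation.
from collections import defaultdict

def critical_path(mutations):
    """mutations: list of (src_state, dst_state, seconds) tuples forming a DAG."""
    graph = defaultdict(list)
    indeg = defaultdict(int)
    states = set()
    for src, dst, secs in mutations:
        graph[src].append((dst, secs))
        indeg[dst] += 1
        states.update((src, dst))

    # Topological order (Kahn's algorithm).
    order, queue = [], [s for s in states if indeg[s] == 0]
    while queue:
        s = queue.pop()
        order.append(s)
        for dst, _ in graph[s]:
            indeg[dst] -= 1
            if indeg[dst] == 0:
                queue.append(dst)

    # Longest-path relaxation over the topological order.
    dist = {s: 0.0 for s in states}
    pred = {}
    for s in order:
        for dst, secs in graph[s]:
            if dist[s] + secs > dist[dst]:
                dist[dst] = dist[s] + secs
                pred[dst] = s

    # Reconstruct the critical path ending at the latest-finishing state.
    end = max(dist, key=dist.get)
    path = [end]
    while path[-1] in pred:
        path.append(pred[path[-1]])
    return list(reversed(path)), dist[end]

if __name__ == "__main__":
    # Invented workflow: simulation output flows into analysis and visualization.
    example = [
        ("raw_input", "sim_output", 120.0),
        ("sim_output", "analysis_result", 45.0),
        ("sim_output", "checkpoint", 10.0),
        ("analysis_result", "visualization", 15.0),
    ]
    path, length = critical_path(example)
    print(" -> ".join(path), f"({length:.1f} s)")
```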
{"title":"Workflow Critical Path: A data-oriented critical path metric for Holistic HPC Workflows","authors":"Daniel D. Nguyen, Karen L. Karavanic","doi":"10.1016/j.tbench.2021.100001","DOIUrl":"https://doi.org/10.1016/j.tbench.2021.100001","url":null,"abstract":"<div><p>Current trends in HPC, such as the push to exascale, convergence with Big Data, and growing complexity of HPC applications, have created gaps that traditional performance tools do not cover. One example is Holistic HPC Workflows — HPC workflows comprising multiple codes, paradigms, or platforms that are not developed using a workflow management system. To diagnose the performance of these applications, we define a new metric called Workflow Critical Path (WCP), a data-oriented metric for Holistic HPC Workflows. WCP constructs graphs that span across the workflow codes and platforms, using data states as vertices and data mutations as edges. Using cloud-based technologies, we implement a prototype called Crux, a distributed analysis tool for calculating and visualizing WCP. Our experiments with a workflow simulator on Amazon Web Services show Crux is scalable and capable of correctly calculating WCP for common Holistic HPC workflow patterns. We explore the use of WCP and discuss how Crux could be used in a production HPC environment.</p></div>","PeriodicalId":100155,"journal":{"name":"BenchCouncil Transactions on Benchmarks, Standards and Evaluations","volume":"1 1","pages":"Article 100001"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S2772485921000016/pdfft?md5=b03db3d209b242d8a2f663d25834fd8c&pid=1-s2.0-S2772485921000016-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"92003797","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Latency-aware automatic CNN channel pruning with GPU runtime analysis
Pub Date: 2021-10-01 | DOI: 10.1016/j.tbench.2021.100009
Jiaqiang Liu, Jingwei Sun, Zhongtian Xu, Guangzhong Sun
The huge storage and computation costs of convolutional neural networks (CNNs) make it challenging for them to meet the real-time inference requirements of many applications. Existing channel pruning methods mainly focus on removing unimportant channels from a CNN model based on rule-of-thumb designs, using the reduction in floating-point operations (FLOPs) and parameter counts to measure pruning quality. The inference latency of the pruned models is often overlooked. In this paper, we propose a latency-aware automatic CNN channel pruning method (LACP), which aims to automatically search for pruned network structures with low latency and high accuracy. We evaluate the inaccuracy of measuring pruning quality by FLOPs and parameter counts, and instead use the model inference latency as the direct optimization metric. To bridge model pruning and inference acceleration, we analyze the inference latency of convolutional layers on GPUs. The results show that the inference latency of convolutional layers exhibits a staircase pattern as the channel count grows, due to the GPU tail effect. Based on that observation, we greatly shrink the search space of network structures. We then apply an evolutionary procedure to search for a computationally efficient pruned network structure, which reduces inference latency while maintaining model accuracy. Experiments and comparisons with state-of-the-art methods on three image classification datasets show that our method achieves better inference acceleration with less accuracy loss.
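As a hedged illustration of the staircase observation (not the authors' code), the sketch below times a single convolutional layer on a GPU with PyTorch while sweeping its output channel count; the layer shape, warm-up count, and channel range are assumptions chosen for the example.

```python
# Illustrative sketch (assumed shapes and parameters): measure the inference
# latency of one convolutional layer while sweeping its output channel count.
# On a GPU the latency typically rises in steps rather than smoothly.
import time
import torch

def conv_latency_ms(out_channels, in_channels=256, hw=56, reps=50):
    """Average forward latency (ms) of a single 3x3 conv layer."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    conv = torch.nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)
    conv = conv.to(device).eval()
    x = torch.randn(1, in_channels, hw, hw, device=device)
    with torch.no_grad():
        for _ in range(10):                      # warm-up iterations
            conv(x)
        if device == "cuda":
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(reps):
            conv(x)
        if device == "cuda":
            torch.cuda.synchronize()
        elapsed = time.perf_counter() - start
    return elapsed / reps * 1e3

if __name__ == "__main__":
    # Sweep the output channel count and print the measured latencies.
    for c in range(32, 513, 32):
        print(f"{c:4d} channels: {conv_latency_ms(c):7.3f} ms")
```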
{"title":"Latency-aware automatic CNN channel pruning with GPU runtime analysis","authors":"Jiaqiang Liu, Jingwei Sun, Zhongtian Xu, Guangzhong Sun","doi":"10.1016/j.tbench.2021.100009","DOIUrl":"10.1016/j.tbench.2021.100009","url":null,"abstract":"<div><p>The huge storage and computation cost of convolutional neural networks (CNN) make them challenging to meet the real-time inference requirement in many applications. Existing channel pruning methods mainly focus on removing unimportant channels in a CNN model based on rule-of-thumb designs, using reduced floating-point operations (FLOPs) and parameter numbers to measure the pruning quality. The inference latency of pruned models is often overlooked. In this paper, we propose a latency-aware automatic CNN channel pruning method (LACP), which aims to search low latency and accurate pruned network structure automatically. We evaluate the inaccuracy of measuring pruning quality by FLOPs and the number of parameters, and use the model inference latency as the direct optimization metric. To bridge model pruning and inference acceleration, we analyze the inference latency of convolutional layers on GPU. Results show that the inference latency of convolutional layers exhibits a staircase pattern along with channel number due to the GPU tail effect. Based on that observation, we greatly shrink the search space of network structures. Then we apply an evolutionary procedure to search a computationally efficient pruned network structure, which reduces the inference latency and maintains the model accuracy. Experiments and comparisons with state-of-the-art methods on three image classification datasets show that our method can achieve better inference acceleration with less accuracy loss.</p></div>","PeriodicalId":100155,"journal":{"name":"BenchCouncil Transactions on Benchmarks, Standards and Evaluations","volume":"1 1","pages":"Article 100009"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S2772485921000090/pdfft?md5=e3e618453811eda67a8549ff2a96e500&pid=1-s2.0-S2772485921000090-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77841061","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Benchmarking for Observability: The Case of Diagnosing Storage Failures
Pub Date: 2021-10-01 | DOI: 10.1016/j.tbench.2021.100006
Duo Zhang, Mai Zheng
Diagnosing storage system failures is challenging even for professionals. One recent example is the "When Solid State Drives Are Not That Solid" incident that occurred at the Algolia data center, where Samsung SSDs were mistakenly blamed for failures caused by a Linux kernel bug. As system complexity keeps increasing, diagnosing failures will likely become even more difficult.
To better understand real-world failures and the potential limitations of state-of-the-art tools, in this paper we first conduct an empirical study of 277 user-reported storage failures. We characterize the issues along multiple dimensions (e.g., time to resolve, kernel components involved), which provides a quantitative measurement of the challenge in practice. Moreover, we analyze a set of the storage issues in depth and derive a benchmark suite called BugBench^k. The benchmark suite includes the workloads and software environments necessary to reproduce 9 storage failures, covers 4 different file systems and the block I/O layer of the storage stack, and enables realistic evaluation of diverse kernel-level debugging tools.
To demonstrate its usage, we apply BugBench^k to study two representative debugging tools. We focus on measuring the observations that the tools enable developers to make (i.e., observability), and derive concrete metrics to measure observability both qualitatively and quantitatively. Our measurements demonstrate the tools' different design tradeoffs in terms of debugging information and overhead. More importantly, we observe that both tools may behave abnormally when applied to diagnose a few tricky cases. Also, we find that neither tool can provide low-level information on how the persistent storage states are changed, which is essential for understanding storage failures. To address this limitation, we develop lightweight extensions that enable such functionality in both tools. We hope that BugBench^k and the measurements it enables will inspire follow-up research in benchmarking and tool support and help address the challenge of failure diagnosis in general.
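The abstract does not spell out the concrete observability metrics. Purely as an illustration of how observability could be scored quantitatively, the sketch below computes the fraction of failure-relevant events that a tool's trace exposes; the event names and tool traces are invented for the example.

```python
# Hypothetical illustration of a coverage-style observability score:
# the fraction of failure-relevant events that a debugging tool's trace
# lets a developer observe. Event names and traces below are invented.
def observability_score(relevant_events, tool_trace):
    observed = relevant_events & tool_trace
    return len(observed) / len(relevant_events)

if __name__ == "__main__":
    relevant = {"bio_submit", "journal_commit", "page_writeback", "disk_flush"}
    tool_a = {"bio_submit", "journal_commit", "vfs_write"}
    tool_b = {"bio_submit", "page_writeback", "disk_flush", "io_schedule"}
    print("tool A:", observability_score(relevant, tool_a))  # 0.5
    print("tool B:", observability_score(relevant, tool_b))  # 0.75
```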
{"title":"Benchmarking for Observability: The Case of Diagnosing Storage Failures","authors":"Duo Zhang, Mai Zheng","doi":"10.1016/j.tbench.2021.100006","DOIUrl":"10.1016/j.tbench.2021.100006","url":null,"abstract":"<div><p>Diagnosing storage system failures is challenging even for professionals. One recent example is the “When Solid State Drives Are Not That Solid” incident occurred at Algolia data center, where Samsung SSDs were mistakenly blamed for failures caused by a Linux kernel bug. With the system complexity keeps increasing, diagnosing failures will likely become more difficult.</p><p>To better understand real-world failures and the potential limitations of state-of-the-art tools, we first conduct an empirical study on 277 user-reported storage failures in this paper. We characterize the issues along multiple dimensions (e.g., time to resolve, kernel components involved), which provides a quantitative measurement of the challenge in practice. Moreover, we analyze a set of the storage issues in depth and derive a benchmark suite called <span><math><mrow><mi>B</mi><mi>u</mi><mi>g</mi><mi>B</mi><mi>e</mi><mi>n</mi><mi>c</mi><msup><mrow><mi>h</mi></mrow><mrow><mi>k</mi></mrow></msup></mrow></math></span>. The benchmark suite includes the necessary workloads and software environments to reproduce 9 storage failures, covers 4 different file systems and the block I/O layer of the storage stack, and enables realistic evaluation of diverse kernel-level tools for debugging.</p><p>To demonstrate the usage, we apply <span><math><mrow><mi>B</mi><mi>u</mi><mi>g</mi><mi>B</mi><mi>e</mi><mi>n</mi><mi>c</mi><msup><mrow><mi>h</mi></mrow><mrow><mi>k</mi></mrow></msup></mrow></math></span> to study two representative tools for debugging. We focus on measuring the observations that the tools enable developers to make (i.e., observability), and derive concrete metrics to measure the observability qualitatively and quantitatively. Our measurement demonstrates the different design tradeoffs in terms of debugging information and overhead. More importantly, we observe that both tools may behave abnormally when applied to diagnose a few tricky cases. Also, we find that neither tool can provide low-level information on how the persistent storage states are changed, which is essential for understanding storage failures. To address the limitation, we develop lightweight extensions to enable such functionality in both tools. 
We hope that <span><math><mrow><mi>B</mi><mi>u</mi><mi>g</mi><mi>B</mi><mi>e</mi><mi>n</mi><mi>c</mi><msup><mrow><mi>h</mi></mrow><mrow><mi>k</mi></mrow></msup></mrow></math></span> and the enabled measurements will inspire follow-up research in benchmarking and tool support and help address the challenge of failure diagnosis in general.</p></div>","PeriodicalId":100155,"journal":{"name":"BenchCouncil Transactions on Benchmarks, Standards and Evaluations","volume":"1 1","pages":"Article 100006"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S2772485921000065/pdfft?md5=ccacbd5e0872d394b75b85db5386c9b4&pid=1-s2.0-S2772485921000065-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75003267","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Benchmarking feature selection methods with different prediction models on large-scale healthcare event data
Pub Date: 2021-10-01 | DOI: 10.1016/j.tbench.2021.100004
Fan Zhang, Chunjie Luo, Chuanxin Lan, Jianfeng Zhan
With the development of Electronic Health Record (EHR) technology, vast volumes of digital clinical data are generated, and many methods have been developed on top of these data to improve the performance of clinical predictions. Among those methods, Deep Neural Networks (DNN) have proven outstanding with respect to accuracy by exploiting many patient instances and events (features). However, each patient-specific event costs time and money, and collecting too many features before making a decision is impractical, especially for time-critical tasks such as mortality prediction. It is therefore essential to predict with high accuracy using as few clinical events as possible, which makes feature selection a critical question. This paper presents detailed benchmarking results of various feature selection methods, applying different classification and regression algorithms to clinical prediction tasks, including mortality prediction, length-of-stay prediction, and ICD-9 code group prediction. We use the publicly available Medical Information Mart for Intensive Care III (MIMIC-III) dataset in our experiments. Our results show that Genetic Algorithm (GA) based methods perform well with only a few features and outperform the others. Moreover, for the mortality prediction task, the feature subset selected by GA for one classifier can also be applied to other classifiers while achieving good performance.
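The abstract does not give the GA configuration used in the paper, so the following is only a generic sketch of GA-based feature selection under assumed hyperparameters: binary masks over feature columns serve as chromosomes, and cross-validated classifier accuracy serves as fitness.

```python
# Generic GA feature-selection sketch (assumed hyperparameters, not the paper's setup).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def ga_feature_select(X, y, pop_size=20, generations=15, mutation_rate=0.05, seed=0):
    """Return a boolean mask over the columns of X selected by a simple GA."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    masks = rng.random((pop_size, n_features)) < 0.5        # initial population

    def fitness(mask):
        if not mask.any():
            return 0.0
        clf = LogisticRegression(max_iter=1000)
        return cross_val_score(clf, X[:, mask], y, cv=3).mean()

    for _ in range(generations):
        scores = np.array([fitness(m) for m in masks])
        parents = masks[np.argsort(scores)[-pop_size // 2:]]  # keep the best half
        children = []
        while len(children) < pop_size - len(parents):
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = int(rng.integers(1, n_features))             # one-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            child ^= rng.random(n_features) < mutation_rate    # bit-flip mutation
            children.append(child)
        masks = np.vstack([parents] + children)

    scores = np.array([fitness(m) for m in masks])
    return masks[int(scores.argmax())]
```

A call such as `mask = ga_feature_select(X, y)` yields a column subset on which any classifier can then be trained, mirroring the abstract's observation that a GA-selected subset can transfer across classifiers.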
{"title":"Benchmarking feature selection methods with different prediction models on large-scale healthcare event data","authors":"Fan Zhang , Chunjie Luo , Chuanxin Lan , Jianfeng Zhan","doi":"10.1016/j.tbench.2021.100004","DOIUrl":"10.1016/j.tbench.2021.100004","url":null,"abstract":"<div><p>With the development of the Electronic Health Record (EHR) technique, vast volumes of digital clinical data are generated. Based on the data, many methods are developed to improve the performance of clinical predictions. Among those methods, Deep Neural Networks (DNN) have been proven outstanding with respect to accuracy by employing many patient instances and events (features). However, each patient-specific event requires time and money. Collecting too many features before making a decision is insufferable, especially for time-critical tasks such as mortality prediction. So it is essential to predict with high accuracy using as minimal clinical events as possible, which makes feature selection a critical question. This paper presents detailed benchmarking results of various feature selection methods, applying different classification and regression algorithms for clinical prediction tasks, including mortality prediction, length of stay prediction, and ICD-9 code group prediction. We use the publicly available dataset, Medical Information Mart for Intensive Care III (MIMIC-III), in our experiments. Our results show that Genetic Algorithm (GA) based methods perform well with only a few features and outperform others. Besides, for the mortality prediction task, the feature subset selected by GA for one classifier can also be used to others while achieving good performance.</p></div>","PeriodicalId":100155,"journal":{"name":"BenchCouncil Transactions on Benchmarks, Standards and Evaluations","volume":"1 1","pages":"Article 100004"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S2772485921000041/pdfft?md5=99fde57d63abff83585ab50c968fd9b0&pid=1-s2.0-S2772485921000041-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81823992","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Revisiting the effects of the Spectre and Meltdown patches using the top-down microarchitectural method and purchasing power parity theory
Pub Date: 2021-10-01 | DOI: 10.1016/j.tbench.2021.100011
Yectli A. Huerta, David J. Lilja
Software patches are made available to fix security vulnerabilities and to enhance performance and usability. Previous works focused on measuring the effect of patches on benchmark runtimes. In this study, we used the Top-Down microarchitecture analysis method to understand how pipeline bottlenecks were affected by the application of the Spectre and Meltdown security patches. Bottleneck analysis makes it possible to better understand how different hardware resources are being utilized, highlighting portions of the pipeline where improvements could be achieved. We complement the Top-Down analysis technique with a normalization technique from the field of economics, purchasing power parity (PPP), to better understand the relative differences between patched and unpatched runs. We showed that the security patches had an effect that was reflected in the corresponding Top-Down metrics, and that recent compilers are not as negatively affected as previously reported. Out of the 14 benchmarks that make up the SPEC OMP2012 suite, three showed noticeable slowdowns when the patches were applied. We also found that Top-Down metrics had large relative differences when the security patches were applied, differences that standard techniques based on absolute, non-normalized metrics failed to highlight.
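The abstract does not define the PPP normalization formula, so the following is only one plausible reading of the idea: express each Top-Down category relative to its own run's total (analogous to pricing a common basket in local units) before comparing patched and unpatched runs. The metric values in the example are invented.

```python
# Hedged sketch of a PPP-style comparison of Top-Down metrics between an
# unpatched and a patched run. Each metric is first expressed relative to its
# own run's total, and the patched/unpatched ratio of those relative values is
# reported. Illustration of the normalization idea only, not the paper's formula.
def ppp_ratios(unpatched, patched):
    u_total = sum(unpatched.values())
    p_total = sum(patched.values())
    return {
        name: (patched[name] / p_total) / (unpatched[name] / u_total)
        for name in unpatched
    }

if __name__ == "__main__":
    # Invented raw pipeline-slot counts (in billions) per Top-Down level-1 category.
    unpatched = {"retiring": 420.0, "bad_speculation": 60.0,
                 "frontend_bound": 170.0, "backend_bound": 350.0}
    patched   = {"retiring": 430.0, "bad_speculation": 62.0,
                 "frontend_bound": 180.0, "backend_bound": 470.0}
    for name, ratio in ppp_ratios(unpatched, patched).items():
        print(f"{name:16s} patched/unpatched (normalized): {ratio:.2f}")
```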
{"title":"Revisiting the effects of the Spectre and Meltdown patches using the top-down microarchitectural method and purchasing power parity theory","authors":"Yectli A. Huerta , David J. Lilja","doi":"10.1016/j.tbench.2021.100011","DOIUrl":"10.1016/j.tbench.2021.100011","url":null,"abstract":"<div><p>Software patches are made available to fix security vulnerabilities, enhance performance, and usability. Previous works focused on measuring the performance effect of patches on benchmark runtimes. In this study, we used the Top-Down microarchitecture analysis method to understand how pipeline bottlenecks were affected by the application of the Spectre and Meltdown security patches. Bottleneck analysis makes it possible to better understand how different hardware resources are being utilized, highlighting portions of the pipeline where possible improvements could be achieved. We complement the Top-Down analysis technique with the use a normalization technique from the field of economics, purchasing power parity (PPP), to better understand the relative difference between patched and unpatched runs. In this study, we showed that security patches had an effect that was reflected on the corresponding Top-Down metrics. We showed that recent compilers are not as negatively affected as previously reported. Out of the 14 benchmarks that make up the SPEC OMP2012 suite, three had noticeable slowdowns when the patches were applied. We also found that Top-Down metrics had large relative differences when the security patches were applied, differences that standard techniques based in absolute, non-normalized, metrics failed to highlight.</p></div>","PeriodicalId":100155,"journal":{"name":"BenchCouncil Transactions on Benchmarks, Standards and Evaluations","volume":"1 1","pages":"Article 100011"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S2772485921000119/pdfft?md5=6e1139ebac69c084c6fe482b82fdd42c&pid=1-s2.0-S2772485921000119-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91428938","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}