It is common to find a mixture of long batch jobs and latency-sensitive short jobs in enterprise data centers. Recently, hybrid job schedulers have emerged as attractive alternatives to conventional centralized job schedulers. In this paper, we conduct trace-driven experiments to study the job-completion-delay performance of two representative hybrid job schedulers (Hawk and Eagle), and find that short jobs still suffer from long latency due to the fluctuating, bursty nature of workloads. To this end, we propose Dice, a general performance optimization framework for hybrid job schedulers that alleviates the high job completion delays of short jobs. Dice is composed of two simple yet effective techniques: Elastic Sizing and Opportunistic Preemption. Both techniques keep track of the task waiting times of short jobs. When the mean task waiting time of short jobs is high, Elastic Sizing dynamically and adaptively increases the short partition size to prioritize short jobs over long jobs. Opportunistic Preemption, on the other hand, preempts resources from long tasks running in the general partition on demand, so as to mitigate the head-of-line blocking of short jobs. We enhance the two schedulers with Dice and evaluate its performance improvement in our prototype implementation. Experimental results under the Google trace show that Dice improves the 50th-, 75th-, and 90th-percentile job completion delays of short jobs by 50.9%, 54.5%, and 43.5% in Hawk, and by 33.2%, 74.1%, and 85.3% in Eagle, at low performance cost to long jobs.
{"title":"Improving Short Job Latency Performance in Hybrid Job Schedulers with Dice","authors":"Wei Zhou, K. White, Hongfeng Yu","doi":"10.1145/3337821.3337851","DOIUrl":"https://doi.org/10.1145/3337821.3337851","url":null,"abstract":"It is common to find a mixture of both long batch jobs and latency-sensitive short jobs in enterprise data centers. Recently hybrid job schedulers emerge as attractive alternatives of conventional centralized job schedulers. In this paper, we conduct trace-driven experiments to study the job-completion-delay performance of two representative hybrid job schedulers (Hawk and Eagle), and find that short jobs still encounter long latency issues due to fluctuating bursty nature of workloads. To this end, we propose Dice, a general performance optimization framework for hybrid job schedulers, to alleviate the high job-completion-delay problem of short jobs. Dice is composed of two simple yet effective techniques: Elastic Sizing and Opportunistic Preemption. Both Elastic Sizing and Opportunistic Preemption keep track of the task waiting times of short jobs. When the mean task waiting time of short jobs is high, Elastic Sizing dynamically and adaptively increases the short partition size to prioritize short jobs over long jobs. On the other hand, Opportunistic Preemption preempts resources from long tasks running in the general partition on demand, so as to mitigate the \"head-of-line\" blocking problem of short jobs. We enhance the two schedulers with Dice and evaluate Dice performance improvement in our prototype implementation. Experiment results show that Dice achieves 50.9%, 54.5%, and 43.5% improvement on 50th-percentile, 75th-percentile, and 90th-percentile job completion delays of short jobs in Hawk respectively, as well as 33.2%, 74.1%, and 85.3% improvement on those in Eagle respectively under the Google trace, at low performance costs to long jobs.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"82 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131757672","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Xi Wang, Antonino Tumeo, John D. Leidel, Jie Li, Yong Chen
Emerging data-intensive applications, such as graph analytics and data mining, exhibit irregular memory access patterns. Research has shown that for these memory-bound applications, traditional cache-based processor architectures, which exploit locality and regular access patterns to mitigate the memory-wall problem, are inefficient. Meanwhile, novel 3D-stacked memory devices, such as the Hybrid Memory Cube (HMC) and High Bandwidth Memory (HBM), promise significant increases in bandwidth that are extremely appealing for memory-bound applications. However, conventional memory interfaces designed for cache-based architectures and JEDEC DDR devices fit poorly with 3D-stacked memory, leading to significant under-utilization of the promised high bandwidth. In response to these issues, this paper proposes MAC (Memory Access Coalescer), a coalescing unit for 3D-stacked memory. We discuss the design and implementation of MAC in the context of a custom-designed, cache-less architecture targeted at data-intensive, irregular applications. Through a custom simulation infrastructure based on the RISC-V toolchain, we show that MAC achieves a coalescing efficiency of 52.85% on average and improves the performance of the memory system by 60.73% on average for a large set of irregular workloads.
{"title":"MAC: Memory Access Coalescer for 3D-Stacked Memory","authors":"Xi Wang, Antonino Tumeo, John D. Leidel, Jie Li, Yong Chen","doi":"10.1145/3337821.3337867","DOIUrl":"https://doi.org/10.1145/3337821.3337867","url":null,"abstract":"Emerging data-intensive applications, such as graph analytics and data mining, exhibit irregular memory access patterns. Research has shown that with these memory-bound applications, traditional cache-based processor architectures, which exploit locality and regular patterns to mitigate the memory-wall issue, are inefficient. Meantime, novel 3D-stacked memory devices, such as Hybrid Memory Cube (HMC) and High Bandwidth Memory (HBM), promise significant increases in bandwidth that appear extremely appealing for memory-bound applications. However, conventional memory interfaces designed for cache-based architectures and JEDEC DDR devices fit poorly with the 3D-stacked memory, which leads to significant under-utilization of the promised high bandwidth. As a response to these issues, in this paper we propose MAC (Memory Access Coalescer), a coalescing unit for the 3D-stacked memory. We discuss the design and implementation of MAC, in the context of a custom designed cache-less architecture targeted at data-intensive, irregular applications. Through a custom simulation infrastructure based on the RISC-V toolchain, we show that MAC achieves a coalescing efficiency of 52.85% on average. It improves the performance of the memory system by 60.73% on average for a large set of irregular workloads.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129643745","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sandeep Madireddy, Prasanna Balaprakash, P. Carns, R. Latham, Glenn K. Lockwood, R. Ross, S. Snyder, Stefan M. Wild
Supervised learning is a promising approach for modeling the performance of applications running on large HPC systems. A key assumption in supervised learning is that the training and testing data are obtained under the same conditions. However, in production HPC systems these conditions might not hold because the conditions of the platform can change over time as a result of hardware degradation, hardware replacement, software upgrade, and configuration updates. These changes could alter the data distribution in a way that affects the accuracy of the predictive performance models and render them less useful; this phenomenon is referred to as concept drift. Ignoring concept drift can lead to suboptimal resource usage and decreased efficiency when those performance models are deployed for tuning and job scheduling in production systems. To address this issue, we propose a concept-drift-aware predictive modeling approach that comprises two components: (1) an online Bayesian changepoint detection method that can automatically identify the location of events that lead to concept drift in near-real time and (2) a moment-matching transformation inspired by transfer learning that converts the training data collected before the drift to be useful for retraining. We use application input/output performance data collected on Cori, a production supercomputing system at the National Energy Research Scientific Computing Center, to demonstrate the effectiveness of our approach. The results show that concept-drift-aware models obtain significant improvement in accuracy; the median absolute error of the best-performing Gaussian process regression improved by 58.8% when the proposed approaches were used.
{"title":"Adaptive Learning for Concept Drift in Application Performance Modeling","authors":"Sandeep Madireddy, Prasanna Balaprakash, P. Carns, R. Latham, Glenn K. Lockwood, R. Ross, S. Snyder, Stefan M. Wild","doi":"10.1145/3337821.3337922","DOIUrl":"https://doi.org/10.1145/3337821.3337922","url":null,"abstract":"Supervised learning is a promising approach for modeling the performance of applications running on large HPC systems. A key assumption in supervised learning is that the training and testing data are obtained under the same conditions. However, in production HPC systems these conditions might not hold because the conditions of the platform can change over time as a result of hardware degradation, hardware replacement, software upgrade, and configuration updates. These changes could alter the data distribution in a way that affects the accuracy of the predictive performance models and render them less useful; this phenomenon is referred to as concept drift. Ignoring concept drift can lead to suboptimal resource usage and decreased efficiency when those performance models are deployed for tuning and job scheduling in production systems. To address this issue, we propose a concept-drift-aware predictive modeling approach that comprises two components: (1) an online Bayesian changepoint detection method that can automatically identify the location of events that lead to concept drift in near-real time and (2) a moment-matching transformation inspired by transfer learning that converts the training data collected before the drift to be useful for retraining. We use application input/output performance data collected on Cori, a production supercomputing system at the National Energy Research Scientific Computing Center, to demonstrate the effectiveness of our approach. The results show that concept-drift-aware models obtain significant improvement in accuracy; the median absolute error of the best-performing Gaussian process regression improved by 58.8% when the proposed approaches were used.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131156833","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hyunjun Kim, Sungin Hong, Hyeonsu Lee, Euiseong Seo, Hwansoo Han
Modern GPUs concurrently deploy thousands of threads to maximize thread-level parallelism (TLP) for performance. For some applications, however, maximized TLP leads to significant performance degradation, as many concurrent threads compete for the limited capacity of the data cache. In this paper, we propose a compiler-assisted thread throttling scheme that limits the number of active thread groups to reduce cache contention and consequently improve performance. Several dynamic thread throttling schemes have been proposed to alleviate cache contention by monitoring cache behavior, but they often fail to respond in time to dynamic changes in that behavior, since they adjust the parallelism only after the monitored behavior has changed. Our thread throttling scheme instead relies on compile-time adjustment of active thread groups to fit their memory footprints to the L1D capacity. We evaluated the proposed scheme with GPU programs that suffer from cache contention. Our approach improved the performance of the original programs by 42.96% on average, an 8.97% performance boost compared to static thread throttling schemes.
{"title":"Compiler-Assisted GPU Thread Throttling for Reduced Cache Contention","authors":"Hyunjun Kim, Sungin Hong, Hyeonsu Lee, Euiseong Seo, Hwansoo Han","doi":"10.1145/3337821.3337886","DOIUrl":"https://doi.org/10.1145/3337821.3337886","url":null,"abstract":"Modern GPUs concurrently deploy thousands of threads to maximize thread level parallelism (TLP) for performance. For some applications, however, maximized TLP leads to significant performance degradation, as many concurrent threads compete for the limited amount of the data cache. In this paper, we propose a compiler-assisted thread throttling scheme, which limits the number of active thread groups to reduce cache contention and consequently improve the performance. A few dynamic thread throttling schemes have been proposed to alleviate cache contention by monitoring the cache behavior, but they often fail to provide timely responses to the dynamic changes in the cache behavior, as they adjust the parallelism afterwards in response to the monitored behavior. Our thread throttling scheme relies on compile-time adjustment of active thread groups to fit their memory footprints to the L1D capacity. We evaluated the proposed scheme with GPU programs that suffer from cache contention. Our approach improved the performance of original programs by 42.96% on average, and this is 8.97% performance boost in comparison to the static thread throttling schemes.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134158668","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Maria Malik, Hassan Ghasemzadeh, T. Mohsenin, Rosario Cammarota, Liang Zhao, Avesta Sasan, H. Homayoun, S. Rafatirad
Datacenters provide high performance and flexibility for users and cost efficiency for operators. Hyperscale datacenters harness massively scalable compute resources for large-scale data analysis. However, cloud and datacenter infrastructure does not scale as fast as the input data volume and computational requirements of big data and analytics technologies. As a result, more applications need to share CPUs at the node level, which can have a large impact on performance and operational cost. To address this challenge, we show in this paper that concurrently fine-tuning parameters at the application, microarchitecture, and system levels creates opportunities to co-locate applications at the node level and improve the energy efficiency of the server while maintaining performance. Co-locating and self-tuning unknown applications are challenging problems, especially when multiple big data applications are co-located concurrently with many tuning knobs, potentially requiring an exhaustive brute-force search to find the right settings. This challenge creates a pressing need for a technique that co-locates applications at the node level and predicts the optimal system-, architecture-, and application-level configuration parameters to achieve maximum energy efficiency. We promote the scale-down of computational nodes by presenting the Energy-Efficient Co-Locating and Self-Tuning (ECoST) technique for data-intensive applications. A proof of concept of ECoST was successfully tested on the MapReduce platform; ECoST can also be deployed on other data-intensive frameworks that expose several parameters for power and performance tuning. ECoST collects run-time hardware performance counter data and implements various machine learning models, from ones as simple as a lookup table or a decision tree to ones as complex as a neural network, to predict the energy efficiency of co-located applications. Experimental data show that ECoST achieves energy efficiency within 4% of the upper-bound results when co-locating multiple applications at a node level. ECoST is also scalable, staying within 8% of the upper bound on an 8-node server.
{"title":"ECoST: Energy-Efficient Co-Locating and Self-Tuning MapReduce Applications","authors":"Maria Malik, Hassan Ghasemzadeh, T. Mohsenin, Rosario Cammarota, Liang Zhao, Avesta Sasan, H. Homayoun, S. Rafatirad","doi":"10.1145/3337821.3337834","DOIUrl":"https://doi.org/10.1145/3337821.3337834","url":null,"abstract":"Datacenters provide high performance and flexibility for users and cost efficiency for operators. Hyperscale datacenters are harnessing massively scalable computer resources for large-scale data analysis. However, cloud/datacenter infrastructure does not scale as fast as the input data volume and computational requirements of big data and analytics technologies. Thus, more applications need to share CPU at the node level that could have large impact on performance and operational cost. To address this challenge, in this paper we show that, concurrently fine-tune parameters at the application, microarchitecture, and system levels are creating opportunities to co-locate applications at the node level and improve energy-efficiency of the server while maintaining performance. Co-locating and self-tuning of unknown applications are challenging problems, especially when co-locating multiple big data applications concurrently with many tuning knobs, potentially requiring exhaustive brute-force search to find the right settings. This research challenge upsurges an imminent need to develop a technique that co-locates applications at a node level and predict the optimal system, architecture and application level configure parameters to achieve the maximum energy efficiency. It promotes the scale-down of computational nodes by presenting the Energy-Efficient Co-Locating and Self-Tuning (ECoST) technique for data intensive applications. ECoST proof of concept was successfully tested on MapReduce platform. ECoST can also be deployed on other data-intensive frameworks where there are several parameters for power and performance tuning optimizations. ECoST collects run-time hardware performance counter data and implements various machine learning models from as simple as a lookup table or decision tree based to as complex as neural network based to predict the energy-efficiency of co-located applications. Experimental data show energy efficiency is achieved within 4% of the upper bound results when co-locating multiple applications at a node level. ECoST is also scalable, being within 8% of upper bound on an 8-node server.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"96 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132784496","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ji Zhang, Ke Zhou, Ping Huang, Xubin He, Zhili Xiao, Bin Cheng, Yongguang Ji, Yinhu Wang
Storage systems in large-scale data centers are typically built upon thousands or even millions of disks, where disk failures constantly happen. A disk failure can lead to serious data loss and thus system unavailability, or even catastrophic consequences if the lost data cannot be recovered. While replication and erasure coding techniques have been widely deployed to guarantee storage availability and reliability, disk failure prediction is gaining popularity because it has the potential to prevent disk failures from occurring in the first place. Recent trends have turned toward applying machine learning approaches based on disk SMART attributes for disk failure prediction. However, traditional machine learning (ML) approaches require a large set of training data in order to deliver good predictive performance. In large-scale storage systems, new disks enter gradually to augment the storage capacity or to replace failed disks, so over time storage systems come to contain small numbers of new disks from different vendors and/or different models from the same vendor. We refer to these relatively small populations of disks as minority disks. Due to the lack of sufficient training data, traditional ML approaches fail to deliver satisfactory predictive performance in evolving storage systems that consist of heterogeneous minority disks. To address this challenge and improve the predictive performance for minority disks in large data centers, we propose a minority disk failure prediction model named TLDFP based on a transfer learning approach. Our evaluation results on two realistic datasets demonstrate that TLDFP delivers much more precise predictions than four popular prediction models based on traditional ML algorithms and two state-of-the-art transfer learning methods.
{"title":"Transfer Learning based Failure Prediction for Minority Disks in Large Data Centers of Heterogeneous Disk Systems","authors":"Ji Zhang, Ke Zhou, Ping Huang, Xubin He, Zhili Xiao, Bin Cheng, Yongguang Ji, Yinhu Wang","doi":"10.1145/3337821.3337881","DOIUrl":"https://doi.org/10.1145/3337821.3337881","url":null,"abstract":"The storage system in large scale data centers is typically built upon thousands or even millions of disks, where disk failures constantly happen. A disk failure could lead to serious data loss and thus system unavailability or even catastrophic consequences if the lost data cannot be recovered. While replication and erasure coding techniques have been widely deployed to guarantee storage availability and reliability, disk failure prediction is gaining popularity as it has the potential to prevent disk failures from occurring in the first place. Recent trends have turned toward applying machine learning approaches based on disk SMART attributes for disk failure predictions. However, traditional machine learning (ML) approaches require a large set of training data in order to deliver good predictive performance. In large-scale storage systems, new disks enter gradually to augment the storage capacity or to replace failed disks, leading storage systems to consist of small amounts of new disks from different vendors and/or different models from the same vendor as time goes on. We refer to this relatively small amount of disks as minority disks. Due to the lack of sufficient training data, traditional ML approaches fail to deliver satisfactory predictive performance in evolving storage systems which consist of heterogeneous minority disks. To address this challenge and improve the predictive performance for minority disks in large data centers, we propose a minority disk failure prediction model named TLDFP based on a transfer learning approach. Our evaluation results on two realistic datasets have demonstrated that TLDFP can deliver much more precise results, compared to four popular prediction models based on traditional ML algorithms and two state-of-the-art transfer learning methods.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127474226","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Chen Zhang, Q. Cao, Jie Yao, Yuanyuan Dong, Puyuan Yang
Identifying key scenes in massive surveillance videos is extremely challenging because these scenes occur rarely, while automatic identification using full-feature neural network (NN) models consumes immense computational resources. This paper proposes VScan, an efficient model-joint mechanism that adaptively schedules streams on a lightweight NN model and a full-feature NN model for analyzing videos concurrently. The two combined models, whose sets of detectable objects overlap, are generic and well developed. The former quickly scans videos to find potential scenes of interest; only the streams with identified scenes are further analyzed by the latter. We provide a model selection approach to choose a lightweight model with appropriate accuracy and high throughput. VScan further determines key parameters to correct predictions at runtime, thus guaranteeing the recall of target scenes, while the full-feature model is responsible for ensuring output precision. To maintain high hardware efficiency and utilization dynamically, VScan uses automatic sampling to reduce unnecessary computation, stream scheduling to maximize hardware usage, and GPU scheduling to optimize the data processing flow. Experimental results show that, benefiting from the model-joint mechanism and runtime scheduling optimizations, VScan boosts video processing throughput by up to 15x without losing key scenes.
{"title":"VScan","authors":"Chen Zhang, Q. Cao, Jie Yao, Yuanyuan Dong, Puyuan Yang","doi":"10.1145/3337821.3337860","DOIUrl":"https://doi.org/10.1145/3337821.3337860","url":null,"abstract":"Identifying key scenes in massive surveillance videos is extremely challenging because these scenes occur rarely while automotive identification using full-feature neural network (NN) models consumes immense computational resources. This paper proposes VScan, an efficient model-joint mechanism that adaptively schedules streams on a light-weight NN model and a full-feature NN model for analyzing videos concurrently. These two combined models with overlapped detectable objects are generic and well-developed. The former model fast scans videos to seek potential interest scenes. Only the streams with identified scenes are further analyzed by the latter model. We provide a model selection approach to select a light-weight model with an appropriate accuracy and high throughput. VScan further determines key parameters to correct predictions at runtime, thus guaranteeing the recall of target scenes. The full-feature model is responsible for ensuring output precision. To maintain a high hardware efficiency and utilization dynamically, VScan uses automatic sampling to reduce unnecessary computations, proposes stream scheduling to maximize hardware usage, and designs GPU scheduling to optimize the data processing flow. Experimental results show that benefitting from the model-joint mechanism and runtime scheduling optimizations, VScan significantly boosts the video processing throughput by up to 15x without key scene loss.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128882416","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yang Cheng, Dan Li, Z. Guo, Binyao Jiang, Jiaxin Lin, Xi Fan, Jinkun Geng, Xinyi Yu, Wei Bai, L. Qu, Ran Shu, Peng Cheng, Y. Xiong, Jianping Wu
In recent years, deep learning (DL) has prospered again thanks to improvements in both computing and learning theory. Emerging studies mostly focus on accelerating the refinement of DL models but ignore data preprocessing issues. However, data preprocessing can significantly affect the overall performance of end-to-end DL workflows. Our studies on several image DL workloads show that existing preprocessing backends are quite inefficient: they either perform poorly in throughput (30% degradation) or burn too many (>10) CPU cores. Based on these observations, we propose DLBooster, a high-performance data preprocessing pipeline that selectively offloads key workloads to FPGAs to meet the stringent data preprocessing demands of cutting-edge DL applications. Our testbed experiments show that, compared with existing baselines, DLBooster achieves 1.35×~2.4× image processing throughput in several DL workloads while consuming only 1/10 of the CPU cores, and it also reduces latency by 1/3 in online image inference.
{"title":"DLBooster","authors":"Yang Cheng, Dan Li, Z. Guo, Binyao Jiang, Jiaxin Lin, Xi Fan, Jinkun Geng, Xinyi Yu, Wei Bai, L. Qu, Ran Shu, Peng Cheng, Y. Xiong, Jianping Wu","doi":"10.1145/3337821.3337892","DOIUrl":"https://doi.org/10.1145/3337821.3337892","url":null,"abstract":"In recent years, deep learning (DL) has prospered again due to improvements in both computing and learning theory. Emerging studies mostly focus on the acceleration of refining DL models but ignore data preprocessing issues. However, data preprocessing can significantly affect the overall performance of end-to-end DL workflows. Our studies on several image DL workloads show that existing preprocessing backends are quite inefficient: they either perform poorly in throughput (30% degradation) or burn too many (>10) CPU cores. Based on these observations, we propose DLBooster, a high-performance data preprocessing pipeline that selectively offloads key workloads to FPGAs, to fit the stringent demands on data preprocessing for cutting-edge DL applications. Our testbed experiments show that, compared with the existing baselines, DLBooster can achieve 1.35×~2.4× image processing throughput in several DL workloads, but consumes only 1/10 CPU cores. Besides, it also reduces the latency by 1/3 in online image inference.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"52 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130451829","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Nicolas Denoyelle, Brice Goglin, E. Jeannot, Thomas Ropars
Nowadays, NUMA architectures are common in compute-intensive systems. Achieving high performance for multi-threaded applications requires both careful placement of threads on computing units and thorough allocation of data in memory. Finding such a placement is a hard problem, because performance depends on complex interactions across several layers of the memory hierarchy. In this paper we propose a black-box approach to decide whether an application's execution time can be impacted by the placement of its threads and data, and if so, to choose the best placement strategy to adopt. We show that it is possible to reach near-optimal placement policy selection. Furthermore, our solution works across several recent processor architectures, and decisions can be taken with a single low-overhead profiling run.
{"title":"Data and Thread Placement in NUMA Architectures: A Statistical Learning Approach","authors":"Nicolas Denoyelle, Brice Goglin, E. Jeannot, Thomas Ropars","doi":"10.1145/3337821.3337893","DOIUrl":"https://doi.org/10.1145/3337821.3337893","url":null,"abstract":"Nowadays, NUMA architectures are common in compute-intensive systems. Achieving high performance for multi-threaded application requires both a careful placement of threads on computing units and a thorough allocation of data in memory. Finding such a placement is a hard problem to solve, because performance depends on complex interactions in several layers of the memory hierarchy. In this paper we propose a black-box approach to decide if an application execution time can be impacted by the placement of its threads and data, and in such a case, to choose the best placement strategy to adopt. We show that it is possible to reach near-optimal placement policy selection. Furthermore, solutions work across several recent processor architectures and decisions can be taken with a single run of low overhead profiling.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129708757","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
With the integration of up to hundreds of cores in recent general-purpose processors used in parallel processing systems, it is critical to design scalable, low-latency networks-on-chip (NoCs) to support various on-chip communications. An effective way to reduce on-chip latency and improve network scalability is to add express links between pairs of non-adjacent routers. However, increasing the number of express links may reduce the bandwidth per link, due to the limited total bisection bandwidth on chip, and thus increase the serialization latency of packets in the network. Unlike previous works on application-specific designs or on fixed placement of express links, this paper aims at finding effective placements of express links for general-purpose processors while considering all possible placement options. We formulate the problem mathematically and propose an efficient algorithm that utilizes an initial-solution generation heuristic and an enhanced candidate generator in simulated annealing. Evaluation on 4x4, 8x8, and 16x16 networks using multi-threaded PARSEC benchmarks and various synthetic traffic patterns shows significant reductions in average packet latency over previous works.
{"title":"Express Link Placement for NoC-Based Many-Core Platforms","authors":"Yunfan Li, Di Zhu, Lizhong Chen","doi":"10.1145/3337821.3337877","DOIUrl":"https://doi.org/10.1145/3337821.3337877","url":null,"abstract":"With the integration of up to hundreds of cores in recent general-purpose processors that can be used in parallel processing systems, it is critical to design scalable and low-latency networks-on-chip (NoCs) to support various on-chip communications. An effective way to reduce on-chip latency and improve network scalability is to add express links between pairs of non-adjacent routers. However, increasing the number of express links may result in smaller bandwidth per link due to the limited total bisection bandwidth on chip, thus leading to higher serialization latency of packets in the network. Unlike previous works on application-specific designs or on fixed placement of express links, this paper aims at finding effective placement of express links for general-purpose processors considering all the possible placement options. We formulate the problem mathematically and propose an efficient algorithm that utilizes an initial solution generation heuristic and enhanced candidate generator in simulated annealing. Evaluation on 4x4, 8x8 and 16x16 networks using multi-threaded PARSEC benchmarks and various synthetic traffic patterns shows significant reduction of average packet latency over previous works.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"5 2","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"113932378","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}