It is common to find a mixture of long batch jobs and latency-sensitive short jobs in enterprise data centers. Recently, hybrid job schedulers have emerged as attractive alternatives to conventional centralized job schedulers. In this paper, we conduct trace-driven experiments to study the job-completion-delay performance of two representative hybrid job schedulers (Hawk and Eagle), and find that short jobs still suffer from long latency due to the fluctuating, bursty nature of workloads. To this end, we propose Dice, a general performance optimization framework for hybrid job schedulers that alleviates the high job completion delays of short jobs. Dice is composed of two simple yet effective techniques: Elastic Sizing and Opportunistic Preemption. Both techniques keep track of the task waiting times of short jobs. When the mean task waiting time of short jobs is high, Elastic Sizing dynamically and adaptively increases the short partition size to prioritize short jobs over long jobs. Opportunistic Preemption, on the other hand, preempts resources from long tasks running in the general partition on demand, so as to mitigate the head-of-line blocking of short jobs. We enhance the two schedulers with Dice and evaluate its performance improvement in our prototype implementation. Experimental results under the Google trace show that Dice improves the 50th-, 75th-, and 90th-percentile job completion delays of short jobs by 50.9%, 54.5%, and 43.5% in Hawk, and by 33.2%, 74.1%, and 85.3% in Eagle, at low performance cost to long jobs.
{"title":"Improving Short Job Latency Performance in Hybrid Job Schedulers with Dice","authors":"Wei Zhou, K. White, Hongfeng Yu","doi":"10.1145/3337821.3337851","DOIUrl":"https://doi.org/10.1145/3337821.3337851","url":null,"abstract":"It is common to find a mixture of both long batch jobs and latency-sensitive short jobs in enterprise data centers. Recently hybrid job schedulers emerge as attractive alternatives of conventional centralized job schedulers. In this paper, we conduct trace-driven experiments to study the job-completion-delay performance of two representative hybrid job schedulers (Hawk and Eagle), and find that short jobs still encounter long latency issues due to fluctuating bursty nature of workloads. To this end, we propose Dice, a general performance optimization framework for hybrid job schedulers, to alleviate the high job-completion-delay problem of short jobs. Dice is composed of two simple yet effective techniques: Elastic Sizing and Opportunistic Preemption. Both Elastic Sizing and Opportunistic Preemption keep track of the task waiting times of short jobs. When the mean task waiting time of short jobs is high, Elastic Sizing dynamically and adaptively increases the short partition size to prioritize short jobs over long jobs. On the other hand, Opportunistic Preemption preempts resources from long tasks running in the general partition on demand, so as to mitigate the \"head-of-line\" blocking problem of short jobs. We enhance the two schedulers with Dice and evaluate Dice performance improvement in our prototype implementation. Experiment results show that Dice achieves 50.9%, 54.5%, and 43.5% improvement on 50th-percentile, 75th-percentile, and 90th-percentile job completion delays of short jobs in Hawk respectively, as well as 33.2%, 74.1%, and 85.3% improvement on those in Eagle respectively under the Google trace, at low performance costs to long jobs.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"82 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131757672","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Xi Wang, Antonino Tumeo, John D. Leidel, Jie Li, Yong Chen
Emerging data-intensive applications, such as graph analytics and data mining, exhibit irregular memory access patterns. Research has shown that for these memory-bound applications, traditional cache-based processor architectures, which exploit locality and regular access patterns to mitigate the memory-wall problem, are inefficient. Meanwhile, novel 3D-stacked memory devices, such as the Hybrid Memory Cube (HMC) and High Bandwidth Memory (HBM), promise significant increases in bandwidth that are extremely appealing for memory-bound applications. However, conventional memory interfaces designed for cache-based architectures and JEDEC DDR devices fit poorly with 3D-stacked memory, leading to significant under-utilization of the promised high bandwidth. In response to these issues, this paper proposes MAC (Memory Access Coalescer), a coalescing unit for 3D-stacked memory. We discuss the design and implementation of MAC in the context of a custom-designed, cache-less architecture targeted at data-intensive, irregular applications. Through a custom simulation infrastructure based on the RISC-V toolchain, we show that MAC achieves a coalescing efficiency of 52.85% on average and improves the performance of the memory system by 60.73% on average for a large set of irregular workloads.
{"title":"MAC: Memory Access Coalescer for 3D-Stacked Memory","authors":"Xi Wang, Antonino Tumeo, John D. Leidel, Jie Li, Yong Chen","doi":"10.1145/3337821.3337867","DOIUrl":"https://doi.org/10.1145/3337821.3337867","url":null,"abstract":"Emerging data-intensive applications, such as graph analytics and data mining, exhibit irregular memory access patterns. Research has shown that with these memory-bound applications, traditional cache-based processor architectures, which exploit locality and regular patterns to mitigate the memory-wall issue, are inefficient. Meantime, novel 3D-stacked memory devices, such as Hybrid Memory Cube (HMC) and High Bandwidth Memory (HBM), promise significant increases in bandwidth that appear extremely appealing for memory-bound applications. However, conventional memory interfaces designed for cache-based architectures and JEDEC DDR devices fit poorly with the 3D-stacked memory, which leads to significant under-utilization of the promised high bandwidth. As a response to these issues, in this paper we propose MAC (Memory Access Coalescer), a coalescing unit for the 3D-stacked memory. We discuss the design and implementation of MAC, in the context of a custom designed cache-less architecture targeted at data-intensive, irregular applications. Through a custom simulation infrastructure based on the RISC-V toolchain, we show that MAC achieves a coalescing efficiency of 52.85% on average. It improves the performance of the memory system by 60.73% on average for a large set of irregular workloads.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129643745","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sandeep Madireddy, Prasanna Balaprakash, P. Carns, R. Latham, Glenn K. Lockwood, R. Ross, S. Snyder, Stefan M. Wild
Supervised learning is a promising approach for modeling the performance of applications running on large HPC systems. A key assumption in supervised learning is that the training and testing data are obtained under the same conditions. However, in production HPC systems these conditions might not hold because the conditions of the platform can change over time as a result of hardware degradation, hardware replacement, software upgrade, and configuration updates. These changes could alter the data distribution in a way that affects the accuracy of the predictive performance models and render them less useful; this phenomenon is referred to as concept drift. Ignoring concept drift can lead to suboptimal resource usage and decreased efficiency when those performance models are deployed for tuning and job scheduling in production systems. To address this issue, we propose a concept-drift-aware predictive modeling approach that comprises two components: (1) an online Bayesian changepoint detection method that can automatically identify the location of events that lead to concept drift in near-real time and (2) a moment-matching transformation inspired by transfer learning that converts the training data collected before the drift to be useful for retraining. We use application input/output performance data collected on Cori, a production supercomputing system at the National Energy Research Scientific Computing Center, to demonstrate the effectiveness of our approach. The results show that concept-drift-aware models obtain significant improvement in accuracy; the median absolute error of the best-performing Gaussian process regression improved by 58.8% when the proposed approaches were used.
{"title":"Adaptive Learning for Concept Drift in Application Performance Modeling","authors":"Sandeep Madireddy, Prasanna Balaprakash, P. Carns, R. Latham, Glenn K. Lockwood, R. Ross, S. Snyder, Stefan M. Wild","doi":"10.1145/3337821.3337922","DOIUrl":"https://doi.org/10.1145/3337821.3337922","url":null,"abstract":"Supervised learning is a promising approach for modeling the performance of applications running on large HPC systems. A key assumption in supervised learning is that the training and testing data are obtained under the same conditions. However, in production HPC systems these conditions might not hold because the conditions of the platform can change over time as a result of hardware degradation, hardware replacement, software upgrade, and configuration updates. These changes could alter the data distribution in a way that affects the accuracy of the predictive performance models and render them less useful; this phenomenon is referred to as concept drift. Ignoring concept drift can lead to suboptimal resource usage and decreased efficiency when those performance models are deployed for tuning and job scheduling in production systems. To address this issue, we propose a concept-drift-aware predictive modeling approach that comprises two components: (1) an online Bayesian changepoint detection method that can automatically identify the location of events that lead to concept drift in near-real time and (2) a moment-matching transformation inspired by transfer learning that converts the training data collected before the drift to be useful for retraining. We use application input/output performance data collected on Cori, a production supercomputing system at the National Energy Research Scientific Computing Center, to demonstrate the effectiveness of our approach. The results show that concept-drift-aware models obtain significant improvement in accuracy; the median absolute error of the best-performing Gaussian process regression improved by 58.8% when the proposed approaches were used.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131156833","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hyunjun Kim, Sungin Hong, Hyeonsu Lee, Euiseong Seo, Hwansoo Han
Modern GPUs concurrently deploy thousands of threads to maximize thread-level parallelism (TLP) for performance. For some applications, however, maximized TLP leads to significant performance degradation, as many concurrent threads compete for the limited capacity of the data cache. In this paper, we propose a compiler-assisted thread throttling scheme that limits the number of active thread groups to reduce cache contention and consequently improve performance. Several dynamic thread throttling schemes have been proposed to alleviate cache contention by monitoring cache behavior, but they often fail to respond in time to dynamic changes in that behavior, since they adjust the parallelism only after the monitored behavior has changed. Our thread throttling scheme instead relies on compile-time adjustment of active thread groups to fit their memory footprints to the L1D capacity. We evaluated the proposed scheme with GPU programs that suffer from cache contention. Our approach improved the performance of the original programs by 42.96% on average, an 8.97% performance boost compared to static thread throttling schemes.
{"title":"Compiler-Assisted GPU Thread Throttling for Reduced Cache Contention","authors":"Hyunjun Kim, Sungin Hong, Hyeonsu Lee, Euiseong Seo, Hwansoo Han","doi":"10.1145/3337821.3337886","DOIUrl":"https://doi.org/10.1145/3337821.3337886","url":null,"abstract":"Modern GPUs concurrently deploy thousands of threads to maximize thread level parallelism (TLP) for performance. For some applications, however, maximized TLP leads to significant performance degradation, as many concurrent threads compete for the limited amount of the data cache. In this paper, we propose a compiler-assisted thread throttling scheme, which limits the number of active thread groups to reduce cache contention and consequently improve the performance. A few dynamic thread throttling schemes have been proposed to alleviate cache contention by monitoring the cache behavior, but they often fail to provide timely responses to the dynamic changes in the cache behavior, as they adjust the parallelism afterwards in response to the monitored behavior. Our thread throttling scheme relies on compile-time adjustment of active thread groups to fit their memory footprints to the L1D capacity. We evaluated the proposed scheme with GPU programs that suffer from cache contention. Our approach improved the performance of original programs by 42.96% on average, and this is 8.97% performance boost in comparison to the static thread throttling schemes.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134158668","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Maria Malik, Hassan Ghasemzadeh, T. Mohsenin, Rosario Cammarota, Liang Zhao, Avesta Sasan, H. Homayoun, S. Rafatirad
Datacenters provide high performance and flexibility for users and cost efficiency for operators. Hyperscale datacenters harness massively scalable compute resources for large-scale data analysis. However, cloud and datacenter infrastructure does not scale as fast as the input data volume and computational requirements of big data and analytics technologies. As a result, more applications need to share CPUs at the node level, which can have a large impact on performance and operational cost. To address this challenge, we show in this paper that concurrently fine-tuning parameters at the application, microarchitecture, and system levels creates opportunities to co-locate applications at the node level and improve the energy efficiency of the server while maintaining performance. Co-locating and self-tuning unknown applications are challenging problems, especially when multiple big data applications are co-located concurrently with many tuning knobs, potentially requiring an exhaustive brute-force search to find the right settings. This challenge creates a pressing need for a technique that co-locates applications at the node level and predicts the optimal system-, architecture-, and application-level configuration parameters to achieve maximum energy efficiency. We promote the scale-down of computational nodes by presenting the Energy-Efficient Co-Locating and Self-Tuning (ECoST) technique for data-intensive applications. A proof of concept of ECoST was successfully tested on the MapReduce platform; ECoST can also be deployed on other data-intensive frameworks that expose several parameters for power and performance tuning. ECoST collects run-time hardware performance counter data and implements various machine learning models, from ones as simple as a lookup table or a decision tree to ones as complex as a neural network, to predict the energy efficiency of co-located applications. Experimental data show that ECoST achieves energy efficiency within 4% of the upper-bound results when co-locating multiple applications at a node level. ECoST is also scalable, staying within 8% of the upper bound on an 8-node server.
{"title":"ECoST: Energy-Efficient Co-Locating and Self-Tuning MapReduce Applications","authors":"Maria Malik, Hassan Ghasemzadeh, T. Mohsenin, Rosario Cammarota, Liang Zhao, Avesta Sasan, H. Homayoun, S. Rafatirad","doi":"10.1145/3337821.3337834","DOIUrl":"https://doi.org/10.1145/3337821.3337834","url":null,"abstract":"Datacenters provide high performance and flexibility for users and cost efficiency for operators. Hyperscale datacenters are harnessing massively scalable computer resources for large-scale data analysis. However, cloud/datacenter infrastructure does not scale as fast as the input data volume and computational requirements of big data and analytics technologies. Thus, more applications need to share CPU at the node level that could have large impact on performance and operational cost. To address this challenge, in this paper we show that, concurrently fine-tune parameters at the application, microarchitecture, and system levels are creating opportunities to co-locate applications at the node level and improve energy-efficiency of the server while maintaining performance. Co-locating and self-tuning of unknown applications are challenging problems, especially when co-locating multiple big data applications concurrently with many tuning knobs, potentially requiring exhaustive brute-force search to find the right settings. This research challenge upsurges an imminent need to develop a technique that co-locates applications at a node level and predict the optimal system, architecture and application level configure parameters to achieve the maximum energy efficiency. It promotes the scale-down of computational nodes by presenting the Energy-Efficient Co-Locating and Self-Tuning (ECoST) technique for data intensive applications. ECoST proof of concept was successfully tested on MapReduce platform. ECoST can also be deployed on other data-intensive frameworks where there are several parameters for power and performance tuning optimizations. ECoST collects run-time hardware performance counter data and implements various machine learning models from as simple as a lookup table or decision tree based to as complex as neural network based to predict the energy-efficiency of co-located applications. Experimental data show energy efficiency is achieved within 4% of the upper bound results when co-locating multiple applications at a node level. ECoST is also scalable, being within 8% of upper bound on an 8-node server.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"96 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132784496","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ji Zhang, Ke Zhou, Ping Huang, Xubin He, Zhili Xiao, Bin Cheng, Yongguang Ji, Yinhu Wang
Storage systems in large-scale data centers are typically built upon thousands or even millions of disks, where disk failures constantly happen. A disk failure can lead to serious data loss and thus system unavailability, or even catastrophic consequences if the lost data cannot be recovered. While replication and erasure coding techniques have been widely deployed to guarantee storage availability and reliability, disk failure prediction is gaining popularity because it has the potential to prevent disk failures from occurring in the first place. Recent trends have turned toward applying machine learning approaches based on disk SMART attributes for disk failure prediction. However, traditional machine learning (ML) approaches require a large set of training data in order to deliver good predictive performance. In large-scale storage systems, new disks enter gradually to augment the storage capacity or to replace failed disks, so over time storage systems come to contain small numbers of new disks from different vendors and/or different models from the same vendor. We refer to these relatively small populations of disks as minority disks. Due to the lack of sufficient training data, traditional ML approaches fail to deliver satisfactory predictive performance in evolving storage systems that consist of heterogeneous minority disks. To address this challenge and improve the predictive performance for minority disks in large data centers, we propose a minority disk failure prediction model named TLDFP based on a transfer learning approach. Our evaluation results on two realistic datasets demonstrate that TLDFP delivers much more precise predictions than four popular prediction models based on traditional ML algorithms and two state-of-the-art transfer learning methods.
{"title":"Transfer Learning based Failure Prediction for Minority Disks in Large Data Centers of Heterogeneous Disk Systems","authors":"Ji Zhang, Ke Zhou, Ping Huang, Xubin He, Zhili Xiao, Bin Cheng, Yongguang Ji, Yinhu Wang","doi":"10.1145/3337821.3337881","DOIUrl":"https://doi.org/10.1145/3337821.3337881","url":null,"abstract":"The storage system in large scale data centers is typically built upon thousands or even millions of disks, where disk failures constantly happen. A disk failure could lead to serious data loss and thus system unavailability or even catastrophic consequences if the lost data cannot be recovered. While replication and erasure coding techniques have been widely deployed to guarantee storage availability and reliability, disk failure prediction is gaining popularity as it has the potential to prevent disk failures from occurring in the first place. Recent trends have turned toward applying machine learning approaches based on disk SMART attributes for disk failure predictions. However, traditional machine learning (ML) approaches require a large set of training data in order to deliver good predictive performance. In large-scale storage systems, new disks enter gradually to augment the storage capacity or to replace failed disks, leading storage systems to consist of small amounts of new disks from different vendors and/or different models from the same vendor as time goes on. We refer to this relatively small amount of disks as minority disks. Due to the lack of sufficient training data, traditional ML approaches fail to deliver satisfactory predictive performance in evolving storage systems which consist of heterogeneous minority disks. To address this challenge and improve the predictive performance for minority disks in large data centers, we propose a minority disk failure prediction model named TLDFP based on a transfer learning approach. Our evaluation results on two realistic datasets have demonstrated that TLDFP can deliver much more precise results, compared to four popular prediction models based on traditional ML algorithms and two state-of-the-art transfer learning methods.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127474226","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Chen Zhang, Q. Cao, Jie Yao, Yuanyuan Dong, Puyuan Yang
Identifying key scenes in massive surveillance videos is extremely challenging because these scenes occur rarely, while automatic identification using full-feature neural network (NN) models consumes immense computational resources. This paper proposes VScan, an efficient model-joint mechanism that adaptively schedules streams on a lightweight NN model and a full-feature NN model for analyzing videos concurrently. The two combined models, whose sets of detectable objects overlap, are generic and well developed. The former quickly scans videos to find potential scenes of interest; only the streams with identified scenes are further analyzed by the latter. We provide a model selection approach to choose a lightweight model with appropriate accuracy and high throughput. VScan further determines key parameters to correct predictions at runtime, thus guaranteeing the recall of target scenes, while the full-feature model is responsible for ensuring output precision. To maintain high hardware efficiency and utilization dynamically, VScan uses automatic sampling to reduce unnecessary computation, stream scheduling to maximize hardware usage, and GPU scheduling to optimize the data processing flow. Experimental results show that, benefiting from the model-joint mechanism and runtime scheduling optimizations, VScan boosts video processing throughput by up to 15x without losing key scenes.
{"title":"VScan","authors":"Chen Zhang, Q. Cao, Jie Yao, Yuanyuan Dong, Puyuan Yang","doi":"10.1145/3337821.3337860","DOIUrl":"https://doi.org/10.1145/3337821.3337860","url":null,"abstract":"Identifying key scenes in massive surveillance videos is extremely challenging because these scenes occur rarely while automotive identification using full-feature neural network (NN) models consumes immense computational resources. This paper proposes VScan, an efficient model-joint mechanism that adaptively schedules streams on a light-weight NN model and a full-feature NN model for analyzing videos concurrently. These two combined models with overlapped detectable objects are generic and well-developed. The former model fast scans videos to seek potential interest scenes. Only the streams with identified scenes are further analyzed by the latter model. We provide a model selection approach to select a light-weight model with an appropriate accuracy and high throughput. VScan further determines key parameters to correct predictions at runtime, thus guaranteeing the recall of target scenes. The full-feature model is responsible for ensuring output precision. To maintain a high hardware efficiency and utilization dynamically, VScan uses automatic sampling to reduce unnecessary computations, proposes stream scheduling to maximize hardware usage, and designs GPU scheduling to optimize the data processing flow. Experimental results show that benefitting from the model-joint mechanism and runtime scheduling optimizations, VScan significantly boosts the video processing throughput by up to 15x without key scene loss.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128882416","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yang Cheng, Dan Li, Z. Guo, Binyao Jiang, Jiaxin Lin, Xi Fan, Jinkun Geng, Xinyi Yu, Wei Bai, L. Qu, Ran Shu, Peng Cheng, Y. Xiong, Jianping Wu
In recent years, deep learning (DL) has prospered again thanks to improvements in both computing and learning theory. Emerging studies mostly focus on accelerating the refinement of DL models but ignore data preprocessing issues. However, data preprocessing can significantly affect the overall performance of end-to-end DL workflows. Our studies on several image DL workloads show that existing preprocessing backends are quite inefficient: they either perform poorly in throughput (30% degradation) or burn too many (>10) CPU cores. Based on these observations, we propose DLBooster, a high-performance data preprocessing pipeline that selectively offloads key workloads to FPGAs to meet the stringent data preprocessing demands of cutting-edge DL applications. Our testbed experiments show that, compared with existing baselines, DLBooster achieves 1.35×~2.4× image processing throughput in several DL workloads while consuming only 1/10 of the CPU cores, and it also reduces latency by 1/3 in online image inference.
{"title":"DLBooster","authors":"Yang Cheng, Dan Li, Z. Guo, Binyao Jiang, Jiaxin Lin, Xi Fan, Jinkun Geng, Xinyi Yu, Wei Bai, L. Qu, Ran Shu, Peng Cheng, Y. Xiong, Jianping Wu","doi":"10.1145/3337821.3337892","DOIUrl":"https://doi.org/10.1145/3337821.3337892","url":null,"abstract":"In recent years, deep learning (DL) has prospered again due to improvements in both computing and learning theory. Emerging studies mostly focus on the acceleration of refining DL models but ignore data preprocessing issues. However, data preprocessing can significantly affect the overall performance of end-to-end DL workflows. Our studies on several image DL workloads show that existing preprocessing backends are quite inefficient: they either perform poorly in throughput (30% degradation) or burn too many (>10) CPU cores. Based on these observations, we propose DLBooster, a high-performance data preprocessing pipeline that selectively offloads key workloads to FPGAs, to fit the stringent demands on data preprocessing for cutting-edge DL applications. Our testbed experiments show that, compared with the existing baselines, DLBooster can achieve 1.35×~2.4× image processing throughput in several DL workloads, but consumes only 1/10 CPU cores. Besides, it also reduces the latency by 1/3 in online image inference.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"52 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130451829","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Nicolas Denoyelle, Brice Goglin, E. Jeannot, Thomas Ropars
Nowadays, NUMA architectures are common in compute-intensive systems. Achieving high performance for multi-threaded applications requires both careful placement of threads on computing units and thorough allocation of data in memory. Finding such a placement is a hard problem, because performance depends on complex interactions across several layers of the memory hierarchy. In this paper we propose a black-box approach to decide whether an application's execution time can be impacted by the placement of its threads and data, and if so, to choose the best placement strategy to adopt. We show that it is possible to reach near-optimal placement policy selection. Furthermore, our solution works across several recent processor architectures, and decisions can be taken with a single low-overhead profiling run.
{"title":"Data and Thread Placement in NUMA Architectures: A Statistical Learning Approach","authors":"Nicolas Denoyelle, Brice Goglin, E. Jeannot, Thomas Ropars","doi":"10.1145/3337821.3337893","DOIUrl":"https://doi.org/10.1145/3337821.3337893","url":null,"abstract":"Nowadays, NUMA architectures are common in compute-intensive systems. Achieving high performance for multi-threaded application requires both a careful placement of threads on computing units and a thorough allocation of data in memory. Finding such a placement is a hard problem to solve, because performance depends on complex interactions in several layers of the memory hierarchy. In this paper we propose a black-box approach to decide if an application execution time can be impacted by the placement of its threads and data, and in such a case, to choose the best placement strategy to adopt. We show that it is possible to reach near-optimal placement policy selection. Furthermore, solutions work across several recent processor architectures and decisions can be taken with a single run of low overhead profiling.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129708757","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
With the integration of up to hundreds of cores in recent general-purpose processors used in parallel processing systems, it is critical to design scalable, low-latency networks-on-chip (NoCs) to support various on-chip communications. An effective way to reduce on-chip latency and improve network scalability is to add express links between pairs of non-adjacent routers. However, increasing the number of express links may reduce the bandwidth per link, due to the limited total bisection bandwidth on chip, and thus increase the serialization latency of packets in the network. Unlike previous works on application-specific designs or on fixed placement of express links, this paper aims at finding effective placements of express links for general-purpose processors while considering all possible placement options. We formulate the problem mathematically and propose an efficient algorithm that utilizes an initial-solution generation heuristic and an enhanced candidate generator in simulated annealing. Evaluation on 4x4, 8x8, and 16x16 networks using multi-threaded PARSEC benchmarks and various synthetic traffic patterns shows significant reductions in average packet latency over previous works.
{"title":"Express Link Placement for NoC-Based Many-Core Platforms","authors":"Yunfan Li, Di Zhu, Lizhong Chen","doi":"10.1145/3337821.3337877","DOIUrl":"https://doi.org/10.1145/3337821.3337877","url":null,"abstract":"With the integration of up to hundreds of cores in recent general-purpose processors that can be used in parallel processing systems, it is critical to design scalable and low-latency networks-on-chip (NoCs) to support various on-chip communications. An effective way to reduce on-chip latency and improve network scalability is to add express links between pairs of non-adjacent routers. However, increasing the number of express links may result in smaller bandwidth per link due to the limited total bisection bandwidth on chip, thus leading to higher serialization latency of packets in the network. Unlike previous works on application-specific designs or on fixed placement of express links, this paper aims at finding effective placement of express links for general-purpose processors considering all the possible placement options. We formulate the problem mathematically and propose an efficient algorithm that utilizes an initial solution generation heuristic and enhanced candidate generator in simulated annealing. Evaluation on 4x4, 8x8 and 16x16 networks using multi-threaded PARSEC benchmarks and various synthetic traffic patterns shows significant reduction of average packet latency over previous works.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"5 2","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"113932378","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}