Improving Utilization of Dataflow Unit for Multi-Batch Processing
Zhihua Fan, Wenming Li, Zhen Wang, Yu Yang, Xiaochun Ye, Dongrui Fan, Ninghui Sun, Xuejun An
Dataflow architectures can achieve much better performance and higher efficiency than general-purpose cores, approaching the performance of a specialized design while retaining programmability. However, advanced application scenarios place higher demands on the hardware in terms of cross-domain and multi-batch processing. In this paper, we propose a unified scale-vector architecture that can work in multiple modes and adapt efficiently to diverse algorithms and requirements. First, we propose a novel reconfigurable interconnection structure that can organize execution units into different cluster topologies to accommodate different degrees of data-level parallelism. Second, we decouple the threads within each DFG (dataflow graph) node into consecutive pipeline stages and provide architectural support for this decoupling. By time-multiplexing across these stages, the dataflow hardware achieves much higher utilization and performance. In addition, the task-based programming model can exploit multi-level parallelism and deploy applications efficiently. Evaluated on a wide range of benchmarks, including digital signal processing algorithms, CNNs, and scientific computing algorithms, our design attains up to 11.95× better energy efficiency (performance per watt) than a GPU (V100) and 2.01× better energy efficiency than state-of-the-art dataflow architectures.
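As a concrete illustration of the stage-level time-multiplexing described above, the following minimal C++ sketch models a software pipeline in which consecutive batches occupy consecutive DFG-node stages each cycle, so every stage unit stays busy once the pipeline fills. The three-stage load/compute/store split is an illustrative assumption, not the paper's exact decomposition.

```cpp
// Minimal host-side model of stage-level time-multiplexing across batches.
#include <cstdio>

enum Stage { LOAD = 0, COMPUTE = 1, STORE = 2, NUM_STAGES = 3 };

int main() {
    const int kBatches = 4;
    // At cycle t, batch b occupies stage (t - b) if 0 <= t - b < NUM_STAGES,
    // so different batches keep all stage units of a DFG node busy at once.
    for (int t = 0; t < kBatches + NUM_STAGES - 1; ++t) {
        printf("cycle %d:", t);
        for (int b = 0; b < kBatches; ++b) {
            int s = t - b;
            if (s >= 0 && s < NUM_STAGES)
                printf("  batch%d->stage%d", b, s);
        }
        printf("\n");
    }
    return 0;
}
```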
{"title":"Improving Utilization of Dataflow Unit for Multi-Batch Processing.","authors":"Zhihua Fan, Wenming Li, Zhen Wang, Yu Yang, Xiaochun Ye, Dongrui Fan, Ninghui Sun, Xuejun An","doi":"10.1145/3637906","DOIUrl":"https://doi.org/10.1145/3637906","url":null,"abstract":"<p>Dataflow architectures can achieve much better performance and higher efficiency than general-purpose core, approaching the performance of a specialized design while retaining programmability. However, advanced application scenarios place higher demands on the hardware in terms of cross-domain and multi-batch processing. In this paper, we propose a unified scale-vector architecture that can work in multiple modes and adapt to diverse algorithms and requirements efficiently. First, a novel reconfigurable interconnection structure is proposed, which can organize execution units into different cluster typologies as a way to accommodate different data-level parallelism. Second, we decouple threads within each DFG node into consecutive pipeline stages and provide architectural support. By time-multiplexing during these stages, dataflow hardware can achieve much higher utilization and performance. In addition, the task-based program model can also exploit multi-level parallelism and deploy applications efficiently. Evaluated in a wide range of benchmarks, including digital signal processing algorithms, CNNs, and scientific computing algorithms, our design attains up to 11.95 × energy efficiency (performance-per-watt) improvement over GPU (V100), and 2.01 × energy efficiency improvement over state-of-the-art dataflow architectures.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"4 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2023-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138716786","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
WA-Zone: Wear-Aware Zone Management Optimization for LSM-Tree on ZNS SSDs
Linbo Long, Shuiyong He, Jingcheng Shen, Renping Liu, Zhenhua Tan, Congming Gao, Duo Liu, Kan Zhong, Yi Jiang
ZNS SSDs divide the storage space into sequential-write zones, reducing the costs of DRAM utilization, garbage collection, and over-provisioning. The sequential-write feature of zones is well-suited for LSM-based databases, where random writes are organized into sequential writes to improve performance. However, the current compaction mechanism of LSM-trees results in widely varying access frequencies (i.e., hotness) of data and thus an extreme imbalance in the distribution of erase counts across zones. This imbalance significantly limits the lifetime of SSDs. Moreover, the current zone-reset method performs a large number of unnecessary erase operations on unused blocks, further shortening SSD lifetime.
Considering the access pattern of the LSM-tree, this paper proposes a wear-aware zone-management technique, termed WA-Zone, to effectively balance inter- and intra-zone wear in ZNS SSDs. In WA-Zone, a wear-aware zone allocator is first proposed to dynamically allocate data of different hotness to zones with corresponding remaining lifetimes, enabling an even distribution of erase counts across zones. Then, a partial-erase-based zone-reset method is presented to avoid unnecessary erase operations. Furthermore, because the novel zone-reset method might lead to an unbalanced distribution of erase counts across the blocks within a zone, a wear-aware block allocator is proposed. Experimental results based on the FEMU emulator demonstrate that WA-Zone extends ZNS-SSD lifetime by 5.23× compared with the baseline scheme.
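A minimal sketch of the wear-aware allocation idea, under the assumption that hotness is available as a normalized score and that hot (soon-to-be-invalidated) data should go to the least-worn zones; the Zone fields and pick_zone interface are hypothetical, not WA-Zone's actual implementation.

```cpp
// Hedged sketch of a wear-aware zone allocator in the spirit of WA-Zone:
// hot data lands in zones with the most remaining lifetime, cold data in
// the most worn zones, evening out erase counts over time.
#include <algorithm>
#include <cstdint>
#include <vector>

struct Zone {
    uint32_t id;
    uint32_t erase_count;  // erasures performed on this zone so far
};

// Pick a zone for data whose hotness is normalized to [0, 1]:
// hotness 1.0 -> least-worn zone, hotness 0.0 -> most-worn zone.
// Precondition: free_zones is non-empty.
uint32_t pick_zone(std::vector<Zone>& free_zones, double hotness) {
    std::sort(free_zones.begin(), free_zones.end(),
              [](const Zone& a, const Zone& b) {
                  return a.erase_count < b.erase_count;
              });
    size_t idx = static_cast<size_t>((1.0 - hotness) * (free_zones.size() - 1));
    return free_zones[idx].id;
}
```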
{"title":"WA-Zone: Wear-Aware Zone Management Optimization for LSM-Tree on ZNS SSDs","authors":"Linbo Long, Shuiyong He, Jingcheng Shen, Renping Liu, Zhenhua Tan, Congming Gao, Duo Liu, Kan Zhong, Yi Jiang","doi":"10.1145/3637488","DOIUrl":"https://doi.org/10.1145/3637488","url":null,"abstract":"<p>ZNS SSDs divide the storage space into sequential-write zones, reducing costs of DRAM utilization, garbage collection, and over-provisioning. The sequential-write feature of zones is well-suited for LSM-based databases, where random writes are organized into sequential writes to improve performance. However, the current compaction mechanism of LSM-tree results in widely varying access frequencies (i.e., hotness) of data and thus incurs an extreme imbalance in the distribution of erasure counts across zones. The imbalance significantly limits the lifetime of SSDs. Moreover, the current zone-reset method involves a large number of unnecessary erase operations on unused blocks, further shortening the SSD lifetime. </p><p>Considering the access pattern of LSM-tree, this paper proposes a wear-aware zone-management technique, termed <i>WA-Zone</i>, to effectively balance inter- and intra-zone wear in ZNS SSDs. In WA-Zone, a wear-aware zone allocator is first proposed to dynamically allocate data with different hotness to zones with corresponding lifetimes, enabling an even distribution of the erasure counts across zones. Then, a partial-erase-based zone-reset method is presented to avoid unnecessary erase operations. Furthermore, because the novel zone-reset method might lead to an unbalanced distribution of erasure counts across blocks in a zone, a wear-aware block allocator is proposed. Experimental results based on the <i>FEMU</i> emulator demonstrate the proposed WA-Zone enhances the ZNS-SSD lifetime by 5.23 ×, compared with the baseline scheme.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"104 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2023-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138631935","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
WIPE: a Write-Optimized Learned Index for Persistent Memory
Zhonghua Wang, Chen Ding, Fengguang Song, Kai Lu, Jiguang Wan, Zhihu Tan, Changsheng Xie, Guokuan Li
Learned indexes, which utilize machine learning models to accelerate locating positions in sorted data, have gained increasing attention in many big-data scenarios. Using efficient learned models, learned indexes build large nodes and flat structures, thereby greatly improving performance. However, most state-of-the-art learned indexes are designed for DRAM, and there is hence an urgent need for high-performance learned indexes on emerging Non-Volatile Memory (NVM). In this paper, we first evaluate and analyze the performance of existing learned indexes on NVM. We discover that these learned indexes encounter severe write amplification and write-performance degradation due to the requirement of maintaining large sorted/semi-sorted data nodes. To tackle these problems, we propose WIPE, a write-optimized persistent learned index with a novel three-tiered architecture that adopts unsorted fine-granularity data nodes to achieve high write performance on NVM. Within this design, we devise a new root-node construction algorithm to accelerate searching the numerous small data nodes. The algorithm ensures a stable flat structure and high read performance on large datasets by introducing an intermediate layer (i.e., index nodes) and achieving accurate prediction of index-node positions from the root node. Our extensive experiments on Intel DCPMM show that WIPE improves write throughput and read throughput by up to 3.9× and 7×, respectively, compared to state-of-the-art learned indexes. Also, WIPE can recover from a system crash in ∼18 ms. WIPE is freely available as an open-source software package.
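To make the learned-index mechanism concrete, here is a minimal sketch of model-predicted lookup with a bounded local search; the linear model form and fixed error bound are illustrative assumptions, and WIPE's three-tiered layout and NVM-specific details are omitted.

```cpp
// Core learned-index idea: a model predicts a key's position in a sorted
// array, then a short local search around the prediction corrects it.
#include <algorithm>
#include <vector>

struct LinearModel {
    double slope, intercept;  // trained offline so that pos ~ slope*key + intercept
    long predict(double key) const {
        return static_cast<long>(slope * key + intercept);
    }
};

// Look up 'key' in sorted 'keys', scanning at most ±err around the guess.
long lookup(const std::vector<double>& keys, const LinearModel& m,
            double key, long err) {
    const long n = static_cast<long>(keys.size());
    if (n == 0) return -1;
    long guess = std::min(std::max(m.predict(key), 0L), n - 1);
    for (long i = std::max(0L, guess - err);
         i <= std::min(n - 1, guess + err); ++i)
        if (keys[i] == key) return i;
    return -1;  // not present within the model's error bound
}
```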
{"title":"WIPE: a Write-Optimized Learned Index for Persistent Memory","authors":"Zhonghua Wang, Chen Ding, Fengguang Song, Kai Lu, Jiguang Wan, Zhihu Tan, Changsheng Xie, Guokuan Li","doi":"10.1145/3634915","DOIUrl":"https://doi.org/10.1145/3634915","url":null,"abstract":"<p>Learned Index, which utilizes effective machine learning models to accelerate locating sorted data positions, has gained increasing attention in many big data scenarios. Using efficient learned models, the learned indexes build large nodes and flat structures, thereby greatly improving the performance. However, most of the state-of-the-art learned indexes are designed for DRAM, and there is hence an urgent need to enable high-performance learned indexes for emerging Non-Volatile Memory (NVM). In this paper, we first evaluate and analyze the performance of the existing learned indexes on NVM. We discover that these learned indexes encounter severe write amplification and write performance degradation due to the requirements of maintaining large sorted/semi-sorted data nodes. To tackle the problems, we propose a novel three-tiered architecture of write-optimized persistent learned index, which is named <i>WIPE</i>, by adopting unsorted fine-granularity data nodes to achieve high write performance on NVM. Thereinto, we devise a new root node construction algorithm to accelerate searching numerous small data nodes. The algorithm ensures stable flat structure and high read performance in large-size datasets by introducing an intermediate layer (i.e., index nodes) and achieving accurate prediction of index node positions from the root node. Our extensive experiments on Intel DCPMM show that WIPE can improve write throughput and read throughput by up to 3.9 × and 7 ×, respectively, compared to the state-of-the-art learned indexes. Also, WIPE can recover from a system crash in ∼ 18<i>ms</i>. WIPE is free as an open-source software package<sup>1</sup>.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"77 1 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2023-11-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138529935","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Highly Efficient Self-Checking Matrix Multiplication on Tiled AMX Accelerators
Chandra Sekhar Mummidi, Victor C. Ferreira, Sudarshan Srinivasan, Sandip Kundu
General Matrix Multiplication (GEMM) is a computationally expensive operation used in many applications such as machine learning. Hardware accelerators are increasingly popular for speeding up GEMM computation, with Tiled Matrix Multiplication (TMUL) in recent Intel processors being an example. Unfortunately, the TMUL hardware is susceptible to errors, necessitating online error detection. Algorithm-Based Error Detection (ABED) is a powerful technique for detecting errors in matrix multiplication. In this paper, we consider an implementation of ABED that integrates seamlessly with the TMUL hardware to minimize performance overhead. Unfortunately, rounding errors introduced by floating-point operations do not allow a straightforward implementation of ABED in TMUL. Previous work addressed rounding errors in ABED with a fixed error bound: if the error-detection threshold is set too low, it triggers false alarms, while a loose bound allows errors to escape detection. In this paper, we propose an adaptive error threshold that takes the TMUL input values into account to address the problems of false triggers and error escapes, and we provide a taxonomy of the various error classes. This threshold is obtained from theoretical error analysis but is not easy to implement in hardware; consequently, we relax the threshold so that it can be computed easily in hardware. While ABED ensures error-free computation, it does not guarantee full coverage of all hardware faults. To address this problem, we propose an algorithmic pattern-generation technique to ensure full coverage of all hardware faults. To evaluate the benefits of our proposed solution, we conducted fault-injection experiments and show that our approach produces no false alarms and no detection escapes for observable errors. Additional fault-injection experiments on a Deep Neural Network (DNN) model show that when a fault is not detected, it does not cause any misclassification.
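The checksum invariant that ABED exploits is that the column sums of C = A×B must equal colsum(A)·B. The sketch below checks this invariant with a relative tolerance; the tolerance formula is a simple stand-in for the paper's adaptive, input-dependent threshold, and the scalar GEMM stands in for the TMUL tile computation.

```cpp
// Checksum-verified GEMM: verify sum_i C[i][j] == sum_k colsum_k(A) * B[k][j]
// for every column j, allowing for floating-point rounding.
#include <cmath>
#include <vector>

bool checked_gemm(const std::vector<float>& A, const std::vector<float>& B,
                  std::vector<float>& C, int n, float rel_tol) {
    // Plain n x n GEMM (stands in for the accelerated tile computation).
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            float acc = 0.f;
            for (int k = 0; k < n; ++k) acc += A[i * n + k] * B[k * n + j];
            C[i * n + j] = acc;
        }
    // Checksum test, column by column.
    for (int j = 0; j < n; ++j) {
        float direct = 0.f, via_checksum = 0.f, magnitude = 0.f;
        for (int i = 0; i < n; ++i) direct += C[i * n + j];
        for (int k = 0; k < n; ++k) {
            float colA = 0.f;
            for (int i = 0; i < n; ++i) colA += A[i * n + k];
            float term = colA * B[k * n + j];
            via_checksum += term;
            magnitude += std::fabs(term);  // scale for the relative tolerance
        }
        if (std::fabs(direct - via_checksum) > rel_tol * magnitude)
            return false;  // fault detected in column j
    }
    return true;
}
```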
{"title":"Highly Efficient Self-Checking Matrix Multiplication on Tiled AMX Accelerators","authors":"Chandra Sekhar Mummidi, Victor C. Ferreira, Sudarshan Srinivasan, Sandip Kundu","doi":"10.1145/3633332","DOIUrl":"https://doi.org/10.1145/3633332","url":null,"abstract":"<p>General Matrix Multiplication (GEMM) is a computationally expensive operation that is used in many applications such as machine-learning. Hardware accelerators are increasingly popular for speeding up GEMM computation, with Tiled Matrix Multiplication (TMUL) in recent Intel processors being an example. Unfortunately, the TMUL hardware is susceptible to errors necessitating online error detection. Algorithm-based Error Detection techniques (ABED) is a powerful technique to detect errors in matrix multiplications. In this paper, we consider implementation of ABED that integrates seamlessly with the TMUL hardware to minimize performance overhead. Unfortunately, rounding errors introduced by floating-point operations do not allow a straightforward implementation of ABED in TMUL. Previously an error bound was considered for addressing rounding errors in ABED. If the error detection threshold is set too low, it will trigger false alarm while a loose bound will allow errors to escape detection. In this paper, we propose an adaptive error threshold that takes into account the TMUL input values to address the problem of false triggers and error escapes, and provide a taxonomy of various error classes. This threshold is obtained from theoretical error analysis but is not easy to implement in hardware. Consequently, we relax the threshold such that it can be easily computed in hardware. While ABED ensures error free computation it does not guarantee full coverage of all hardware faults. To address this problem, we propose an algorithmic pattern-generation technique to ensure full coverage for all hardware faults. To evaluate the benefits of our proposed solution, we conducted fault injection experiments and show that our approach does not produce any false alarms or detection escapes for observable errors. We conducted additional fault injection experiments on a Deep Neural Network (DNN) model and find that if a fault is not detected, it does not cause any misclassification.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"14 2 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2023-11-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138529824","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Abakus: Accelerating k-mer Counting With Storage Technology
Lingxi Wu, Minxuan Zhou, Weihong Xu, Ashish Venkat, Tajana Rosing, Kevin Skadron
This work seeks to leverage Processing-with-Storage Technology (PWST) to accelerate a key bioinformatics kernel called k-mer counting, which involves processing large files of sequence data on disk to build a histogram of fixed-size genome-sequence substrings and thereby entails prohibitively high I/O overhead. In particular, this work proposes a set of accelerator designs called Abakus that offer varying tradeoffs in terms of performance, efficiency, and hardware implementation complexity. The key to these designs is a set of domain-specific hardware extensions that accelerate the key operations of k-mer counting at various levels of the SSD hierarchy, with the goal of enhancing the limited computing capabilities of conventional SSDs while exploiting the parallelism of multi-channel, multi-way SSDs. Our evaluation suggests that Abakus achieves 8.42×, 6.91×, and 2.32× speedups over CPU-, GPU-, and near-data-processing solutions, respectively.
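For reference, the core hash-and-increment loop that Abakus offloads into the storage hierarchy looks like the host-side sketch below; the std::unordered_map is an illustrative stand-in for its in-storage counting structures.

```cpp
// Baseline k-mer counting: slide a window of length k over each read and
// build a histogram of the substrings.
#include <string>
#include <unordered_map>
#include <vector>

std::unordered_map<std::string, long>
count_kmers(const std::vector<std::string>& reads, size_t k) {
    std::unordered_map<std::string, long> hist;
    for (const std::string& r : reads)
        for (size_t i = 0; i + k <= r.size(); ++i)
            ++hist[r.substr(i, k)];  // the hot loop Abakus accelerates in-SSD
    return hist;
}
```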
{"title":"Abakus: Accelerating k-mer Counting With Storage Technology","authors":"Lingxi Wu, Minxuan Zhou, Weihong Xu, Ashish Venkat, Tajana Rosing, Kevin Skadron","doi":"10.1145/3632952","DOIUrl":"https://doi.org/10.1145/3632952","url":null,"abstract":"<p>This work seeks to leverage Processing-with-storage-technology (PWST) to accelerate a key bioinformatics kernel called <i>k</i>-mer counting, which involves processing large files of sequence data on the disk to build a histogram of fixed-size genome sequence substrings and thereby entails prohibitively high I/O overhead. In particular, this work proposes a set of accelerator designs called Abakus that offer varying degrees of tradeoffs in terms of performance, efficiency, and hardware implementation complexity. The key to these designs is a set of domain-specific hardware extensions to accelerate the key operations for <i>k</i>-mer counting at various levels of the SSD hierarchy, with the goal of enhancing the limited computing capabilities of conventional SSDs, while exploiting the parallelism of the multi-channel, multi-way SSDs. Our evaluation suggests that Abakus can achieve 8.42 ×, 6.91 ×, and 2.32 × speedup over the CPU-, GPU-, and near-data processing solutions.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"55 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2023-11-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138529837","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Coherence Attacks and Countermeasures in Interposer-Based Chiplet Systems
Gino A. Chacon, Charles Williams, Johann Knechtel, Ozgur Sinanoglu, Paul V. Gratz, Vassos Soteriou
Industry is moving towards large-scale hardware systems that bundle processor cores, memories, accelerators, etc. via 2.5D integration. These components are fabricated separately as chiplets and then integrated using an interposer as an interconnect carrier. This new design style is beneficial in terms of yield and economies of scale, as chiplets may come from various vendors and are relatively easy to integrate into one larger, sophisticated system. However, the benefits of this approach come at the cost of new security challenges, especially when integrating chiplets from untrusted or not fully trusted third-party vendors.
In this work, we explore these challenges for modern interposer-based systems of cache-coherent, multi-core chiplets. First, we present basic coherence-oriented hardware Trojan attacks and demonstrate how they can be orchestrated to pose a significant threat to interposer-based systems. Second, we propose a novel scheme using an active interposer as a generic, secure-by-construction platform that forms a physical root of trust for modern 2.5D systems. The implementation of our scheme is confined to the interposer, resulting in little cost and leaving the chiplets and the coherence system untouched. We show that our scheme prevents a range of coherence attacks with low overheads on system performance (∼4%). Further, we demonstrate that our scheme scales efficiently as system size and memory capacity increase, resulting in reduced performance overheads.
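A heavily simplified sketch of the root-of-trust idea: the interposer checks each coherence request passing through it against a per-chiplet table of permitted address ranges and drops anything out of range. The single-range-per-chiplet table and interposer_allows interface are hypothetical simplifications, not the paper's actual mechanism.

```cpp
// Interposer-side filtering of coherence requests, by chiplet id.
#include <cstdint>
#include <vector>

struct Range { uint64_t base, limit; };  // [base, limit) permitted for a chiplet

// One permitted physical range per chiplet id (index into the table).
// A request outside its issuer's range is treated as malicious and dropped.
bool interposer_allows(const std::vector<Range>& table,
                       unsigned chiplet_id, uint64_t addr) {
    if (chiplet_id >= table.size()) return false;  // unknown chiplet
    const Range& r = table[chiplet_id];
    return addr >= r.base && addr < r.limit;
}
```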
{"title":"Coherence Attacks and Countermeasures in Interposer-Based Chiplet Systems","authors":"Gino A. Chacon, Charles Williams, Johann Knechtel, Ozgur Sinanoglu, Paul V. Gratz, Vassos Soteriou","doi":"10.1145/3633461","DOIUrl":"https://doi.org/10.1145/3633461","url":null,"abstract":"<p>Industry is moving towards large-scale hardware systems which bundle processor cores, memories, accelerators, etc. via 2.5D integration. These components are fabricated separately as chiplets and then integrated using an interposer as an interconnect carrier. This new design style is beneficial in terms of yield and economies of scale, as chiplets may come from various vendors and are relatively easy to integrate into one larger sophisticated system. However, the benefits of this approach come at the cost of new security challenges, especially when integrating chiplets that come from untrusted or not fully trusted, third- party vendors. </p><p>In this work, we explore these challenges for modern interposer-based systems of cache-coherent, multi-core chiplets. First, we present basic coherence-oriented hardware Trojan attacks that pose a significant threat to chiplet-based designs and demonstrate how these basic attacks can be orchestrated to pose a significant threat to interposer-based systems. Second, we propose a novel scheme using an active interposer as a generic, secure-by-construction platform that forms a physical root of trust for modern 2.5D systems. The implementation of our scheme is confined to the interposer, resulting in little cost and leaving the chiplets and coherence system untouched. We show that our scheme prevents a range of coherence attacks with low overheads on system performance, ∼ 4%. Further, we demonstrate that our scheme scales efficiently as system size and memory capacities increase, resulting in reduced performance overheads.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"75 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2023-11-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138529850","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
COWS for High Performance: Cost Aware Work Stealing for Irregular Parallel Loops
Prasoon Mishra, V. Krishna Nandivada
Parallel libraries such as OpenMP distribute the iterations of parallel-for loops among the threads using a programmer-specified scheduling policy. While the existing scheduling policies perform reasonably well in the context of balanced workloads, in computations that involve highly imbalanced workloads it is extremely non-trivial to obtain an efficient distribution of work (even using non-static scheduling methods like dynamic and guided). In this paper, we present a scheme called COst aware Work Stealing (COWS) to efficiently extend the idea of work-stealing to OpenMP.
In contrast to traditional work-stealing schedulers, COWS takes into consideration that (i) not all iterations of a parallel-for loop take the same amount of time; (ii) identifying a suitable victim for stealing is important for load balancing; and (iii) queues lead to significant overheads in traditional work-stealing and should be avoided. We present two variations of COWS: WSRI (a naive work-stealing scheme based on the number of remaining iterations) and WSRW (a work-stealing scheme based on the amount of remaining workload). Since in irregular loops, like those found in graph analytics, it is not possible to statically compute the cost of the iterations of a parallel-for loop, we use a combined compile-time + runtime approach, in which the remaining workload of a loop is computed efficiently at runtime using code generated by our compile-time component. We evaluated seven benchmark programs with five input datasets on two hardware platforms, across a varying number of threads, for a total of 275 configurations. In 225 of the 275 configurations, our approach achieves clear performance gains over the best OpenMP scheduling scheme for that configuration.
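A minimal sketch of WSRW-style victim selection, assuming per-thread counters that approximate remaining workload (standing in for the compiler-assisted workload estimates COWS computes at runtime): the thief steals from the thread with the most remaining work rather than from a random victim.

```cpp
// Workload-aware victim selection for work stealing.
#include <array>
#include <atomic>

constexpr int kThreads = 8;
// Updated by each worker as it retires iterations; approximate is fine.
std::array<std::atomic<long>, kThreads> remaining;

int pick_victim(int thief) {
    int victim = -1;
    long best = 0;
    for (int t = 0; t < kThreads; ++t) {
        if (t == thief) continue;
        long w = remaining[t].load(std::memory_order_relaxed);
        if (w > best) { best = w; victim = t; }  // largest remaining workload
    }
    return victim;  // -1 means nothing worth stealing
}
```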
{"title":"COWS for High Performance: Cost Aware Work Stealing for Irregular Parallel Loops: ACM Transactions on Architecture and Code Optimization: Vol 0, No ja","authors":"Prasoon Mishra, V. Krishna Nandivada","doi":"10.1145/3633331","DOIUrl":"https://doi.org/10.1145/3633331","url":null,"abstract":"<p>Parallel libraries such as OpenMP distribute the iterations of parallel-for-loops among the threads, using a programmer-specified scheduling policy. While the existing scheduling policies perform reasonably well in the context of balanced workloads, in computations that involve highly imbalanced workloads it is extremely non-trivial to obtain an efficient distribution of work (even using non-static scheduling methods like dynamic and guided). In this paper, we present a scheme called COst aware Work Stealing (COWS) to efficiently extend the idea of work-stealing to OpenMP. </p><p>In contrast to the traditional work-stealing schedulers, COWS takes into consideration that (i) not all iterations of a parallel-for-loops may take the same amount of time. (ii) identifying a suitable victim for stealing is important for load-balancing, and (iii) queues lead to significant overheads in traditional work-stealing and should be avoided. We present two variations of COWS: WSRI (a naive work-stealing scheme based on the number of remaining iterations) and WSRW (work-stealing scheme based on the amount of remaining workload). Since in irregular loops like those found in graph analytics, it is not possible to statically compute the cost of the iterations of the parallel-for-loops, we use a combined compile-time + runtime approach, where the remaining workload of a loop is computed efficiently at runtime by utilizing the code generated by our compile-time component. We have performed an evaluation over seven different benchmark programs, using five different input datasets, on two different hardware across a varying number of threads; leading to a total of 275 number of configurations. We show that in 225 out of 275 configurations, compared to the best OpenMP scheduling scheme for that configuration, our approach achieves clear performance gains.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"59 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2023-11-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138529848","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fast Convolution Meets Low Precision: Exploring Efficient Quantized Winograd Convolution on Modern CPUs
Xueying Wang, Guangli Li, Zhen Jia, Xiaobing Feng, Yida Wang
Low-precision computation has emerged as one of the most effective techniques for accelerating convolutional neural networks and has garnered widespread support on modern hardware. Despite this effectiveness, low-precision computation has not been commonly applied to fast convolution algorithms, such as the Winograd algorithm, due to numerical issues. In this paper, we propose an effective quantized Winograd convolution, named LoWino, which employs an in-side quantization method in the Winograd domain to reduce the precision loss caused by the transformations. Meanwhile, we present an efficient implementation that integrates well-designed optimization techniques, allowing us to fully exploit the capabilities of low-precision computation on modern CPUs. We evaluate LoWino on two Intel Xeon Scalable Processor platforms with representative convolutional layers and neural network models. The experimental results demonstrate that our approach achieves average operator speedups of 1.84× and 1.91× over state-of-the-art implementations in the vendor library while keeping accuracy loss at a reasonable level.
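As a worked example of the Winograd algorithm the paper builds on, the sketch below computes 1-D F(2,3): two outputs of a 3-tap filter with four multiplies instead of six. The element-wise products in the Winograd domain are where an in-side scheme like LoWino's would quantize; the float arithmetic here only illustrates the transforms, not the paper's quantization.

```cpp
// Winograd F(2,3): y = A^T [(G g) . (B^T d)] with the standard matrices,
// written out element-wise for clarity.
#include <cstdio>

int main() {
    float d[4] = {1.f, 2.f, 3.f, 4.f};    // input tile
    float g[3] = {0.5f, 0.25f, 0.125f};   // 3-tap filter
    // Input transform (B^T d) and filter transform (G g).
    float dt[4] = {d[0] - d[2], d[1] + d[2], d[2] - d[1], d[1] - d[3]};
    float gt[4] = {g[0], 0.5f * (g[0] + g[1] + g[2]),
                   0.5f * (g[0] - g[1] + g[2]), g[2]};
    // Element-wise products in the Winograd domain (quantization point).
    float m[4];
    for (int i = 0; i < 4; ++i) m[i] = dt[i] * gt[i];
    // Output transform (A^T m).
    float y0 = m[0] + m[1] + m[2];
    float y1 = m[1] - m[2] - m[3];
    printf("winograd: %f %f\n", y0, y1);
    printf("direct:   %f %f\n",                    // same result, 6 multiplies
           d[0]*g[0] + d[1]*g[1] + d[2]*g[2],
           d[1]*g[0] + d[2]*g[1] + d[3]*g[2]);
    return 0;
}
```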
{"title":"Fast Convolution Meets Low Precision: Exploring Efficient Quantized Winograd Convolution on Modern CPUs","authors":"Xueying Wang, Guangli Li, Zhen Jia, Xiaobing Feng, Yida Wang","doi":"10.1145/3632956","DOIUrl":"https://doi.org/10.1145/3632956","url":null,"abstract":"<p>Low-precision computation has emerged as one of the most effective techniques for accelerating convolutional neural networks and has garnered widespread support on modern hardware. Despite its effectiveness in accelerating convolutional neural networks, low-precision computation has not been commonly applied to fast convolutions, such as the Winograd algorithm, due to numerical issues. In this paper, we propose an effective quantized Winograd convolution, named LoWino, which employs an in-side quantization method in the Winograd domain to reduce the precision loss caused by transformations. Meanwhile, we present an efficient implementation that integrates well-designed optimization techniques, allowing us to fully exploit the capabilities of low-precision computation on modern CPUs. We evaluate LoWino on two Intel Xeon Scalable Processor platforms with representative convolutional layers and neural network models. The experimental results demonstrate that our approach can achieve an average of 1.84 × and 1.91 × operator speedups over state-of-the-art implementations in the vendor library while preserving accuracy loss at a reasonable level.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"9 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2023-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138529933","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
QoS-pro: A QoS-enhanced Transaction Processing Framework for Shared SSDs
Hao Fan, Yiliang Ye, Shadi Ibrahim, Zhuo Huang, Xingru Li, WeiBin Xue, Song Wu, Chen Yu, Xuanhua Shi, Hai Jin
Solid State Drives (SSDs) are widely used in data-intensive scenarios due to their high performance and decreasing cost. However, in shared environments, concurrent workloads can interfere with each other, leading to violations of Quality of Service (QoS). While QoS mechanisms such as fairness guarantees and latency constraints have been integrated into SSDs, existing transaction processing frameworks offer limited QoS guarantees and can significantly degrade overall performance in a shared environment. The reason is that the internal components of an SSD, originally designed to exploit parallelism, struggle to coordinate effectively when QoS mechanisms are applied to them. This paper proposes a novel QoS-enhanced transaction processing framework, called QoS-pro, which enhances QoS guarantees for concurrent workloads while maintaining high parallelism for SSDs. QoS-pro achieves this by redesigning transaction processing procedures to fully exploit the parallelism of shared SSDs and by enhancing QoS-oriented transaction translation and scheduling with parallelism features in mind. In terms of fairness guarantees, QoS-pro outperforms state-of-the-art methods, achieving a 96% fairness improvement and a 64% maximum-latency reduction. QoS-pro also shows almost no loss in throughput compared with parallelism-oriented methods. Additionally, QoS-pro triggers the fewest Garbage Collection (GC) operations and minimally affects concurrently running workloads during GC operations.
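For context, a common fairness metric in this literature (the paper's exact definition may differ) is the ratio of minimum to maximum per-flow slowdown, where a flow's slowdown is its shared-mode latency over its run-alone latency; 1.0 means perfectly fair.

```cpp
// Min/max-slowdown fairness across concurrent flows sharing an SSD.
#include <algorithm>
#include <vector>

double fairness(const std::vector<double>& shared_lat,
                const std::vector<double>& alone_lat) {
    double lo = 1e300, hi = 0.0;
    for (size_t i = 0; i < shared_lat.size(); ++i) {
        double slowdown = shared_lat[i] / alone_lat[i];  // >= 1.0 typically
        lo = std::min(lo, slowdown);
        hi = std::max(hi, slowdown);
    }
    return hi > 0.0 ? lo / hi : 1.0;  // 1.0 = all flows slowed equally
}
```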
{"title":"QoS-pro: A QoS-enhanced Transaction Processing Framework for Shared SSDs","authors":"Hao Fan, Yiliang Ye, Shadi Ibrahim, Zhuo Huang, Xingru Li, WeiBin Xue, Song Wu, Chen Yu, Xuanhua Shi, Hai Jin","doi":"10.1145/3632955","DOIUrl":"https://doi.org/10.1145/3632955","url":null,"abstract":"Solid State Drives (SSDs) are widely used in data-intensive scenarios due to their high performance and decreasing cost. However, in shared environments, concurrent workloads can interfere with each other, leading to a violation of Quality of Service (QoS). While QoS mechanisms like fairness guarantees and latency constraints have been integrated into SSDs, existing transaction processing frameworks offer limited QoS guarantees and can significantly degrade overall performance in a shared environment. The reason is that the internal components of an SSD, originally designed to exploit parallelism, struggle to coordinate effectively when QoS mechanisms are applied to them. This paper proposes a novel QoS -enhanced transaction pro cessing framework, called QoS-pro, which enhances QoS guarantees for concurrent workloads while maintaining high parallelism for SSDs. QoS-pro achieves this by redesigning transaction processing procedures to fully exploit the parallelism of shared SSDs and enhancing QoS-oriented transaction translation and scheduling with parallelism features in mind. In terms of fairness guarantees, QoS-pro outperforms state-of-the-art methods by achieving 96% fairness improvement and 64% maximum latency reduction. QoS-pro also shows almost no loss in throughput when compared with parallelism-oriented methods. Additionally, QoS-pro triggers the fewest Garbage Collection (GC) operations and minimally affects concurrently running workloads during GC operations.","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"47 4","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134901890","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fine-Grain Quantitative Analysis of Demand Paging in Unified Virtual Memory
Tyler Allen, Bennett Cooper, Rong Ge
The abstraction of a shared memory space over separate CPU and GPU memory domains has eased the burden of portability for many HPC codebases. However, users pay for the ease of use provided by system-managed memory with a moderate-to-high performance overhead. NVIDIA Unified Virtual Memory (UVM) is currently the primary real-world implementation of such an abstraction and offers a functionally equivalent testbed for in-depth performance studies of both UVM and future Linux Heterogeneous Memory Management (HMM)-compatible systems. The continued advocacy for UVM and HMM motivates improvement of the underlying system. We focus on UVM-based systems and investigate the root causes of UVM overhead, a non-trivial task due to the complex interactions of multiple hardware and software constituents and the desired cost granularity. In our prior work, we delved deeply into the UVM system architecture and showed the internal behavior of page-fault servicing in batches. We provided a quantitative evaluation of batch handling for various applications under different scenarios, including prefetching and oversubscription, and revealed that the driver workload depends on the interactions among application access patterns, GPU hardware constraints, and host OS components. Host OS components have significant overhead across implementations, warranting close attention. This extension furthers our prior study in three aspects: fine-grain cost analysis and breakdown, extension to multiple GPUs, and investigation of platforms with different GPU-GPU interconnects. We take a top-down approach to quantitative batch analysis and uncover how constituent component costs accumulate and overlap, governed by synchronous and asynchronous operations. Our multi-GPU analysis shows a reduced cost for GPU-GPU batch workloads compared to CPU-GPU workloads. We further demonstrate that while specialized interconnects such as NVLink can improve batch cost, their benefits are limited by host OS software overhead and GPU oversubscription. This study serves as a proxy for future shared memory systems, such as those interfacing with HMM, and for the development of interconnects.
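A minimal CUDA C++ example of the mechanisms under study: the kernel's first touches of managed memory trigger the demand paging and batched fault servicing analyzed in the paper, while cudaMemPrefetchAsync migrates pages up front and avoids those faults (error handling elided for brevity).

```cpp
// Build with nvcc. Demonstrates UVM demand paging vs. explicit prefetch.
#include <cuda_runtime.h>

__global__ void scale(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;  // first GPU touch faults unless prefetched
}

int main() {
    const int n = 1 << 24;
    float* x = nullptr;
    cudaMallocManaged(&x, n * sizeof(float));  // one shared address space
    for (int i = 0; i < n; ++i) x[i] = 1.0f;   // pages populated on the CPU

    int dev = 0;
    cudaGetDevice(&dev);
    // Optional: migrate pages to the GPU ahead of time instead of
    // fault-on-touch; comment this out to exercise demand paging.
    cudaMemPrefetchAsync(x, n * sizeof(float), dev);

    scale<<<(n + 255) / 256, 256>>>(x, n);
    cudaDeviceSynchronize();  // also makes x safe to touch on the CPU again
    cudaFree(x);
    return 0;
}
```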
{"title":"Fine-Grain Quantitative Analysis of Demand Paging in Unified Virtual Memory","authors":"Tyler Allen, Bennett Cooper, Rong Ge","doi":"10.1145/3632953","DOIUrl":"https://doi.org/10.1145/3632953","url":null,"abstract":"The abstraction of a shared memory space over separate CPU and GPU memory domains has eased the burden of portability for many HPC codebases. However, users pay for ease of use provided by system-managed memory with a moderate-to-high performance overhead. NVIDIA Unified Virtual Memory (UVM) is currently the primary real-world implementation of such abstraction and offers a functionally equivalent testbed for in-depth performance study for both UVM and future Linux Heterogeneous Memory Management (HMM) compatible systems. The continued advocacy for UVM and HMM motivates improvement of the underlying system. We focus on UVM-based systems and investigate root causes of UVM overhead, a non-trivial task due to complex interactions of multiple hardware and software constituents and the desired cost granularity. In our prior work, we delved deeply into UVM system architecture and showed internal behaviors of page fault servicing in batches. We provided quantitative evaluation of batch handling for various applications under different scenarios, including prefetching and oversubscription. We revealed the driver workload depends on the interactions among application access patterns, GPU hardware constraints, and host OS components. Host OS components have significant overhead present across implementations, warranting close attention. This extension furthers our prior study in three aspects: fine-grain cost analysis and breakdown, extension to multiple GPUs, and investigation of platforms with different GPU-GPU interconnects. We take a top-down approach to quantitative batch analysis and uncover how constituent component costs accumulate and overlap governed by synchronous and asynchronous operations. Our multi-GPU analysis shows reduced cost of GPU-GPU batch workloads compared to CPU-GPU workloads. We further demonstrate that while specialized interconnects, NVLink, can improve batch cost, their benefits are limited by host OS software overhead and GPU oversubscription. This study serves as a proxy for future shared memory systems, such as those that interface with HMM, and the development of interconnects.","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"11 4","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134991126","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}