
Latest publications: 2015 IEEE International Parallel and Distributed Processing Symposium Workshop

Graph Coloring on the GPU and Some Techniques to Improve Load Imbalance
Pub Date : 2015-05-25 DOI: 10.1109/IPDPSW.2015.74
Shuai Che, Gregory P. Rodgers, Bradford M. Beckmann, S. Reinhardt
Graphics processing units (GPUs) have been increasingly used to accelerate irregular applications such as graph and sparse-matrix computation. Graph coloring is a key building block for many graph applications. The first step of many graph applications is graph coloring/partitioning to obtain sets of independent vertices for subsequent parallel computations. However, parallelization and optimization of coloring for GPUs have been a challenge for programmers. This paper studies approaches to implementing graph coloring on a GPU and characterizes their program behaviors with different graph structures. We also investigate load imbalance, which can be the main cause for performance bottlenecks. We evaluate the effectiveness of different optimization techniques, including the use of work stealing and the design of a hybrid algorithm. We are able to improve graph coloring performance by approximately 25% compared to a baseline GPU implementation on an AMD Radeon HD 7950 GPU. We also analyze some important factors affecting performance.
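As a rough illustration of the kind of data-parallel coloring studied here (the paper's actual GPU kernels are not reproduced in this listing), the following Jones-Plassmann-style sketch colors one independent set of vertices per round. Note that the per-vertex work varies with vertex degree, which is precisely the source of load imbalance the paper targets; all names below are ours, not the authors'.

```python
import random

def jp_coloring(adj):
    """Jones-Plassmann-style coloring (a common GPU-friendly scheme,
    written here as a sequential sketch). Each round, every uncolored
    vertex whose random priority beats all of its uncolored neighbours
    joins an independent set and takes the smallest colour unused by
    its neighbours.
    adj: dict mapping vertex -> set of neighbouring vertices."""
    priority = {v: random.random() for v in adj}
    colour = {}
    while len(colour) < len(adj):
        # Winners form an independent set: two adjacent vertices cannot
        # both have the higher priority, so they never win together.
        winners = [v for v in adj if v not in colour
                   and all(n in colour or priority[v] > priority[n]
                           for n in adj[v])]
        for v in winners:
            used = {colour[n] for n in adj[v] if n in colour}
            c = 0
            while c in used:
                c += 1
            colour[v] = c
    return colour

# Usage: a 4-cycle; adjacent vertices always receive different colours.
ring = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}
colours = jp_coloring(ring)
```

Because the highest-priority uncolored vertex always wins its round, the loop is guaranteed to terminate.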
{"title":"Graph Coloring on the GPU and Some Techniques to Improve Load Imbalance","authors":"Shuai Che, Gregory P. Rodgers, Bradford M. Beckmann, S. Reinhardt","doi":"10.1109/IPDPSW.2015.74","DOIUrl":"https://doi.org/10.1109/IPDPSW.2015.74","url":null,"abstract":"Graphics processing units (GPUs) have been increasingly used to accelerate irregular applications such as graph and sparse-matrix computation. Graph coloring is a key building block for many graph applications. The first step of many graph applications is graph coloring/partitioning to obtain sets of independent vertices for subsequent parallel computations. However, parallelization and optimization of coloring for GPUs have been a challenge for programmers. This paper studies approaches to implementing graph coloring on a GPU and characterizes their program behaviors with different graph structures. We also investigate load imbalance, which can be the main cause for performance bottlenecks. We evaluate the effectiveness of different optimization techniques, including the use of work stealing and the design of a hybrid algorithm. We are able to improve graph coloring performance by approximately 25% compared to a baseline GPU implementation on an AMD Radeon HD 7950 GPU. We also analyze some important factors affecting performance.","PeriodicalId":340697,"journal":{"name":"2015 IEEE International Parallel and Distributed Processing Symposium Workshop","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116518469","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 9
Performance Modeling of Multi-tiered Web Applications with Varying Service Demands
Pub Date : 2015-05-25 DOI: 10.1109/IPDPSW.2015.28
A. Kattepur, M. Nambiar
Multi-tiered transactional web applications are frequently used in enterprise systems. Due to their inherent distributed nature, pre-deployment testing for high availability and varying concurrency is important for post-deployment performance. Accurate performance modeling of such applications can help estimate values for future deployment variations as well as validate experimental results. In order to theoretically model the performance of multi-tiered applications, we use queuing networks and Mean Value Analysis (MVA) models. While MVA has been shown to work well with closed queuing networks, it has particular limitations in cases where the service demands vary with concurrency. This is further complicated by the use of multi-server queues in multi-core CPUs, which are not traditionally captured in MVA. We compare the performance of a multi-server MVA model against actual performance testing measurements and demonstrate this deviation. Using spline interpolation of collected service demands, we show that a modified version of the MVA algorithm (called MVASD) that accepts an array of service demands can provide superior estimates of maximum throughput and response time. Results are demonstrated over multi-tier vehicle insurance registration and e-commerce web applications. The mean deviations of predicted throughput and response time are shown to be less than 3% and 9%, respectively. Additionally, we analyze the effect of spline interpolation of service demands as a function of throughput on the prediction results.
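The MVA baseline the abstract refers to can be written down compactly. The sketch below is the textbook exact MVA recursion for a closed, single-class network with fixed per-station service demands; the paper's MVASD variant instead accepts service demands that vary with concurrency, which this illustrative sketch does not attempt to reproduce.

```python
def mva(demands, n_users, think_time=0.0):
    """Exact Mean Value Analysis for a closed, single-class queueing
    network of single-server queueing stations.
    demands: per-station service demands D_k in seconds.
    Returns (throughput, response_time) at the target population."""
    queues = [0.0] * len(demands)     # mean queue length per station
    throughput, resp = 0.0, 0.0
    for n in range(1, n_users + 1):
        # Arrival theorem: an arriving customer sees the steady-state
        # queue lengths of the same network with n - 1 customers.
        resid = [d * (1.0 + q) for d, q in zip(demands, queues)]
        resp = sum(resid)                        # total response time
        throughput = n / (think_time + resp)     # system throughput
        queues = [throughput * r for r in resid] # Little's law
    return throughput, resp

# Usage: two tiers with demands 0.10 s and 0.05 s. The 0.10 s
# bottleneck caps throughput near 10 requests/s at high concurrency.
x, r = mva([0.10, 0.05], n_users=50)
```

Plotting `x` against `n_users` reproduces the familiar knee at the bottleneck saturation point, which is what makes MVA useful for pre-deployment capacity estimates.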
{"title":"Performance Modeling of Multi-tiered Web Applications with Varying Service Demands","authors":"A. Kattepur, M. Nambiar","doi":"10.1109/IPDPSW.2015.28","DOIUrl":"https://doi.org/10.1109/IPDPSW.2015.28","url":null,"abstract":"Multi-tiered transactional web applications are frequently used in enterprise based systems. Due to their inherent distributed nature, pre-deployment testing for high-availability and varying concurrency are important for post-deployment performance. Accurate performance modeling of such applications can help estimate values for future deployment variations as well as validate experimental results. In order to theoretically model performance of multi-tiered applications, we use queuing networks and Mean Value Analysis (MVA) models. While MVA has been shown to work well with closed queuing networks, there are particular limitations in cases where the service demands vary with concurrency. This is further contrived by the use of multi-server queues in multi-core CPUs, that are not traditionally captured in MVA. We compare performance of a multi-server MVA model alongside actual performance testing measurements and demonstrate this deviation. Using spline interpolation of collected service demands, we show that a modified version of the MVA algorithm (called MVASD) that accepts an array of service demands, can provide superior estimates of maximum throughput and response time. Results are demonstrated over multi-tier vehicle insurance registration and e-commerce web applications. The mean deviations of predicted throughput and response time are shown to be less the 3% and 9%, respectively. 
Additionally, we analyze the effect of spline interpolation of service demands as a function of throughput on the prediction results.","PeriodicalId":340697,"journal":{"name":"2015 IEEE International Parallel and Distributed Processing Symposium Workshop","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115285543","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 19
Performance and Energy Efficient Asymmetrically Reliable Caches for Multicore Architectures
Pub Date : 2015-05-25 DOI: 10.1109/IPDPSW.2015.113
Sanem Arslan, H. Topcuoglu, M. Kandemir, Oguz Tosun
Modern architectures are increasingly susceptible to transient and permanent faults due to continuously decreasing transistor sizes and faster operating frequencies. The probability of soft error occurrence is relatively high in cache structures because of their large logic area compared to other parts of the chip. Applying fault tolerance indiscriminately to all caches imposes significant performance and energy overheads. In this study, we propose asymmetrically reliable caches that aim to provide the required reliability with just enough extra hardware under performance and energy constraints. In our framework, a chip multiprocessor consists of one reliability-aware core, whose data cache has ECC protection for critical data, and a set of less reliable cores with unprotected data caches to which noncritical data is mapped. The experimental results for selected applications show that our proposed technique provides 21% better reliability for only 6% more energy consumption compared to traditional caches.
{"title":"Performance and Energy Efficient Asymmetrically Reliable Caches for Multicore Architectures","authors":"Sanem Arslan, H. Topcuoglu, M. Kandemir, Oguz Tosun","doi":"10.1109/IPDPSW.2015.113","DOIUrl":"https://doi.org/10.1109/IPDPSW.2015.113","url":null,"abstract":"Modern architectures are increasingly susceptible to transient and permanent faults due to continuously decreasing transistor sizes and faster operating frequencies. The probability of soft error occurrence is relatively high on cache structures due to the large area of the logic compared to other parts. Applying fault tolerance unselectively for all caches has a significant overhead on performance and energy. In this study, we propose asymmetrically reliable caches aiming to provide required reliability using just enough extra hardware under the performance and energy constraints. In our framework, a chip multiprocessor consists of one reliability-aware core which has ECC protection on its data cache for critical data and a set of less reliable cores with unprotected data caches to map noncritical data. The experimental results for selected applications show that our proposed technique provides 21% better reliability for only 6% more energy consumption compared to traditional caches.","PeriodicalId":340697,"journal":{"name":"2015 IEEE International Parallel and Distributed Processing Symposium Workshop","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115715127","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
Firefly Inspired Improved Distributed Proximity Algorithm for D2D Communication
Pub Date : 2015-05-25 DOI: 10.1109/IPDPSW.2015.64
A. Pratap, R. Misra
Device-to-Device (D2D) communication underlaying cellular technology not only increases system capacity but also exploits the physical proximity of communicating devices to support services such as proximity services and traffic offload from the Base Station (BS). However, efficient proximity discovery and synchronization among devices pose new research challenges for cellular networks. Inspired by the synchronization behaviour of fireflies found in nature, previously reported algorithms based on bio-inspired firefly heuristics for synchronization and service-interest discovery among devices suffer from long convergence times and heavy message exchange. Therefore, we propose an improved O(n log n) distributed firefly algorithm for large-scale D2D networks using a tree-based topological mechanism and an RSSI-based ranging scheme.
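The firefly heuristic the abstract builds on can be illustrated with the classic pulse-coupled oscillator model. The sketch below is our own toy version of the base synchronization behaviour, with illustrative parameters; it is not the paper's improved O(n log n) distributed algorithm.

```python
def firefly_sync(phases, steps=5000, dt=0.01, nudge=0.05):
    """Toy pulse-coupled oscillator model in the Mirollo-Strogatz
    style: each oscillator's phase climbs toward 1; when any one fires
    (phase >= 1), the rest get a multiplicative nudge, and any phase
    pushed past the threshold fires too and is absorbed. Repeated
    firings pull the whole group into lockstep."""
    phases = list(phases)
    for _ in range(steps):
        phases = [p + dt for p in phases]
        if any(p >= 1.0 for p in phases):
            # Firing oscillators reset; nudged ones that would cross
            # the threshold are absorbed into the same reset.
            phases = [0.0 if p >= 1.0 or p * (1.0 + nudge) >= 1.0
                      else p * (1.0 + nudge)
                      for p in phases]
    return phases

# Usage: two oscillators started 0.3 apart end up firing together.
out = firefly_sync([0.0, 0.3])
```

The multiplicative (convex) nudge matters: a purely additive nudge keeps the phase gap constant, while this form makes the unsynchronized state unstable, so the group is eventually absorbed into a single firing time.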
{"title":"Firefly Inspired Improved Distributed Proximity Algorithm for D2D Communication","authors":"A. Pratap, R. Misra","doi":"10.1109/IPDPSW.2015.64","DOIUrl":"https://doi.org/10.1109/IPDPSW.2015.64","url":null,"abstract":"Device-to-Device (i.e. D2D) communication under-laying cellular technology not only increases system capacity but also utilizes the advantage of physical proximity of communicating devices to support services like proximity services, offload traffic from Base Station (i.e. BS) etc. But proximity discovery and synchronization among devices efficiently poses new research challenges for cellular networks. Inspired by the synchronization behaviour of fire fly found in nature, the reported algorithms based on bio-inspired firefly heuristics for synchronization among devices as well as service interest among them having drawback of large convergence time and large message exchanges. Therefore, we propose an improved O (n log n) distributed firefly algorithm for D2D large scale networks using tree based topological mechanism using RSSI based ranging scheme.","PeriodicalId":340697,"journal":{"name":"2015 IEEE International Parallel and Distributed Processing Symposium Workshop","volume":"162 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121742933","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 13
Decoupling Contention with Victim Row-Buffer on Multicore Memory Systems
Pub Date : 2015-05-25 DOI: 10.1109/IPDPSW.2015.30
Ke Gao, Dongrui Fan, Jie Wu, Zhiyong Liu
With continued performance scaling of many cores per chip, off-chip memory has increasingly become a system bottleneck due to inter-thread contention. The memory access streams emerging from many cores and simultaneously executing threads exhibit increasingly limited locality. Large, high-density DRAMs contribute significantly to system power consumption and data overfetch. We develop a fine-grained Victim Row-Buffer (VRB) memory system to increase the performance of the memory system. The VRB mechanism helps reuse the data accessed from the memory banks, avoids unnecessary data transfers, and mitigates memory contention, and thus can improve system throughput and fairness by decoupling row-buffer contention. Through full-system cycle-accurate simulations of many multithreaded applications, we demonstrate that our proposed VRB technique achieves up to 19% (8.4% on average) system-level throughput improvement and up to 20% (7.2% on average) system fairness improvement, and saves 6.8% of power consumption across the whole suite.
{"title":"Decoupling Contention with Victim Row-Buffer on Multicore Memory Systems","authors":"Ke Gao, Dongrui Fan, Jie Wu, Zhiyong Liu","doi":"10.1109/IPDPSW.2015.30","DOIUrl":"https://doi.org/10.1109/IPDPSW.2015.30","url":null,"abstract":"With continued performance scaling of many cores per chip, an on-chip, off-chip memory has increasingly become a system bottleneck due to inter-thread contention. The memory access streams emerging from many cores and the simultaneously executed threads, exhibit increasingly limited locality. Large and high-density DRAMs contribute significantly to system power consumption and data over fetch. We develop a fine-grained Victim Row-Buffer (VRB) memory system to increase performance of the memory system. The VRB mechanism helps reuse the data accessed from the memory banks, avoids unnecessary data transfers, mitigates memory contentions, and thus can improve system throughput and system fairness by decoupling row-buffer contentions. Through full-system cycle-accurate simulations of many threads applications, we demonstrate that our proposed VRB technique achieves an up to 19% (8.4% on average) system-level throughput improvement, an up to 20% (7.2% on average) system fairness improvement, and it saves 6.8% of power consumption across the whole suite.","PeriodicalId":340697,"journal":{"name":"2015 IEEE International Parallel and Distributed Processing Symposium Workshop","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123983184","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
Energy-Aware Server Provisioning by Introducing Middleware-Level Dynamic Green Scheduling
Pub Date : 2015-05-25 DOI: 10.1109/IPDPSW.2015.121
Daniel Balouek-Thomert, E. Caron, L. Lefèvre
Several approaches to reducing the power consumption of data centers have been described in the literature, most of which aim to improve energy efficiency by trading off performance for lower power consumption. However, these approaches do not always provide means for administrators and users to specify how they want to explore such trade-offs. This work provides techniques for assigning jobs to distributed resources, exploring energy-efficient resource provisioning. We use middleware-level mechanisms to adapt resource allocation according to energy-related events and user-defined rules. The proposed framework enables developers, users and system administrators to specify and explore energy efficiency and performance trade-offs without detailed knowledge of the underlying hardware platform. Evaluation of the proposed solution under three scheduling policies shows gains of 25% in energy efficiency with minimal impact on overall application performance. We also evaluate the reactivity of the adaptive resource provisioning.
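To make the idea of middleware-level, user-defined scheduling rules concrete, here is a deliberately small greedy dispatcher sketch. All names and data structures are our illustrative assumptions, not the paper's framework API.

```python
def dispatch(jobs, hosts, rule):
    """Toy greedy dispatcher: each job goes to the host with spare
    capacity that the user-defined rule scores best (lower is better).
    jobs:  list of (job_name, cores_needed)
    hosts: dict host_name -> {'free': cores, 'watts_per_core': power}
    rule:  callable(host_info) -> score"""
    placement = {}
    for job, cores in jobs:
        fitting = {h: info for h, info in hosts.items()
                   if info['free'] >= cores}
        if not fitting:
            placement[job] = None        # no capacity: defer the job
            continue
        best = min(fitting, key=lambda h: rule(fitting[h]))
        hosts[best]['free'] -= cores     # reserve the cores
        placement[job] = best
    return placement

# An energy-first rule prefers the lower marginal power draw; swapping
# in a different rule shifts the energy/performance trade-off without
# touching the dispatcher itself.
def green_rule(info):
    return info['watts_per_core']

hosts = {'a': {'free': 8, 'watts_per_core': 10.0},
         'b': {'free': 8, 'watts_per_core': 6.0}}
plan = dispatch([('j1', 4), ('j2', 6)], hosts, green_rule)
```

Here `j1` lands on the greener host `b`, after which `b` lacks capacity for `j2`, which falls back to `a`: the rule steers placement only within the capacity constraint.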
{"title":"Energy-Aware Server Provisioning by Introducing Middleware-Level Dynamic Green Scheduling","authors":"Daniel Balouek-Thomert, E. Caron, L. Lefèvre","doi":"10.1109/IPDPSW.2015.121","DOIUrl":"https://doi.org/10.1109/IPDPSW.2015.121","url":null,"abstract":"Several approaches to reduce the power consumption of data enters have been described in the literature, most of which aim to improve energy efficiency by trading off performance for reducing power consumption. However, these approaches do not always provide means for administrators and users to specify how they want to explore such trade-offs. This work provides techniques for assigning jobs to distributed resources, exploring energy efficient resource provisioning. We use middleware-level mechanisms to adapt resource allocation according to energy-related events and user-defined rules. A proposed framework enables developers, users and system administrators to specify and explore energy efficiency and performance trade-offs without detailed knowledge of the underlying hardware platform. Evaluation of the proposed solution under three scheduling policies shows gains of 25% in energy-efficiency with minimal impact on the overall application performance. We also evaluate reactivity in the adaptive resource provisioning.","PeriodicalId":340697,"journal":{"name":"2015 IEEE International Parallel and Distributed Processing Symposium Workshop","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129563381","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 8
Real-Time Multiprocessor Architecture for Sharing Stream Processing Accelerators
Pub Date : 2015-05-25 DOI: 10.1109/IPDPSW.2015.147
B. Dekens, M. Bekooij, G. Smit
Stream processing accelerators are often applied in MPSoCs for software-defined radios. Sharing these accelerators between different streams could improve their utilization and thereby reduce hardware cost, but doing so is challenging under real-time constraints. In this paper we introduce entry- and exit-gateways that are responsible for multiplexing blocks of data over accelerators under real-time constraints. These gateways check for the availability of sufficient data and space and thereby enable the derivation of a dataflow model of the application. The dataflow model is used to verify the worst-case temporal behaviour based on the sizes of the blocks of data used for multiplexing. We demonstrate that the required buffer capacities are non-monotone in the block size. Therefore, an ILP is presented to compute minimum block sizes and sufficient buffer capacities. The benefits of sharing accelerators are demonstrated using a multi-core system implemented on a Virtex 6 FPGA, in which a stereo audio stream from a PAL video signal is demodulated in real time while two accelerators are shared within and between two streams.
{"title":"Real-Time Multiprocessor Architecture for Sharing Stream Processing Accelerators","authors":"B. Dekens, M. Bekooij, G. Smit","doi":"10.1109/IPDPSW.2015.147","DOIUrl":"https://doi.org/10.1109/IPDPSW.2015.147","url":null,"abstract":"Stream processing accelerators are often applied in MPSoCs for software defined radios. Sharing of these accelerators between different streams could improve their utilization and reduce thereby the hardware cost but is challenging under real-time constraints. In this paper we introduce entry- and exit-gateways that are responsible for multiplexing blocks of data over accelerators under real-time constraints. These gateways check for the availability of sufficient data and space and thereby enable the derivation of a dataflow model of the application. The dataflow model is used to verify the worst-case temporal behaviour based on the sizes of the blocks of data used for multiplexing. We demonstrate that required buffer capacities are non-monotone in the block size. Therefore, an ILP is presented to compute minimum block sizes and sufficient buffer capacities. The benefits of sharing accelerators are demonstrated using a multi-core system that is implemented on a Virtex 6 FPGA. A stereo audio stream from a PAL video signal is demodulated in this system in real-time where two accelerators are shared within and between two streams. 
In this system sharing reduces the number of accelerators by 75% and reduced the number of logic cells with 63%.","PeriodicalId":340697,"journal":{"name":"2015 IEEE International Parallel and Distributed Processing Symposium Workshop","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129720659","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
Performance Evaluation of the EigenExa Eigensolver on Oakleaf-FX: Tridiagonalization Versus Pentadiagonalization
Pub Date : 2015-05-25 DOI: 10.1109/IPDPSW.2015.128
Takeshi Fukaya, Toshiyuki Imamura
The solution of real symmetric dense eigenvalue problems is one of the fundamental matrix computations. To date, several new high-performance eigensolvers have been developed for peta- and post-peta-scale systems. One of these, the EigenExa eigensolver, has been developed in Japan. EigenExa provides two routines: eigen_s, which is based on traditional tridiagonalization, and eigen_sx, which employs a new method via a pentadiagonal matrix. Recently, we conducted a detailed performance evaluation of EigenExa using 4,800 nodes of the Oakleaf-FX supercomputer system. In this paper, we report the results of our evaluation, which mainly focuses on investigating the differences between the two routines. The results clearly indicate both the advantages and disadvantages of eigen_sx over eigen_s, which will contribute to further performance improvement of EigenExa. The obtained results are also expected to be useful for other parallel dense matrix computations, in addition to eigenvalue problems.
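Tridiagonalization, the first stage of the classic dense symmetric eigensolver path the abstract compares against, can be sketched with plain Householder reflections. This toy version deliberately omits the blocking, communication, and distributed data layout that a real library such as EigenExa relies on for performance.

```python
import math

def tridiagonalize(a):
    """Householder reduction of a symmetric matrix to tridiagonal
    form. Returns T = Q^T A Q, similar to A, with the same
    eigenvalues concentrated on three diagonals.
    a: symmetric matrix as a list of row lists."""
    n = len(a)
    t = [row[:] for row in a]
    for k in range(n - 2):
        # Build a reflector that zeroes column k below the first
        # subdiagonal (sign chosen to avoid cancellation).
        x = [t[i][k] for i in range(k + 1, n)]
        alpha = -math.copysign(math.sqrt(sum(e * e for e in x)),
                               x[0] or 1.0)
        v = x[:]
        v[0] -= alpha                    # v = x - alpha * e1
        norm = math.sqrt(sum(e * e for e in v))
        if norm == 0.0:
            continue                     # column already reduced
        v = [e / norm for e in v]
        # Apply H = I - 2 v v^T on both sides: left-multiply then
        # transpose, twice, yields H * T * H for symmetric H.
        for _ in range(2):
            for j in range(n):
                s = 2.0 * sum(v[i] * t[k + 1 + i][j]
                              for i in range(len(v)))
                for i in range(len(v)):
                    t[k + 1 + i][j] -= s * v[i]
            t = [list(row) for row in zip(*t)]
    return t

# Usage: a small symmetric example. Similarity preserves the trace,
# and every entry beyond the first off-diagonals becomes (numerically)
# zero.
a = [[4.0, 1.0, 2.0, 2.0],
     [1.0, 3.0, 0.0, 1.0],
     [2.0, 0.0, 2.0, 1.0],
     [2.0, 1.0, 1.0, 5.0]]
t = tridiagonalize(a)
```

The pentadiagonal route evaluated by eigen_sx follows the same similarity-transform idea but stops at five diagonals, trading a cheaper reduction for a more involved subsequent solve.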
{"title":"Performance Evaluation of the Eigen Exa Eigensolver on Oakleaf-FX: Tridiagonalization Versus Pentadiagonalization","authors":"Takeshi Fukaya, Toshiyuki Imamura","doi":"10.1109/IPDPSW.2015.128","DOIUrl":"https://doi.org/10.1109/IPDPSW.2015.128","url":null,"abstract":"The solution of real symmetric dense Eigen value problems is one of the fundamental matrix computations. To date, several new high-performance Eigen solvers have been developed for peta and postpeta scale systems. One of these, the Eigen Exa Eigen solver, has been developed in Japan. Eigen Exa provides two routines: eigens, which is based on traditional tridiagonalization, and eigensx, which employs a new method via a pentadiagonal matrix. Recently, we conducted a detailed performance evaluation of Eigen Exa by using 4,800 nodes of the Oak leaf-FX supercomputer system. In this paper, we report the results of our evaluation, which is mainly focused on investigating the differences between the two routines. The results clearly indicate both the advantages and disadvantages of eigensx over eigens, which will contribute to further performance improvement of Eigen Exa. The obtained results are also expected to be useful for other parallel dense matrix computations, in addition to Eigen value problems.","PeriodicalId":340697,"journal":{"name":"2015 IEEE International Parallel and Distributed Processing Symposium Workshop","volume":"69 1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130439804","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 8
Modeling Cooperative Threads to Project GPU Performance for Adaptive Parallelism
Pub Date : 2015-05-25 DOI: 10.1109/IPDPSW.2015.55
Jiayuan Meng, T. Uram, V. Morozov, V. Vishwanath, Kalyan Kumaran
Most accelerators, such as graphics processing units (GPUs) and vector processors, are particularly suitable for accelerating massively parallel workloads. On the other hand, conventional workloads are developed for multi-core parallelism, which often scales to only a few dozen OpenMP threads. When hardware threads significantly outnumber the degree of parallelism in the outer loop, programmers are challenged to utilize the hardware efficiently. A common solution is to further exploit the parallelism hidden deep in the code structure. Such parallelism is less structured: parallel and sequential loops may be imperfectly nested within each other, and neighboring inner loops may exhibit different concurrency patterns (e.g. Reduction vs. Forall), yet have to be parallelized in the same parallel section. Many input-dependent transformations have to be explored. A programmer often employs a larger group of hardware threads to cooperatively walk through a smaller outer-loop partition and adaptively exploit any encountered parallelism. This process is time-consuming and error-prone, yet the risk of gaining little or no performance remains high for such workloads. To reduce risk and guide implementation, we propose a technique to model workloads with limited parallelism that can automatically explore and evaluate transformations involving cooperative threads. Eventually, our framework projects the best achievable performance and the most promising transformations without implementing GPU code or using physical hardware. We envision our technique being integrated into future compilers or optimization frameworks for autotuning.
Improving Performance of Structured-Memory, Data-Intensive Applications on Multi-core Platforms via a Space-Filling Curve Memory Layout
Pub Date : 2015-05-25 DOI: 10.1109/IPDPSW.2015.71
E. W. Bethel, David Camp, D. Donofrio, Mark Howison
Many data-intensive algorithms -- particularly in visualization, image processing, and data analysis -- operate on structured data, that is, data organized in multidimensional arrays. While many of these algorithms are quite numerically intensive, by and large, their performance is limited by the cost of memory accesses. As we move towards the exascale regime of computing, one central research challenge is finding ways to minimize data movement through the memory hierarchy, particularly within a node in a shared-memory parallel setting. We study the effects that an alternative in-memory data layout format has in terms of runtime performance gains resulting from reducing the amount of data moved through the memory hierarchy. We focus the study on shared-memory parallel implementations of two algorithms common in visualization and analysis: a stencil-based convolution kernel, which uses a structured memory access pattern, and ray casting volume rendering, which uses a semi-structured memory access pattern. The question we study is to better understand to what degree an alternative memory layout, when used by these key algorithms, will result in improved runtime performance and memory system utilization. Our approach uses a layout based on a Z-order (Morton-order) space-filling curve data organization, and we measure and report runtime and various metrics and counters associated with memory system utilization. Our results show nearly uniform improvements in runtime performance and in utilization of the memory hierarchy across varying levels of concurrency for the applications we tested. This approach is complementary to other memory optimization strategies like cache blocking, but may also be more general and widely applicable to a diverse set of applications.
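The Z-order (Morton-order) layout at the heart of this study is easy to sketch. Below is a minimal, illustrative implementation (the function names are ours, not the paper's): the bits of the x and y array indices are interleaved to form the 1-D offset, so elements that are close in 2-D space tend to land close together in memory.

```python
def morton_encode_2d(x, y, bits=16):
    """Interleave the low `bits` bits of x and y: bit b of x goes to bit 2b
    of the result, bit b of y to bit 2b+1."""
    z = 0
    for b in range(bits):
        z |= ((x >> b) & 1) << (2 * b)       # x bits occupy even positions
        z |= ((y >> b) & 1) << (2 * b + 1)   # y bits occupy odd positions
    return z

def morton_layout(array_2d):
    """Flatten a square 2-D list into a 1-D buffer in Z-order.

    The side length must be a power of two for the curve to visit every
    slot of the buffer exactly once.
    """
    n = len(array_2d)
    flat = [None] * (n * n)
    for y in range(n):
        for x in range(n):
            flat[morton_encode_2d(x, y)] = array_2d[y][x]
    return flat
```

For a power-of-two side length the mapping is a bijection, so the buffer has no holes. Production implementations typically replace the bit loop with magic-number bit tricks or the BMI2 `pdep` instruction; the per-bit loop above is chosen for clarity, not speed.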