
2017 29th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD): Latest Publications

Resource-Management Study in HPC Runtime-Stacking Context
Arthur Loussert, Benoit Welterlen, Patrick Carribault, Julien Jaeger, Marc Pérache, R. Namyst
With the advent of multicore and manycore processors as building blocks of HPC supercomputers, many applications are shifting from relying solely on a distributed programming model (e.g., MPI) to mixing distributed and shared-memory models (e.g., MPI+OpenMP), to better exploit shared-memory communications and reduce the overall memory footprint. One side effect of this programming approach is runtime stacking: mixing multiple models requires multiple runtime libraries to be alive at the same time and to share the underlying computing resources. This paper explores the different configurations in which this stacking may appear and introduces algorithms to detect the misuse of compute resources when running a hybrid parallel application. We have implemented our algorithms inside a dynamic tool that monitors applications and reports resource usage to the user. We validated this tool on applications from the CORAL benchmarks; it produces information that can be used to improve runtime placement, at an average overhead below 1% of total execution time.
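The detection idea lends itself to a small illustration. Below is a minimal sketch (not the paper's tool) that inspects the per-thread CPU affinity masks of a Linux process via procfs and flags cores pinned by several threads or usable by none; the reporting heuristics are assumptions of this sketch.

```python
# Illustrative sketch: flag compute-resource misuse in a hybrid MPI+OpenMP run
# by inspecting per-thread CPU affinity masks on Linux. The procfs layout
# (/proc/<pid>/task/<tid>/status) is standard; the "oversubscribed"/"idle"
# heuristics below are assumptions of this sketch, not the paper's algorithm.
import os
from collections import Counter

def parse_cpu_list(text):
    """Parse a cpulist such as '0-3,8' into a set of CPU ids."""
    cpus = set()
    for part in text.split(","):
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus

def thread_affinities(pid):
    """Return {tid: set_of_allowed_cpus} for every thread of a process."""
    masks = {}
    for tid in os.listdir(f"/proc/{pid}/task"):
        with open(f"/proc/{pid}/task/{tid}/status") as f:
            for line in f:
                if line.startswith("Cpus_allowed_list:"):
                    masks[int(tid)] = parse_cpu_list(line.split(":")[1].strip())
    return masks

def report_misuse(pid, ncpus=os.cpu_count()):
    """Report cores pinned by several threads and cores no thread may use."""
    masks = thread_affinities(pid)
    pinned = Counter(cpu for m in masks.values() if len(m) == 1 for cpu in m)
    oversubscribed = [c for c, n in pinned.items() if n > 1]
    used = set().union(*masks.values()) if masks else set()
    idle = sorted(set(range(ncpus)) - used)
    print(f"{len(masks)} threads; oversubscribed cores: {oversubscribed}; "
          f"cores no thread may use: {idle}")

if __name__ == "__main__":
    report_misuse(os.getpid())
```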
{"title":"Resource-Management Study in HPC Runtime-Stacking Context","authors":"Arthur Loussert, Benoit Welterlen, Patrick Carribault, Julien Jaeger, Marc Pérache, R. Namyst","doi":"10.1109/SBAC-PAD.2017.30","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2017.30","url":null,"abstract":"With the advent of multicore and manycore processors as building blocks of HPC supercomputers, many applications shift from relying solely on a distributed programming model (e.g., MPI) to mixing distributed and shared-memory models (e.g., MPI+OpenMP), to better exploit shared-memory communications and reduce the overall memory footprint. One side effect of this programming approach is runtime stacking: mixing multiple models involve various runtime libraries to be alive at the same time and to share the underlying computing resources. This paper explores different configurations where this stacking may appear and introduces algorithms to detect the misuse of compute resources when running a hybrid parallel application. We have implemented our algorithms inside a dynamic tool that monitors applications and outputs resource usage to the user. We validated this tool on applications from CORAL benchmarks. This leads to relevant information which can be used to improve runtime placement, and to an average overhead lower than 1% of total execution time.","PeriodicalId":187204,"journal":{"name":"2017 29th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126569861","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Cloud Workload Prediction and Generation Models
Gilles Madi-Wamba, Yunbo Li, Anne-Cécile Orgerie, Nicolas Beldiceanu, Jean-Marc Menaud
Cloud computing allows for elasticity, as users can dynamically benefit from new virtual resources when their workload increases. Such a feature requires highly reactive resource-provisioning mechanisms. In this paper, we propose two new workload prediction models, based on constraint programming and neural networks, that can be used for dynamic resource provisioning in Cloud environments. We also present two workload trace generators that can help extend an experimental dataset in order to test resource-optimization heuristics more widely. Our models are validated using real traces from a small Cloud provider. The two approaches prove complementary: neural networks give better prediction results, while constraint programming is more suitable for trace generation.
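As a concrete illustration of the neural-network half of the proposal (the paper's exact architecture and its constraint-programming model are not described in the abstract), the sketch below trains a one-hidden-layer regressor on a sliding window of past load samples and forecasts the next one.

```python
# Minimal sketch, not the paper's model: predict the next (normalised)
# workload sample from a sliding window of past samples with a tiny
# one-hidden-layer network trained by plain gradient descent.
import numpy as np

def make_dataset(trace, window=8):
    X = np.array([trace[i:i + window] for i in range(len(trace) - window)])
    y = np.array(trace[window:])
    return X, y

def train_predictor(trace, window=8, hidden=16, epochs=500, lr=1e-3, seed=0):
    rng = np.random.default_rng(seed)
    X, y = make_dataset(trace, window)
    W1 = rng.normal(0, 0.1, (window, hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(0, 0.1, hidden); b2 = 0.0
    for _ in range(epochs):
        h = np.tanh(X @ W1 + b1)                # forward pass
        pred = h @ W2 + b2
        err = pred - y                          # gradient of 0.5*MSE
        gW2 = h.T @ err / len(y); gb2 = err.mean()
        dh = np.outer(err, W2) * (1 - h ** 2)   # backprop through tanh
        gW1 = X.T @ dh / len(y); gb1 = dh.mean(axis=0)
        W1 -= lr * gW1; b1 -= lr * gb1; W2 -= lr * gW2; b2 -= lr * gb2
    return lambda hist: float(
        np.tanh(np.asarray(hist[-window:]) @ W1 + b1) @ W2 + b2)

# Usage: fit on a load trace (here synthetic, scaled to [0,1]), then forecast
# the next sample to drive provisioning (e.g., scale out above a threshold).
trace = list(np.sin(np.linspace(0, 20, 300)) * 0.4 + 0.5)
predict = train_predictor(trace)
print("next predicted load:", predict(trace))
```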
{"title":"Cloud Workload Prediction and Generation Models","authors":"Gilles Madi-Wamba, Yunbo Li, Anne-Cécile Orgerie, Nicolas Beldiceanu, Jean-Marc Menaud","doi":"10.1109/SBAC-PAD.2017.19","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2017.19","url":null,"abstract":"Cloud computing allows for elasticity as users can dynamically benefit from new virtual resources when their workload increases. Such a feature requires highly reactive resource provisioning mechanisms. In this paper, we propose two new workload prediction models, based on constraint programming and neural networks, that can be used for dynamic resource provisioning in Cloud environments. We also present two workload trace generators that can help to extend an experimental dataset in order to test more widely resource optimization heuristics. Our models are validated using real traces from a small Cloud provider. Both approaches are shown to be complimentary as neural networks give better prediction results, while constraint programming is more suitable for trace generation.","PeriodicalId":187204,"journal":{"name":"2017 29th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131582284","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 24
Beyond the Fog: Bringing Cross-Platform Code Execution to Constrained IoT Devices
F. Pisani, Jeferson Rech Brunetta, Vanderson Martins do Rosário, E. Borin
Considering the prediction that there will be over 50 billion devices connected to the Internet of Things (IoT) in the near future, the demand for efficient ways to process the data streams generated by sensors grows ever larger, highlighting the need to re-evaluate current approaches such as sending all data to the cloud for processing and analysis. In this paper, we explore one method for improving this scenario: bringing the computation closer to the data sources. By executing code on the IoT devices themselves instead of on the network edge or in the cloud, solutions can better meet the latency requirements of several applications, avoid problems with slow and intermittent network connections, prevent network congestion, and potentially save energy by reducing communication. To this end, we propose the LMC framework and compare it with Edgent, an open-source project under development by the Apache Incubator. Using a DragonBoard 410c to execute a simple filter, an outlier detector, and a program that calculates the FFT, we obtained results indicating that LMC outperforms Edgent when dynamic translation is disabled for both, and is otherwise more suitable for lightweight, quick queries. More importantly, LMC also enables cross-platform code execution on small, cheap devices that do not have enough resources to run Edgent, like the NodeMCU 1.0.
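For a feel of the near-data processing this enables, here is an illustrative stand-in for two of the benchmarked kernels: a threshold filter and a rolling z-score outlier detector compact enough for a constrained device (the LMC framework's actual API is not shown in the abstract).

```python
# Sketch of on-device stream processing: filter and flag sensor samples
# before they ever leave the device. These filters are illustrative
# stand-ins, not code from the LMC framework or Edgent.
from collections import deque
from math import sqrt

def threshold_filter(stream, lo, hi):
    """Drop samples outside [lo, hi] at the source."""
    for x in stream:
        if lo <= x <= hi:
            yield x

def zscore_outliers(stream, window=32, k=3.0):
    """Flag samples more than k rolling standard deviations from the mean."""
    buf = deque(maxlen=window)
    for x in stream:
        if len(buf) == window:
            mean = sum(buf) / window
            var = sum((v - mean) ** 2 for v in buf) / window
            if var > 0 and abs(x - mean) > k * sqrt(var):
                yield x                  # only outliers get transmitted
        buf.append(x)

sensor = [20.1, 20.3, 19.8] * 20 + [55.0]   # synthetic temperature trace
print(list(zscore_outliers(threshold_filter(sensor, -40, 85), window=16)))
```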
{"title":"Beyond the Fog: Bringing Cross-Platform Code Execution to Constrained IoT Devices","authors":"F. Pisani, Jeferson Rech Brunetta, Vanderson Martins do Rosário, E. Borin","doi":"10.1109/SBAC-PAD.2017.10","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2017.10","url":null,"abstract":"Considering the prediction that there will be over 50 billion devices connected to the Internet of Things (IoT) in the near future, the demand for efficient ways to process data streams generated by sensors grows ever larger, highlighting the necessity to re-evaluate current approaches, such as sending all data to the cloud for processing and analysis.In this paper, we explore one of the methods for improving this scenario: bringing the computation closer to data sources. By executing the code on the IoT devices themselves instead of on the network edge or the cloud, solutions can better meet the latency requirements of several applications, avoid problems with slow and intermittent network connections, prevent network congestion, and potentially save energy by reducing communication.To this end, we propose the LMC framework and compare it with Edgent, an open-source project that is under development by the Apache Incubator. By using a DragonBoard 410c to execute a simple filter, an outlier detector, and a program that calculates the FFT, we obtained results that indicate that LMC outperforms Edgent when dynamic translation is disabled for both of them and is more suitable for lightweight quick queries otherwise. More importantly, the LMC also enables us to perform cross-platform code execution on small, cheap devices that do not have enough resources to run Edgent, like the NodeMCU 1.0.","PeriodicalId":187204,"journal":{"name":"2017 29th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116869302","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 14
GC-CR: A Decentralized Garbage Collector Component for Checkpointing in Clouds
Thouraya Louati, Heithem Abbes, C. Cérin, M. Jemni
Infrastructure-as-a-Service container-based virtualization technology is gaining significant interest in industry as an alternative platform for running distributed applications. With the increasing scale of Cloud computing architectures, faults are becoming a frequent occurrence, and Checkpoint-Restart is a key method for surviving failures in this context. However, since the Cloud is based on the pay-as-you-go model, there is a need to reduce the amount of checkpointing data. This paper addresses the issue of garbage collection in LXCloud-CR and contributes a novel decentralized garbage-collection component, GC-CR. LXCloud-CR, a decentralized Checkpoint-Restart model, takes snapshots of Linux Container instances and uses replication to increase snapshot availability, with a versioning scheme for each replica. The drawback is that snapshot availability suffers as the number of obsolete version files grows. GC-CR is a decentralized garbage-collector (checkpoint-deletion) component that identifies and eliminates old snapshot versions from the system in order to free storage space. Large-scale experiments on the Grid5000 testbed demonstrate the benefits of our proposal: the obtained results validate our model and show a significant reduction in storage-space consumption.
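The pruning decision at the heart of such a collector can be sketched as follows; GC-CR's real protocol is decentralized, and the `<container>.v<version>` file-naming convention below is an assumption of this illustration.

```python
# Illustrative sketch of snapshot-version pruning (not GC-CR's protocol):
# keep the newest `keep` versions of each container snapshot, delete the
# rest, and report the storage space freed. The naming convention
# <container>.v<version> is an assumption of this sketch.
import os

def prune_snapshots(snapshot_dir, keep=2):
    """Group files named <container>.v<version> and drop old versions."""
    versions = {}
    for name in os.listdir(snapshot_dir):
        base, _, ver = name.rpartition(".v")
        if ver.isdigit():
            versions.setdefault(base, []).append((int(ver), name))
    freed = 0
    for base, vs in versions.items():
        for _, name in sorted(vs)[:-keep]:       # all but the newest `keep`
            path = os.path.join(snapshot_dir, name)
            freed += os.path.getsize(path)
            os.remove(path)
    return freed

# Usage (hypothetical path): freed = prune_snapshots("/var/lib/snapshots", 2)
```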
{"title":"GC-CR: A Decentralized Garbage Collector Component for Checkpointing in Clouds","authors":"Thouraya Louati, Heithem Abbes, C. Cérin, M. Jemni","doi":"10.1109/SBAC-PAD.2017.20","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2017.20","url":null,"abstract":"Infrastructure-as-a-Service container-based virtualization technology is gaining significant interest in industry as an alternative platform for running distributed applications. With increasing scale of Cloud Computing architectures, faults are becoming a frequent occurrence. Checkpoint-Restart is a key method to survive to failures in this context. However, there is a need to reduce the amount of checkpointing data as the Cloud is based on the pay-as-you-go model. This paper addresses the issue of garbage collection in LXCloud-CR and contributes with a novel decentralized garbage collection component GC-CR. LXCloud-CR, a decentralized Checkpoint-Restart model, is able to take snapshots of Linux Container instances and it uses replication to increase snapshots availability. LXCloud-CR contains a versioning scheme for each replica. The disadvantage refers to snapshots availability issues with versioning as the number of useless files grows. GC-CR is a decentralized garbage collector (checkpoint deletion) component that attempts to identify and eliminate old snapshots versions from the system in order to free storage space. Large scale experiments on the Grid5000 testbed demonstrate the benefits of our proposal. Obtained results validate our model and show significant reduction of storage space consumption","PeriodicalId":187204,"journal":{"name":"2017 29th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"152 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114378165","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
Scalability of CPU and GPU Solutions of the Prime Elliptic Curve Discrete Logarithm Problem
J. Panetta, P. S. Filho, Luiz A. F. Laranjeira, Carlos A. Teixeira
Elliptic curve asymmetric cryptography has achieved increased popularity due to its capability of providing levels of security comparable to those of other existing cryptographic systems while requiring less computational work. Pollard Rho and Parallel Collision Search, the fastest known sequential and parallel algorithms for breaking this cryptographic system, have been successfully applied over time to break ever-increasing bit-length system instances, using implementations heavily optimized for the available hardware. This work presents portable, general implementations of a Parallel Collision Search based solution for prime elliptic curve asymmetric cryptographic systems that use publicly available big-integer libraries and make no assumptions about prime-curve properties. It investigates which bit-length keys can be broken in reasonable time by a user with access to state-of-the-art public HPC equipment with CPUs and GPUs. The final implementation breaks a 79-bit system in about two hours using 80 GPUs and a 94-bit system in about 15 hours using 256 GPUs. Extensive experimentation investigates the scalability of CPU, GPU, and CPU+GPU runs. The results indicate that speed-up is not a good metric for parallel scalability; this paper proposes and evaluates a new metric better suited for this task.
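To make the attack concrete, the sketch below runs the collision-search idea on a textbook curve (y² = x³ + 2x + 2 over GF(17), base point of prime order 19): additive random walks record "distinguished" points in a shared table, and a collision between two walks yields the discrete log. Real attacks distribute millions of such walks across CPUs and GPUs; the toy parameters here are for illustration only.

```python
# Toy sketch of Parallel Collision Search on a textbook curve; not the
# paper's optimized implementation. Walk partitioning and the x % 4 == 0
# "distinguished point" rule are illustrative choices.
import random

P_MOD, A = 17, 2
N = 19                                   # prime order of the base point
INF = None

def add(p, q):
    """Elliptic-curve point addition over GF(P_MOD)."""
    if p is INF: return q
    if q is INF: return p
    (x1, y1), (x2, y2) = p, q
    if x1 == x2 and (y1 + y2) % P_MOD == 0: return INF
    if p == q:
        lam = (3 * x1 * x1 + A) * pow(2 * y1, -1, P_MOD) % P_MOD
    else:
        lam = (y2 - y1) * pow(x2 - x1, -1, P_MOD) % P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    return (x3, (lam * (x1 - x3) - y1) % P_MOD)

def mul(k, p):
    r = INF
    while k:
        if k & 1: r = add(r, p)
        p, k = add(p, p), k >> 1
    return r

def solve_dlog(P, Q):
    """Find d with Q = d*P via additive random walks + distinguished points."""
    table = {}                           # distinguished point -> (a, b)
    while True:
        a, b = random.randrange(N), random.randrange(N)
        X = add(mul(a, P), mul(b, Q))    # walk invariant: X = a*P + b*Q
        for _ in range(200):
            if X is INF: break
            if X[0] % 4 == 0:            # distinguished point: store/collide
                if X in table:
                    a2, b2 = table[X]
                    if (b - b2) % N:     # a*P+b*Q = a2*P+b2*Q  =>  solve d
                        return (a2 - a) * pow(b - b2, -1, N) % N
                table[X] = (a, b)
                break
            h = X[0] % 3                 # partition the walk into 3 rules
            if h == 0:   X, a = add(X, P), (a + 1) % N
            elif h == 1: X, b = add(X, Q), (b + 1) % N
            else:        X, a, b = add(X, X), 2 * a % N, 2 * b % N

P = (5, 1)
Q = mul(13, P)
print("recovered d =", solve_dlog(P, Q))   # expect 13
```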
{"title":"Scalability of CPU and GPU Solutions of the Prime Elliptic Curve Discrete Logarithm Problem","authors":"J. Panetta, P. S. Filho, Luiz A. F. Laranjeira, Carlos A. Teixeira","doi":"10.1109/SBAC-PAD.2017.12","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2017.12","url":null,"abstract":"Elliptic curve asymmetric cryptography has achieved increased popularity due to its capability of providing comparable levels of security as other existing cryptographic systems while requiring less computational work. Pollard Rho and Parallel Collision Search, the fastest known sequential and parallel algorithms for breaking this cryptographic system, have been successfully applied over time to break ever-increasing bit-length system instances using implementations heavily optimized for the available hardware. This work presents portable, general implementations of a Parallel Collision Search based solution for prime elliptic curve asymmetric cryptographic systems that use publicly available big integer libraries and make no assumption on prime curve properties. It investigates which bit-length keys can be broken in reasonable time by a user that has access to a state of the art, public HPC equipment with CPUs and GPUs. The final implementation breaks a 79-bit system in about two hours using 80 GPUs and 94-bits system in about 15 hours using 256 GPUs. Extensive experimentation investigates scalability of CPU, GPU and CPU+GPU runs. The discussed results indicate that speed-up is not a good metric for parallel scalability. This paper proposes and evaluates a new metric that is better suited for this task.","PeriodicalId":187204,"journal":{"name":"2017 29th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124843895","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Extending OmpSs for OpenCL Kernel Co-Execution in Heterogeneous Systems
Borja Pérez, Esteban Stafford, J. L. Bosque, R. Beivide, Sergi Mateo, Xavier Teruel, X. Martorell, E. Ayguadé
Heterogeneous systems have very high potential performance but present difficulties in their programming. OmpSs is a well-known framework for task-based parallel applications and an interesting tool to simplify the programming of these systems. However, it does not support the co-execution of a single OpenCL kernel instance on several compute devices. To overcome this limitation, this paper presents an extension of the OmpSs framework that addresses two main objectives: the automatic division of datasets among several devices and the management of their memory address spaces. To adapt to different kinds of applications, the data division can be performed by the novel HGuided load-balancing algorithm or by the well-known Static and Dynamic algorithms. All this is accomplished with negligible impact on programming effort. Experimental results reveal that there is always one load-balancing algorithm that improves the performance and energy consumption of the system.
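A guided-style partitioning rule can be sketched as below; the abstract does not give the HGuided formula, so the power-weighted, shrinking-chunk rule here is an assumption, and a real runtime would hand chunks to devices dynamically as they finish rather than precomputing them.

```python
# Sketch of guided-style co-execution partitioning (assumed rule, not the
# published HGuided formula): each device takes a chunk proportional to its
# relative compute power and to the remaining work, so chunks shrink toward
# the end of the iteration space and stragglers are less costly.
def hguided_chunks(total, powers, min_chunk=64):
    """Yield (device, offset, size) assignments for `total` work-items."""
    remaining, offset = total, 0
    order = sorted(powers, key=powers.get, reverse=True)
    while remaining > 0:
        for dev in order:
            if remaining <= 0:
                break
            rel = powers[dev] / sum(powers.values())
            size = max(min_chunk, int(remaining * rel / 2))
            size = min(size, remaining)
            yield dev, offset, size
            offset += size
            remaining -= size

# Usage: a device 4x faster than the CPU gets proportionally larger chunks.
for dev, off, sz in hguided_chunks(4096, {"gpu": 4.0, "cpu": 1.0}):
    print(f"{dev}: items [{off}, {off + sz})")
```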
{"title":"Extending OmpSs for OpenCL Kernel Co-Execution in Heterogeneous Systems","authors":"Borja Pérez, Esteban Stafford, J. L. Bosque, R. Beivide, Sergi Mateo, Xavier Teruel, X. Martorell, E. Ayguadé","doi":"10.1109/SBAC-PAD.2017.8","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2017.8","url":null,"abstract":"Heterogeneous systems have a very high potential performance but present difficulties in their programming. OmpSs is a well known framework for task based parallel applications, which is an interesting tool to simplify the programming of these systems. However, it does not support the co-execution of a single OpenCL kernel instance on several compute devices. To overcome this limitation, this paper presents an extension of the OmpSs framework that solves two main objectives: the automatic division of datasets among several devices and the management of their memory address spaces. To adapt to different kinds of applications, the data division can be performed by the novel HGuided load balancing algorithm or by the well known Static and Dynamic. All this is accomplished with negligible impact on the programming. Experimental results reveal that there is always one load balancing algorithm that improves the performance and energy consumption of the system.","PeriodicalId":187204,"journal":{"name":"2017 29th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129654589","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
Exploiting Data Compression to Mitigate Aging in GPU Register Files
F. Candel, A. Valero, S. Petit, D. S. Gracia, J. Sahuquillo
Nowadays, GPUs sit at the forefront of high-performance computing thanks to their massive computational capabilities. Internally, thousands of functional units, architected to be fed by large register files, fuel such performance. At nanometer technologies, the SRAM cells that implement register files suffer from the Negative Bias Temperature Instability (NBTI) effect, which degrades the transistor threshold voltage Vth and, in turn, can make cells unreliable when they hold the same logic value for long periods of time. Fortunately, the GPU single-thread multiple-data execution model writes data in recognizable patterns. This work proposes mechanisms to detect those patterns and to compress and shuffle the data, so that compressed register-file entries can be safely powered off, mitigating NBTI aging. Experimental results show that a conventional GPU register file experiences the worst case for NBTI, since it holds cells at a single logic value during the entire application execution (i.e., 100% duty-cycle distributions for 0 and 1). On average, the proposal reduces these distributions by 61% and 72%, respectively, which translates into Vth degradation savings of 57% and 64%, respectively.
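The two ingredients the paper combines can be illustrated on synthetic data: detecting a compressible write pattern (all lanes of a warp holding the same value, so one lane plus a flag suffices) and measuring per-bit duty cycles, the quantity NBTI aging depends on. The 32-lane/32-bit layout and the all-lanes-equal pattern are assumptions of this sketch.

```python
# Illustrative sketch, not the paper's hardware mechanism: (1) detect a
# value-compressible warp write, (2) compute per-bit duty cycles (fraction
# of time each bit holds a logic '1'), which drives NBTI degradation.
def compressible(warp_values):
    """True if every lane wrote the same value (value compression applies)."""
    return len(set(warp_values)) == 1

def duty_cycles(samples, bits=32):
    """Fraction of samples in which each bit position holds a logic '1'."""
    ones = [0] * bits
    for word in samples:
        for b in range(bits):
            ones[b] += (word >> b) & 1
    return [c / len(samples) for c in ones]

writes = [[7] * 32, [7] * 32, list(range(32))]     # two uniform warps, one not
print([compressible(w) for w in writes])           # [True, True, False]
print(duty_cycles([0xFFFFFFFF, 0xFFFFFFFF])[:4])   # bits stuck at '1' -> 1.0
```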
{"title":"Exploiting Data Compression to Mitigate Aging in GPU Register Files","authors":"F. Candel, A. Valero, S. Petit, D. S. Gracia, J. Sahuquillo","doi":"10.1109/SBAC-PAD.2017.15","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2017.15","url":null,"abstract":"Nowadays, GPUs sit at the forefront of highperformance computing thanks to their massive computational capabilities. Internally, thousands of functional units, architected to be fed by large register files, fuel such a performance.At nanometer technologies, the SRAM cells that implement register files suffer the Negative Bias Temperature Instability (NBTI) effect, which degrades the transistor threshold voltage Vth and, in turn, can make cells faulty unreliable when they hold the same logic value for long periods of time.Fortunately, the GPU single-thread multiple-data execution model writes data in recognizable patterns. This work proposes mechanisms to detect those patterns, and to compress and shuffle the data, so that compressed register file entries can be safely powered off, mitigating NBTI aging.Experimental results show that a conventional GPU register file experiences the worst case for NBTI, since maintains cells with a single logic value during the entire application execution (i.e., a 100% 0 and 1 duty cycle distributions). On average, the proposal reduces these distributions by 61% and 72%, respectively, which translates into Vth degradation savings by 57% and 64%, respectively.","PeriodicalId":187204,"journal":{"name":"2017 29th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130829937","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Addressing Energy Challenges in Filter Caches
Ricardo Alves, Nikos Nikoleris, S. Kaxiras, D. Black-Schaffer
Filter caches and way-predictors are common approaches to improve the efficiency and/or performance of first-level caches. Filter caches use a small L0 to provide more efficient and faster access to a small subset of the data, and work well for programs with high locality. Way-predictors improve efficiency by accessing only the predicted way, which alleviates the need to read all ways in parallel without increasing latency, but hurts performance due to mispredictions. In this work we examine how SRAM layout constraints (h-trees and data mapping inside the cache) affect way-predictors and filter caches. We show that accessing the smaller L0 array can be significantly more energy efficient than attempting to read fewer ways from a larger L1 cache, and that the main source of energy inefficiency in filter caches comes from L0 and L1 misses. We propose a filter-cache optimization that shares the tag array between the L0 and the L1, which incurs the overhead of reading the larger tag array on every access but, in return, allows us to directly access the correct L1 way on each L0 miss. This optimization adds no extra latency and, counter-intuitively, improves the filter cache's overall energy efficiency beyond that of the way-predictor. By combining the low-power benefits of a physically smaller L0 with the reduction in miss energy obtained by reading L1 tags upfront in parallel with the L0 data, we show that the optimized filter cache reduces dynamic cache energy by 26% compared to a traditional filter cache while providing the same performance advantage. Compared to a way-predictor, the optimized cache improves performance by 6% and energy by 2%.
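The access-flow change can be captured in a back-of-the-envelope energy model; the relative energy weights below are invented for illustration and are not the paper's figures.

```python
# Sketch of the access-flow change the paper proposes, with made-up relative
# energy weights: read L1 tags in parallel with the L0 data array, so an L0
# miss goes straight to the one matching L1 way instead of reading all ways.
L0_READ, L1_TAG, L1_WAY = 1.0, 0.4, 2.0     # assumed relative energy costs
L1_WAYS = 4

def access_energy(l0_hits, accesses):
    """Average dynamic energy per access for three L1 front-end designs."""
    miss = accesses - l0_hits
    classic_filter = accesses * L0_READ + miss * (L1_TAG + L1_WAYS * L1_WAY)
    shared_tags    = accesses * (L0_READ + L1_TAG) + miss * L1_WAY
    no_filter      = accesses * (L1_TAG + L1_WAYS * L1_WAY)
    return {name: total / accesses for name, total in
            {"classic filter": classic_filter,
             "shared-tag filter": shared_tags,
             "no filter": no_filter}.items()}

# High-locality workload: the shared-tag design wins despite reading L1 tags
# on every access, because misses touch exactly one way.
print(access_energy(l0_hits=85, accesses=100))
```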
{"title":"Addressing Energy Challenges in Filter Caches","authors":"Ricardo Alves, Nikos Nikoleris, S. Kaxiras, D. Black-Schaffer","doi":"10.1109/SBAC-PAD.2017.14","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2017.14","url":null,"abstract":"Filter caches and way-predictors are common approaches to improve the efficiency and/or performance of first-level caches. Filter caches use a small L0 to provide more efficient and faster access to a small subset of the data, and work well for programs with high locality. Way-predictors improve efficiency by accessing only the way predicted, which alleviates the need to read all ways in parallel without increasing latency, but hurts performance due to mispredictions.In this work we examine how SRAM layout constraints (h-trees and data mapping inside the cache) affect way-predictors and filter caches. We show that accessing the smaller L0 array can be significantly more energy efficient than attempting to read fewer ways from a larger L1 cache; and that the main source of energy inefficiency in filter caches comes from L0 and L1 misses. We propose a filter cache optimization that shares the tag array between the L0 and the L1, which incurs the overhead of reading the larger tag array on every access, but in return allows us to directly access the correct L1 way on each L0 miss. This optimization does not add any extra latency and counter-intuitively, improves the filter caches overall energy efficiency beyond that of the way-predictor.By combining the low power benefits of a physically smaller L0 with the reduction in miss energy by reading L1 tags upfront in parallel with L0 data, we show that the optimized filter cache reduces the dynamic cache energy compared to a traditional filter cache by 26% while providing the same performance advantage. Compared to a way-predictor, the optimized cache improves performance by 6% and energy by 2%.","PeriodicalId":187204,"journal":{"name":"2017 29th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130163401","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
Exploring Heterogeneous Mobile Architectures with a High-Level Programming Model
W. D. C. Moreira, Guilherme Andrade, Pedro Caldeira, Renato Utsch Goncalves, R. Ferreira, L. Rocha, Renan de Carvalho Sousa, Millas Nasser Ramsses Avelar
The development of new technologies is setting a new era characterized, among other factors, by the rise of sophisticated mobile devices containing CPUs and GPUs. This emerging scenario of heterogeneous mobile architectures brings challenging issues regarding the use of the available computing resources, mainly related to the intrinsic complexity of coordinating these processors in order to increase application performance. In this context, this paper presents a high-level programming model to implement parallel patterns that can be executed in a coordinated way on heterogeneous mobile architectures. A comparative analysis of performance and programming complexity is presented, contrasting code generated automatically from the proposed programming model with low-level, manually optimized implementations.
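The flavor of such a model can be sketched with a map pattern whose iteration space is split between two workers standing in for the CPU and GPU backends; the framework's real API and code generator are not shown in the abstract.

```python
# Illustrative sketch of a co-executed `map` parallel pattern: split the
# iteration space between two workers that stand in for CPU and GPU backends.
# The `split` ratio plays the role a load-balancing policy would play.
from concurrent.futures import ThreadPoolExecutor

def parallel_map(fn, data, split=0.5):
    """Apply fn element-wise, giving `split` of the work to the first device."""
    cut = int(len(data) * split)
    with ThreadPoolExecutor(max_workers=2) as pool:
        cpu_part = pool.submit(lambda: [fn(x) for x in data[:cut]])
        gpu_part = pool.submit(lambda: [fn(x) for x in data[cut:]])
        return cpu_part.result() + gpu_part.result()

# Usage: 30% of the elements go to the "CPU" worker, the rest to the "GPU".
print(parallel_map(lambda x: x * x, list(range(10)), split=0.3))
```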
{"title":"Exploring Heterogeneous Mobile Architectures with a High-Level Programming Model","authors":"W. D. C. Moreira, Guilherme Andrade, Pedro Caldeira, Renato Utsch Goncalves, R. Ferreira, L. Rocha, Renan de Carvalho Sousa, Millas Nasser Ramsses Avelar","doi":"10.1109/SBAC-PAD.2017.11","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2017.11","url":null,"abstract":"The development of new technologies is setting a new era characterized, among other factors, by the rise of sophisticated mobile devices containing CPUs and GPUs. This emerging scenario of heterogeneous mobile architectures brings challenging issues regarding the use of the available computing resources. Such issues are mainly related to the intrinsic complexity of coordinating these processors in order to increase application performance. In this sense, this paper presents a high-level programming model to implement parallel patterns that can be executed in a coordinate way by heterogeneous mobile architectures. A comparative analysis of performance and programming complexity is presented, contrasting code generated automatically from the proposed programming model with low-level manually-optimized implementations.","PeriodicalId":187204,"journal":{"name":"2017 29th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126424520","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
Online Multimedia Similarity Search with Response Time-Aware Parallelism and Task Granularity Auto-Tuning
Guilherme Andrade, George Teodoro, R. Ferreira
This paper presents an efficient parallel implementation of Product Quantization based approximate nearest-neighbor multimedia similarity search indexing (PQANNS). The parallel PQANNS efficiently answers nearest-neighbor queries by exploiting the ability of the quantization approach to reduce data dimensionality (and memory demand) and by leveraging parallelism to speed up the search capabilities of the application. Our solution is also optimized to minimize query response times under the fluctuating query rates (load) observed in online services. To achieve this goal, we have developed strategies that dynamically select the parallelism configuration and task granularity minimizing query response times during execution. The proposed strategies (ADAPT and ADAPT+G) were thoroughly evaluated and shown, for instance, to reduce query response times by 6.4x compared to the best static configuration of parallelism and task granularity.
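The response-time-aware tuning loop can be illustrated with a simple hill-climbing controller over the configuration space; ADAPT's actual policy is not spelled out in the abstract, so the neighbor-search rule below is an assumption.

```python
# Sketch of a response-time-aware auto-tuner (assumed policy, not ADAPT):
# after each measurement window, move to a neighbouring parallelism
# configuration if it lowers the observed average response time.
def tune(measure, configs, windows=20):
    """measure(config) -> avg response time; returns the config trajectory."""
    i, best = 0, measure(configs[0])
    history = [(configs[0], best)]
    for _ in range(windows):
        for j in (i - 1, i + 1):             # probe the neighbouring configs
            if 0 <= j < len(configs):
                t = measure(configs[j])
                if t < best:
                    i, best = j, t
        history.append((configs[i], best))
    return history

# Toy stand-in for the running service: response time is minimised at 8
# threads; the controller walks the config list until it settles there.
latency = lambda threads: (threads - 8) ** 2 / 8 + 1.0
print(tune(latency, configs=[1, 2, 4, 8, 16, 32])[-1])
```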
{"title":"Online Multimedia Similarity Search with Response Time-Aware Parallelism and Task Granularity Auto-Tuning","authors":"Guilherme Andrade, George Teodoro, R. Ferreira","doi":"10.1109/SBAC-PAD.2017.27","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2017.27","url":null,"abstract":"This paper presents an efficient parallel implementation of the Product Quantization based approximate nearest neighbor multimedia similarity search indexing (PQANNS). The parallel PQANNS efficiently answers nearest neighbor queries by exploiting the ability of the quantization approach to reduce the data dimensionality (and memory demand) and by leveraging parallelism to speed up the search capabilities of the application. Our solution is also optimized to minimize query response times under scenarios with fluctuating query rates (load) as observed in online services. To achieve this goal, we have developed strategies to dynamically select the parallelism configuration and task granularity that minimizes the query response times during the execution. The proposed strategies (ADAPT and ADAPT+G) were thoroughly evaluated and have shown, for instance, to reduce the query response times in 6.4x as compared to the best static configuration of parallelism and task granularity.","PeriodicalId":187204,"journal":{"name":"2017 29th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"71 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130969413","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2