2017 29th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD): Latest Articles

FGSCM: A Fine-Grained Approach to Transactional Lock Elision

Pub Date : 2017-10-01 DOI: 10.1109/SBAC-PAD.2017.22

Gustavo José de Sousa, A. Baldassin

Speculative Lock Elision (SLE) is a technique that allows critical sections to be executed optimistically by eliding the lock operation and enabling multiple threads to execute concurrently. In case of inconsistencies, the hardware automatically rolls back the execution and pessimistically acquires the original lock at runtime. The decision to elide the lock in SLE is made transparently at the microarchitecture level and, although convenient, it may sometimes hurt performance. To avoid that case, researchers have investigated Transactional Lock Elision (TLE), in which software-controlled hardware transactions are used instead, allowing the creation of policies and heuristics to manage lock elision. Typical implementations of TLE use a single lock to serialize execution when the original lock cannot be elided, which can degrade performance. To improve on such cases, this paper proposes the Fine-Grained Software-assisted Conflict Management (FGSCM) scheme, a TLE technique that employs multiple locks to avoid unnecessary serialization of the code. The main idea of FGSCM is that not all threads that conflict inside a critical section are accessing the same region of shared memory. By automatically assigning distinct locks to these threads according to the memory section they access, the level of concurrency can be increased. In this paper we formalize FGSCM and provide an in-depth performance evaluation using a microbenchmark to stress several conflict behaviors. Our initial results with a prototype implementation using Intel's Restricted Transactional Memory (RTM) are encouraging. On a quad-core machine, we observed an average performance gain of 11% compared to the single-auxiliary-lock SCM and 36% compared to a standard lock scheme, both for typical read-dominated workloads.
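
The idea of assigning locks by memory region can be pictured with a small lock-striping sketch. The abstract does not say how FGSCM maps memory regions to locks, so everything here is an assumption made for illustration: the class name `AuxLockTable`, the 4 KiB region size, and the modulo mapping are hypothetical. Python has no hardware transactions, so only the selection of a fallback lock is shown.

```python
import threading


class AuxLockTable:
    """Hypothetical table of auxiliary locks, one per group of memory regions."""

    def __init__(self, n_locks=8, region_bits=12):
        # region_bits=12 treats every 4 KiB block as one region (an assumption).
        self.region_bits = region_bits
        self.locks = [threading.Lock() for _ in range(n_locks)]

    def index_for(self, addr):
        # Addresses in the same region always map to the same lock index.
        return (addr >> self.region_bits) % len(self.locks)

    def lock_for(self, addr):
        # Lock a thread would take after conflicting on address `addr`.
        return self.locks[self.index_for(addr)]


table = AuxLockTable(n_locks=4, region_bits=12)
same_region = table.lock_for(0x1000) is table.lock_for(0x1ABC)
other_region = table.lock_for(0x1000) is table.lock_for(0x2000)
```

In this sketch, threads that conflict on addresses in the same region serialize on one auxiliary lock, while threads that conflict in other regions can pick a different lock and keep running concurrently. Distinct regions may still collide on a lock because of the modulo mapping.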

Citations: 1

A Publish/Subscribe System Using Causal Broadcast over Dynamically Built Spanning Trees

Pub Date : 2017-06-26 DOI: 10.1109/SBAC-PAD.2017.28

J. Araujo, L. Arantes, E. P. Duarte, L. A. Rodrigues, Pierre Sens

In this paper we present VCube-PS, a topic-based Publish/Subscribe system built on top of a virtual hypercube-like topology. Membership information and messages published to the subscribers (members) of a topic group are broadcast over dynamically built spanning trees rooted at the message's source. For a given topic, delivery of published messages respects causal order. Results of experiments conducted on the PeerSim simulator confirm the efficiency of VCube-PS in terms of scalability, latency, and the number and size of messages when compared to an approach based on a single-rooted tree that is not built dynamically.
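
To make "a spanning tree rooted at the message's source" concrete, here is a minimal sketch of the classic binomial-tree broadcast on a plain hypercube. This is not the VCube-PS algorithm, whose topology is only described as hypercube-like in the abstract. The function names `children` and `broadcast_tree` are hypothetical, and the sketch assumes exactly `2**dim` nodes and no failures.

```python
def children(node, src, dim):
    """Children of `node` in the binomial tree rooted at `src`."""
    rel = node ^ src            # node id relative to the root
    first = rel.bit_length()    # lowest bit index above every set bit (0 for root)
    # Setting one higher bit yields a hypercube neighbour not yet covered.
    return [(rel | (1 << k)) ^ src for k in range(first, dim)]


def broadcast_tree(src, dim):
    """Map every node to its children for a broadcast that starts at `src`."""
    tree = {}
    frontier = [src]
    while frontier:
        node = frontier.pop()
        kids = children(node, src, dim)
        tree[node] = kids
        frontier.extend(kids)
    return tree


tree = broadcast_tree(src=5, dim=3)   # 8 nodes, rooted at node 5
root_children = tree[5]               # [4, 7, 1]
```

Each source gets its own tree from the same rule, so no single node has to act as the root for every message.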

Citations: 9

Object Placement for High Bandwidth Memory Augmented with High Capacity Memory

Pub Date : 1900-01-01 DOI: 10.1109/SBAC-PAD.2017.24

M. Laghari, D. Unat

High bandwidth memory (HBM) is an emerging technology that aims to improve the performance of bandwidth-limited applications. Even though it provides high bandwidth, it must be augmented with DRAM to meet the memory capacity requirements of applications. Due to this limitation, objects in an application should be optimally placed on the heterogeneous memory subsystems. In this study, we propose an object placement algorithm that, when the fast memory is too small to hold all objects, assigns program objects to fast or slow memory so as to increase overall application performance. Our algorithm uses the reference counts and the type of references (read or write) to make an initial placement of data. In addition, we run various memory bandwidth benchmarks on the Intel Knights Landing (KNL) architecture and use the results in our placement algorithm. Not surprisingly, high bandwidth memory sustains higher read bandwidth than write bandwidth; however, placing write-intensive data on HBM results in better overall performance because write-intensive data is penalized more severely by DRAM speed than read-intensive data. Moreover, our benchmarks demonstrate that, in some cases, a basic block that references both types of memory performs worse than one that references only one type. We test our proposed placement algorithm with six applications under various system configurations. By allocating objects according to our placement scheme, we are able to achieve a speedup of up to 2x.
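
A placement driven by reference counts and reference types can be sketched as a greedy heuristic. The abstract does not give the authors' algorithm, so this is only an illustration under stated assumptions: the names `Obj` and `place`, the weights, and the score-per-byte ordering are hypothetical. The write weight is set above the read weight only to mirror the observation that write-intensive data suffers more in DRAM.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Obj:
    name: str
    size: int     # bytes, assumed to be greater than zero
    reads: int    # read reference count
    writes: int   # write reference count


def place(objects, hbm_capacity, read_weight=1.0, write_weight=2.0):
    """Greedily fill HBM with the objects that score highest per byte."""
    def density(o):
        return (read_weight * o.reads + write_weight * o.writes) / o.size

    placement = {}
    free = hbm_capacity
    for o in sorted(objects, key=density, reverse=True):
        if o.size <= free:
            placement[o.name] = "HBM"
            free -= o.size
        else:
            placement[o.name] = "DRAM"   # does not fit in the remaining HBM
    return placement


objs = [Obj("a", 4, 100, 0), Obj("b", 4, 0, 100), Obj("c", 8, 10, 10)]
result = place(objs, hbm_capacity=4)
# result == {"b": "HBM", "a": "DRAM", "c": "DRAM"}
```

With room for only one 4-byte object, the write-heavy `b` wins the HBM slot over the equally sized read-heavy `a`, which is the behaviour the assumed weights are meant to produce.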

Citations: 9
