Pub Date: 2017-10-01 DOI: 10.1109/SBAC-PAD.2017.22
Gustavo José de Sousa, A. Baldassin
Speculative Lock Elision (SLE) is a technique that allows critical sections to be executed optimistically by eliding the lock operation, enabling multiple threads to execute concurrently. If an inconsistency is detected, the hardware automatically rolls back the execution and pessimistically acquires the original lock at runtime. In SLE, the decision to elide the lock is made transparently at the microarchitecture level; although convenient, this may sometimes hurt performance. To avoid that, researchers have investigated Transactional Lock Elision (TLE), in which software-controlled hardware transactions are used instead, allowing policies and heuristics to be created to manage lock elision. Typical implementations of TLE use a single lock to serialize execution when the original lock cannot be elided, which can degrade performance. To improve on such cases, this paper proposes the Fine-Grained Software-assisted Conflict Management (FGSCM) scheme, a TLE technique that employs multiple locks to avoid unnecessary serialization of the code. The main idea of FGSCM is that not all threads that conflict inside a critical section are accessing the same region of shared memory. By automatically assigning distinct locks to these threads according to the memory region they access, the level of concurrency can be increased. In this paper we formalize FGSCM and provide an in-depth performance evaluation using a microbenchmark that stresses several conflict behaviors. Our initial results with a prototype implementation using Intel's Restricted Transactional Memory (RTM) are encouraging. On a quad-core machine, we observed an average performance gain of 11% compared to the single-auxiliary-lock SCM and 36% compared to a standard lock scheme, both for typical read-dominated workloads.
Title: FGSCM: A Fine-Grained Approach to Transactional Lock Elision. Published in: 2017 29th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)
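The idea of assigning distinct auxiliary locks according to the memory region involved in a conflict can be illustrated with a small sketch. This is not the authors' implementation: Python has no access to hardware transactions, so only the lock-selection step is shown, and the region size (`REGION_BITS`), the table size (`NUM_AUX_LOCKS`), and the function names are all hypothetical choices made for illustration.

```python
import threading

# Assumptions for illustration only (not taken from the paper):
REGION_BITS = 12     # treat each 4 KiB block of memory as one "region"
NUM_AUX_LOCKS = 8    # size of the auxiliary lock table

aux_locks = [threading.Lock() for _ in range(NUM_AUX_LOCKS)]


def aux_lock_index(conflict_address: int) -> int:
    """Map the address involved in a conflict to a slot in the lock table."""
    return (conflict_address >> REGION_BITS) % NUM_AUX_LOCKS


def run_serialized(conflict_address: int, critical_section):
    """Run the critical section while holding the auxiliary lock of its region.

    Threads whose conflicts fall in different regions usually pick different
    locks and therefore do not serialize against each other.
    """
    with aux_locks[aux_lock_index(conflict_address)]:
        return critical_section()
```

With a single auxiliary lock, every conflicting thread would queue on the same lock; with a table like this one, only threads whose conflict addresses hash to the same slot wait for each other. Distinct regions can still collide on one slot (for example, regions 1 and 9 with eight locks), which costs concurrency but not correctness.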
Pub Date: 2017-06-26 DOI: 10.1109/SBAC-PAD.2017.28
J. Araujo, L. Arantes, E. P. Duarte, L. A. Rodrigues, Pierre Sens
In this paper we present VCube-PS, a topic-based Publish/Subscribe system built on top of a virtual hypercube-like topology. Membership information and published messages are broadcast to the subscribers (members) of a topic group over dynamically built spanning trees rooted at the message’s source. For a given topic, delivery of published messages respects causal order. Performance results of experiments conducted on the PeerSim simulator confirm the efficiency of VCube-PS in terms of scalability, latency, and the number and size of messages when compared to an approach based on a single-rooted tree that is not built dynamically.
Title: A Publish/Subscribe System Using Causal Broadcast over Dynamically Built Spanning Trees. Published in: 2017 29th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)
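Causal-order delivery, as mentioned in the abstract, can be illustrated with the textbook vector-clock rule. The abstract does not say which mechanism VCube-PS uses, so the sketch below is an assumption made purely for illustration; the class name `CausalReceiver` and its methods are hypothetical.

```python
class CausalReceiver:
    """Buffers messages and delivers them in causal order (vector-clock rule)."""

    def __init__(self, num_processes: int):
        # delivered[j] = number of messages from process j delivered so far
        self.delivered = [0] * num_processes
        self.pending = []   # messages received but not yet deliverable
        self.log = []       # payloads in delivery order

    def _deliverable(self, sender: int, clock: list) -> bool:
        # The message must be the next one from its sender, and everything it
        # causally depends on from other processes must already be delivered.
        return all(
            (clock[k] == self.delivered[k] + 1) if k == sender
            else (clock[k] <= self.delivered[k])
            for k in range(len(clock))
        )

    def receive(self, sender: int, clock: list, payload) -> None:
        self.pending.append((sender, clock, payload))
        progress = True
        while progress:  # one delivery may unblock buffered messages
            progress = False
            for msg in list(self.pending):
                s, c, p = msg
                if self._deliverable(s, c):
                    self.pending.remove(msg)
                    self.delivered[s] += 1
                    self.log.append(p)
                    progress = True
```

For example, if a reply stamped `[1, 1, 0]` from process 1 arrives before the post stamped `[1, 0, 0]` from process 0 that it answers, the reply stays buffered until the post has been delivered.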
Pub Date: 1900-01-01 DOI: 10.1109/SBAC-PAD.2017.24
M. Laghari, D. Unat
High bandwidth memory (HBM) is an emerging technology that aims to improve the performance of bandwidth-limited applications. Even though it provides high bandwidth, it must be augmented with DRAM to meet the memory capacity requirements of applications. Due to this limitation, objects in an application should be optimally placed on the heterogeneous memory subsystems. In this study, we propose an object placement algorithm that assigns program objects to fast or slow memory when the capacity of the fast memory is insufficient to hold all the objects, with the goal of increasing overall application performance. Our algorithm uses the reference counts and the type of references (read or write) to make an initial placement of data. In addition, we run various memory bandwidth benchmarks on the Intel Knights Landing (KNL) architecture, whose results are used in our placement algorithm. Not surprisingly, high bandwidth memory sustains higher read bandwidth than write bandwidth. However, placing write-intensive data on HBM results in better overall performance, because write-intensive data is penalized by the DRAM speed more severely than read-intensive data. Moreover, our benchmarks demonstrate that, in some cases, a basic block that makes references to both types of memory performs worse than one that makes references to only one type of memory. We test our proposed placement algorithm with 6 applications under various system configurations. By allocating objects according to our placement scheme, we are able to achieve a speedup of up to 2x.
Title: Object Placement for High Bandwidth Memory Augmented with High Capacity Memory. Published in: 2017 29th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)
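A placement that uses reference counts and reference types can be sketched as a simple greedy heuristic. This is not the paper's algorithm: the greedy strategy, the weights `READ_WEIGHT` and `WRITE_WEIGHT`, and the names `MemObject` and `place` are hypothetical. Writes are weighted more heavily only to reflect the abstract's observation that write-intensive data suffers more from DRAM speed.

```python
from dataclasses import dataclass

# Hypothetical weights: writes count more than reads (assumption, not measured).
READ_WEIGHT = 1.0
WRITE_WEIGHT = 2.0


@dataclass
class MemObject:
    name: str
    size: int     # bytes, must be > 0
    reads: int    # read reference count
    writes: int   # write reference count


def place(objects, hbm_capacity: int) -> dict:
    """Greedily assign the objects with the highest weighted references per byte to HBM."""

    def density(obj: MemObject) -> float:
        return (READ_WEIGHT * obj.reads + WRITE_WEIGHT * obj.writes) / obj.size

    placement = {}
    free = hbm_capacity
    for obj in sorted(objects, key=density, reverse=True):
        if obj.size <= free:
            placement[obj.name] = "HBM"
            free -= obj.size
        else:
            placement[obj.name] = "DRAM"   # falls back to high-capacity memory
    return placement
```

With two equally sized objects that have the same number of references and room for only one in HBM, this sketch keeps the write-intensive object in HBM and sends the read-intensive one to DRAM.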