Software-defined massive multicore networking via freespace optical interconnect
Y. Katayama, A. Okazaki, N. Ohba. doi:10.1145/2482767.2482802

This paper presents a new frontier along which future computer systems can continue to evolve as CMOS technology reaches its fundamental performance and density scaling limits. Our idea adopts free-space circuit-switched optical interconnect for massive multicore networking on chips and modules, flexibly configuring private cache-coherent networks for allocated groups of cores in a software-defined manner. The proposed scheme avoids the networking inefficiencies caused by core resource fragmentation by providing deterministically lower latencies and higher bandwidth, while advancing the technology roadmap with lower power consumption and improved cooling. We also discuss the implementation plan and challenges for our proposal.
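To make the software-defined allocation idea concrete, the following minimal sketch (the class names and the full-mesh topology are illustrative assumptions, not the paper's design) shows an allocator dedicating point-to-point optical circuits to a group of cores, giving the group a private, contention-free network even when its cores are physically fragmented across the die:

```python
from itertools import combinations

class OpticalCircuitSwitch:
    """Toy free-space circuit switch: a set of point-to-point circuits
    that software can set up and tear down at run time."""
    def __init__(self):
        self.circuits = set()                 # frozenset({core_a, core_b})

    def connect(self, a, b):
        self.circuits.add(frozenset((a, b)))

    def tear_down(self, group):
        self.circuits = {c for c in self.circuits if not (c & set(group))}

def allocate_private_network(switch, cores):
    """Configure a dedicated full mesh among the allocated cores."""
    for a, b in combinations(cores, 2):
        switch.connect(a, b)
    return set(cores)

switch = OpticalCircuitSwitch()
allocate_private_network(switch, [0, 1, 5, 12])   # fragmented core group
print(len(switch.circuits))                       # 6 = C(4,2) circuits
```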
{"title":"Software-defined massive multicore networking via freespace optical interconnect","authors":"Y. Katayama, A. Okazaki, N. Ohba","doi":"10.1145/2482767.2482802","DOIUrl":"https://doi.org/10.1145/2482767.2482802","url":null,"abstract":"This paper presents a new frontier where future computer systems can continue to evolve as CMOS technology reaches its fundamental performance and density scaling limits. Our idea adopts freespace circuit-switched optical interconnect in massive multicore networking on chips and modules to flexibly configure private cache-coherent networks for allocated groups of cores in a software-defined manner. The proposed scheme can avoid networking inefficiencies due to the core resource fragmentation by providing deterministically lower latencies and higher bandwidth while advancing the technology roadmap with lower power consumption and improved cooling. We also discuss implementation plan and challenges for our proposal.","PeriodicalId":430420,"journal":{"name":"ACM International Conference on Computing Frontiers","volume":"14 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123446649","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
System integration of tightly-coupled processor arrays using reconfigurable buffer structures
Frank Hannig, Moritz Schmid, Vahid Lari, Srinivas Boppu, J. Teich. doi:10.1145/2482767.2482770
As data locality is a key factor in accelerating loop programs on processor arrays, we propose a buffer architecture that can be configured at run-time to select between different memory-access schemes. In addition to traditional address-based memory banks, the buffer architecture can deliver data in a streaming manner to the processing elements of the array, which supports dense and sparse stencil operations. Moreover, to minimize data transfers to the buffers, the design contains an interlinked mode that is especially targeted at 2-D kernel computations. The buffers can be used individually to achieve high data throughput by utilizing the maximum number of I/O channels to the array, or concatenated to provide higher storage capacity with a reduced number of I/O channels.
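As a rough sketch of the streaming mode (the function and parameters are illustrative, not the paper's hardware), a small line buffer can retain the last k rows of a pixel stream and emit full k×k stencil windows as each new row completes, so the processing elements never issue addresses:

```python
from collections import deque

def stream_stencil_windows(pixels, width, k=3):
    """Toy line buffer: keep the last k rows of a flat pixel stream and
    emit every k x k window as soon as a new row completes."""
    rows = deque(maxlen=k)
    row = []
    for p in pixels:
        row.append(p)
        if len(row) == width:                 # a full row has streamed in
            rows.append(row)
            row = []
            if len(rows) == k:                # enough rows buffered
                for x in range(width - k + 1):
                    yield [r[x:x + k] for r in rows]

# 4x4 image as a flat stream -> 2x2 = 4 windows of shape 3x3
windows = list(stream_stencil_windows(range(16), width=4))
print(len(windows), windows[0])
```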
{"title":"System integration of tightly-coupled processor arrays using reconfigurable buffer structures","authors":"Frank Hannig, Moritz Schmid, Vahid Lari, Srinivas Boppu, J. Teich","doi":"10.1145/2482767.2482770","DOIUrl":"https://doi.org/10.1145/2482767.2482770","url":null,"abstract":"As data locality is a key factor for the acceleration of loop programs on processor arrays, we propose a buffer architecture that can be configured at run-time to select between different schemes for memory access. In addition to traditional address-based memory banks, the buffer architecture can deliver data in a streaming manner to the processing elements of the array, which supports dense and sparse stencil operations. Moreover, to minimize data transfers to the buffers, the design contains an interlinked mode, which is especially targeted at 2-D kernel computations. The buffers can be used individually to achieve high data throughput by utilizing a maximum number of I/O channels to the array, or concatenated to provide higher storage capacity at a reduced amount of I/O channels.","PeriodicalId":430420,"journal":{"name":"ACM International Conference on Computing Frontiers","volume":"112 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116723807","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Distributed queues in shared memory: multicore performance and scalability through quantitative relaxation
Andreas Haas, Michael Lippautz, T. Henzinger, H. Payer, A. Sokolova, C. Kirsch, A. Sezgin. doi:10.1145/2482767.2482789
A prominent remedy to multicore scalability issues in concurrent data structure implementations is to relax the sequential specification of the data structure. We present distributed queues (DQ), a new family of relaxed concurrent queue implementations. DQs implement relaxed queues with a linearizable emptiness check and either configurable or bounded out-of-order behavior, or pool behavior. Our experiments show that, in both micro- and macrobenchmarks, DQs outperform and outscale all of the strict and relaxed queue and pool implementations that we considered.
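The core idea can be illustrated with a toy model (lock-based for brevity; the paper's implementations are lock-free and their linearizable emptiness check is subtler than a single scan): a DQ spreads items over p partial FIFO queues, relaxing global FIFO order in exchange for p points of synchronization instead of one:

```python
import random
import threading
from collections import deque

class DistributedQueue:
    """Toy distributed queue: p partial FIFO queues with randomized
    balancing; items may be returned out of global FIFO order."""
    def __init__(self, p=8):
        self.partials = [deque() for _ in range(p)]
        self.locks = [threading.Lock() for _ in range(p)]

    def enqueue(self, item):
        i = random.randrange(len(self.partials))
        with self.locks[i]:
            self.partials[i].append(item)

    def dequeue(self):
        start = random.randrange(len(self.partials))
        for k in range(len(self.partials)):       # scan all partial queues
            i = (start + k) % len(self.partials)
            with self.locks[i]:
                if self.partials[i]:
                    return self.partials[i].popleft()
        return None    # every partial queue was observed empty
```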
{"title":"Distributed queues in shared memory: multicore performance and scalability through quantitative relaxation","authors":"Andreas Haas, Michael Lippautz, T. Henzinger, H. Payer, A. Sokolova, C. Kirsch, A. Sezgin","doi":"10.1145/2482767.2482789","DOIUrl":"https://doi.org/10.1145/2482767.2482789","url":null,"abstract":"A prominent remedy to multicore scalability issues in concurrent data structure implementations is to relax the sequential specification of the data structure. We present distributed queues (DQ), a new family of relaxed concurrent queue implementations. DQs implement relaxed queues with linearizable emptiness check and either configurable or bounded out-of-order behavior or pool behavior. Our experiments show that DQs outperform and outscale in micro- and macrobenchmarks all strict and relaxed queue as well as pool implementations that we considered.","PeriodicalId":430420,"journal":{"name":"ACM International Conference on Computing Frontiers","volume":"94 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134241496","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A divide-and-conquer approach for solving singular value decomposition on a heterogeneous system
Ding Liu, Ruixuan Li, D. Lilja, Weijun Xiao. doi:10.1145/2482767.2482813

Singular value decomposition (SVD) is a fundamental linear-algebra operation used in many applications, such as pattern recognition and statistical information processing. To accelerate this time-consuming operation, this paper presents a new divide-and-conquer approach for solving SVD on a heterogeneous CPU-GPU system. We carefully design our algorithm to match the mathematical requirements of SVD to the unique characteristics of a heterogeneous computing platform. This includes a high-performance solution to the secular equation with good numerical stability, overlapping of CPU and GPU tasks, and leveraging the GPU bandwidth in a heterogeneous system. The experimental results show that our algorithm outperforms MKL's divide-and-conquer routine [18] with four cores (eight hardware threads) when the matrix size exceeds 3000. Furthermore, it is up to 33 times faster than LAPACK's divide-and-conquer routine [17], 3 times faster than MKL's divide-and-conquer routine with four cores, and 7 times faster than CULA on the same device when the matrix size grows to 14,000. Our algorithm is also much faster than previous SVD approaches on GPUs.
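The overall recursion can be sketched structurally as follows (illustration only: the re-factorization in the merge step is a placeholder for the rank-one update and secular-equation solve that the paper distributes between the CPU and the GPU):

```python
import numpy as np

def dc_svd(B, base=32):
    """Structural sketch of divide-and-conquer SVD on a bidiagonal-like
    matrix B; returns (U, s, Vh) as numpy's svd does."""
    n = B.shape[0]
    if n <= base:                        # small subproblem: solve directly
        return np.linalg.svd(B)
    k = n // 2                           # divide into two half-size problems
    dc_svd(B[:k, :k], base)              # children solved (their results
    dc_svd(B[k:, k:], base)              # would be merged via the secular
    return np.linalg.svd(B)              # equation; placeholder merge here)

B = np.diag(np.arange(1.0, 9.0)) + np.diag(np.ones(7), 1)   # 8x8 bidiagonal
U, s, Vh = dc_svd(B, base=2)
print(np.allclose(U @ np.diag(s) @ Vh, B))                  # True
```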
{"title":"A divide-and-conquer approach for solving singular value decomposition on a heterogeneous system","authors":"Ding Liu, Ruixuan Li, D. Lilja, Weijun Xiao","doi":"10.1145/2482767.2482813","DOIUrl":"https://doi.org/10.1145/2482767.2482813","url":null,"abstract":"Singular value decomposition (SVD) is a fundamental linear operation that has been used for many applications, such as pattern recognition and statistical information processing. In order to accelerate this time-consuming operation, this paper presents a new divide-and-conquer approach for solving SVD on a heterogeneous CPU-GPU system. We carefully design our algorithm to match the mathematical requirements of SVD to the unique characteristics of a heterogeneous computing platform. This includes a high-performanc solution to the secular equation with good numerical stability, overlapping the CPU and the GPU tasks, and leveraging the GPU bandwidth in a heterogeneous system. The experimental results show that our algorithm has better performance than MKL's divide-and-conquer routine [18] with four cores (eight hardware threads) when the size of the input matrix is larger than 3000. Furthermore, it is up to 33 times faster than LAPACK's divide-and-conquer routine [17], 3 times faster than MKL's divide-and-conquer routine with four cores, and 7 times faster than CULA on the same device, when the size of the matrix grows up to 14,000. Our algorithm is also much faster than previous SVD approaches on GPUs.","PeriodicalId":430420,"journal":{"name":"ACM International Conference on Computing Frontiers","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126078727","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Uncovering CPU load balancing policies with Harmony
Joe Meehean, A. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, M. Livny. doi:10.1145/2482767.2482784
We introduce Harmony, a system for extracting multiprocessor scheduling policies from commodity operating systems. Harmony can be used to unearth many aspects of multiprocessor scheduling policy, including the nuanced behaviors of core scheduling mechanisms and policies. We demonstrate the effectiveness of Harmony by applying it to the analysis of the load-balancing behavior of three Linux schedulers: O(1), CFS, and BFS. Our analysis uncovers the strengths and weaknesses of each of these schedulers and, more generally, shows how to use Harmony to perform detailed analyses of complex scheduling systems.
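In the spirit of Harmony, though not the authors' tool, a user-space probe can expose load-balancing behavior on Linux by running CPU-bound threads and sampling which CPU each thread occupies via /proc (field offsets per proc(5)):

```python
import collections
import threading
import time

def spin(stop):
    while not stop.is_set():
        pass                               # pure CPU load for the balancer

def current_cpu(tid):
    with open(f"/proc/self/task/{tid}/stat") as f:
        fields = f.read().rsplit(")", 1)[1].split()
    return int(fields[36])                 # field 39 of stat: last-run CPU

stop = threading.Event()
threads = [threading.Thread(target=spin, args=(stop,)) for _ in range(4)]
for t in threads:
    t.start()
placements = collections.defaultdict(list)
for _ in range(20):                        # sample placements for ~2 seconds
    for t in threads:
        placements[t.native_id].append(current_cpu(t.native_id))
    time.sleep(0.1)
stop.set()
for t in threads:
    t.join()
for tid, cpus in placements.items():
    print(tid, "migrations:", sum(a != b for a, b in zip(cpus, cpus[1:])))
```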
{"title":"Uncovering CPU load balancing policies with harmony","authors":"Joe Meehean, A. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, M. Livny","doi":"10.1145/2482767.2482784","DOIUrl":"https://doi.org/10.1145/2482767.2482784","url":null,"abstract":"We introduce Harmony, a system for extracting the multiprocessor scheduling policies from commodity operating systems. Harmony can be used to unearth many aspects of multiprocessor scheduling policy, including the nuanced behaviors of core scheduling mechanisms and policies. We demonstrate the effectiveness of Harmony by applying it to the analysis of the load-balancing behavior of three Linux schedulers: O(1), CFS, and BFS. Our analysis uncovers the strengths and weaknesses of each of these schedulers, and more generally shows how to utilize Harmony to perform detailed analyses of complex scheduling systems.","PeriodicalId":430420,"journal":{"name":"ACM International Conference on Computing Frontiers","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130511957","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multi-processor architectural support for protecting virtual machine privacy in untrusted cloud environment
Y. Wen, Jong-Hyuk Lee, Ziyi Liu, Qingji Zheng, W. Shi, Shouhuai Xu, Taeweon Suh. doi:10.1145/2482767.2482799
Virtualization is fundamental to cloud computing because it allows multiple operating systems to run simultaneously on a physical machine. However, it also brings a range of security/privacy problems. One particularly challenging and important problem is: how can we protect the Virtual Machines (VMs) from being attacked by Virtual Machine Monitors (VMMs) and/or by the cloud vendors when they are not trusted? In this paper, we propose an architectural solution to the above problem in multi-processor cloud environments. Our key idea is to exploit hardware mechanisms to enforce access control over the shared resources (e.g., memory spaces), while protecting VM memory integrity as well as inter-processor communications and data sharing. We evaluate the solution using full-system emulation and cycle-based architecture models. Experiments based on 20 benchmark applications show that the performance overhead is 1.5%--10% when access control is enforced, and 9%--19% when VM memory is encrypted.
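A toy software model of the access-control idea (the names, tables, and page granularity are assumptions; the paper enforces this in hardware, beneath the VMM) looks like the following:

```python
class MemoryGuard:
    """Toy model of hardware-enforced ownership: each physical page is
    owned by one VM, and any other agent (another VM or the VMM itself)
    is denied unless the page was explicitly shared."""
    def __init__(self):
        self.owner = {}              # phys_page -> owning vm_id
        self.shared = set()          # (phys_page, vm_id) grants

    def assign(self, page, vm):
        self.owner[page] = vm

    def grant(self, page, vm):
        self.shared.add((page, vm))

    def check(self, page, vm):
        if self.owner.get(page) == vm or (page, vm) in self.shared:
            return True
        raise PermissionError(f"agent {vm} denied access to page {page:#x}")

guard = MemoryGuard()
guard.assign(0x1000, vm=1)
guard.check(0x1000, vm=1)            # owner: allowed
# guard.check(0x1000, vm=2)          # raises: VMM and other VMs are untrusted
```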
{"title":"Multi-processor architectural support for protecting virtual machine privacy in untrusted cloud environment","authors":"Y. Wen, Jong-Hyuk Lee, Ziyi Liu, Qingji Zheng, W. Shi, Shouhuai Xu, Taeweon Suh","doi":"10.1145/2482767.2482799","DOIUrl":"https://doi.org/10.1145/2482767.2482799","url":null,"abstract":"Virtualization is fundamental to cloud computing because it allows multiple operating systems to run simultaneously on a physical machine. However, it also brings a range of security/privacy problems. One particularly challenging and important problem is: how can we protect the Virtual Machines (VMs) from being attacked by Virtual Machine Monitors (VMMs) and/or by the cloud vendors when they are not trusted? In this paper, we propose an architectural solution to the above problem in multi-processor cloud environments. Our key idea is to exploit hardware mechanisms to enforce access control over the shared resources (e.g., memory spaces), while protecting VM memory integrity as well as inter-processor communications and data sharing. We evaluate the solution using full-system emulation and cycle-based architecture models. Experiments based on 20 benchmark applications show that the performance overhead is 1.5%--10% when access control is enforced, and 9%--19% when VM memory is encrypted.","PeriodicalId":430420,"journal":{"name":"ACM International Conference on Computing Frontiers","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123110347","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Kinship: efficient resource management for performance and functionally asymmetric platforms
Vishakha Gupta, Rob C. Knauerhase, P. Brett, K. Schwan. doi:10.1145/2482767.2482787
On-chip heterogeneity has become key to balancing performance and power constraints, resulting in disparate (functionally overlapping but not equivalent) cores on a single die. Requiring developers to deal with such heterogeneity can impede adoption through increased programming effort and can result in cross-platform incompatibility. We propose that systems software must evolve to dynamically accommodate heterogeneity and to automatically choose task-to-resource mappings that best use these features. We describe the kinship approach for mapping workloads to heterogeneous cores. A hypervisor-level realization of the approach on a variety of experimental heterogeneous platforms demonstrates the general applicability and utility of kinship-based scheduling: it matches dynamic workloads to available resources and scales with the number of processes and with different types and configurations of compute resources. The performance advantages of kinship-based scheduling are evident across runs on multiple generations of heterogeneous platforms.
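A minimal sketch of kinship-style placement (the core profiles, weights, and power costs below are made up for illustration; the paper's kinship metric is richer) scores how well a task's runtime profile fits each core type and places it accordingly:

```python
CORES = {                     # assumed per-core-type capability profiles
    "big":    {"ilp": 1.0, "simd": 1.0, "freq": 1.0, "power": 1.0},
    "little": {"ilp": 0.4, "simd": 0.2, "freq": 0.6, "power": 0.5},
}

def kinship(task, core):
    """Higher when the task's demands line up with the core's strengths,
    normalized by a rough power cost."""
    fit = sum(task[k] * core[k] for k in ("ilp", "simd", "freq"))
    return fit / core["power"]

def place(task, free_slots):
    best = max((c for c, n in free_slots.items() if n > 0),
               key=lambda c: kinship(task, CORES[c]))
    free_slots[best] -= 1
    return best

slots = {"big": 2, "little": 6}
print(place({"ilp": 0.9, "simd": 0.8, "freq": 0.7}, slots))   # -> big
print(place({"ilp": 0.1, "simd": 0.0, "freq": 0.3}, slots))   # -> little
```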
{"title":"Kinship: efficient resource management for performance and functionally asymmetric platforms","authors":"Vishakha Gupta, Rob C. Knauerhase, P. Brett, K. Schwan","doi":"10.1145/2482767.2482787","DOIUrl":"https://doi.org/10.1145/2482767.2482787","url":null,"abstract":"On-chip heterogeneity has become key to balancing performance and power constraints, resulting in disparate (functionally overlapping but not equivalent) cores on a single die. Requiring developers to deal with such heterogeneity can impede adoption through increased programming effort and result in cross-platform incompatibility. We propose that systems software must evolve to dynamically accommodate heterogeneity and to automatically choose task-to-resource mappings to best use these features. We describe the kinship approach for mapping workloads to heterogeneous cores. A hypervisor-level realization of the approach on a variety of experimental heterogeneous platforms demonstrates the general applicability and utility of kinship-based scheduling, matching dynamic workloads to available resources as well as scaling with the number of processes and with different types/configurations of compute resources. Performance advantages of kinship based scheduling are evident for runs across multiple generations of heterogeneous platforms.","PeriodicalId":430420,"journal":{"name":"ACM International Conference on Computing Frontiers","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115431386","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cost-effective soft-error protection for SRAM-based structures in GPGPUs
Jingweijia Tan, Zhi Li, Xin Fu. doi:10.1145/2482767.2482804

General-purpose computing on graphics processing units (GPGPU) is increasingly used to accelerate parallel applications. This makes reliability a growing concern in GPUs, which were originally designed for graphics processing with relaxed requirements for execution correctness. As CMOS processing technologies continue to scale down to the nano-scale, the on-chip soft error rate (SER) is predicted to increase exponentially. GPGPUs, with hundreds of cores integrated into a single chip, are prone to a high SER. This paper aims to enhance GPGPU reliability in light of soft errors. We leverage GPGPU microarchitecture characteristics and propose energy-efficient protection mechanisms for two typical SRAM-based structures (the instruction buffer and the registers) that are highly susceptible. We develop a Similarity-AWare Protection (SAWP) scheme that leverages instruction similarity to provide near-full ECC protection for the instruction buffer with very little area and power overhead. Based on the observation that shared memory usually exhibits low utilization, we propose the SHAred memory to Register Protection (SHARP) scheme, which intelligently leverages shared memory to hold the ECCs of registers. Experimental results show that our techniques substantially reduce structure vulnerability and significantly reduce power consumption compared to a full ECC protection mechanism.
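The SHARP observation reduces to simple capacity arithmetic; the sketch below uses assumed sizes (not the paper's parameters) to show how leftover shared memory can cover ECC for the register file:

```python
SHMEM_BYTES     = 48 * 1024    # assumed shared memory per SM
REG_FILE_BYTES  = 128 * 1024   # assumed register file per SM
ECC_BITS_PER_32 = 7            # SECDED width for a 32-bit word

def protectable_fraction(shmem_used_bytes):
    """Fraction of register words whose ECC fits in unused shared memory."""
    leftover_bits = (SHMEM_BYTES - shmem_used_bytes) * 8
    words_covered = leftover_bits // ECC_BITS_PER_32
    total_words = REG_FILE_BYTES // 4
    return min(1.0, words_covered / total_words)

# A kernel using only 8 KB of shared memory leaves room to protect the
# whole register file.
print(f"{protectable_fraction(8 * 1024):.0%} of registers protectable")
```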
{"title":"Cost-effective soft-error protection for SRAM-based structures in GPGPUs","authors":"Jingweijia Tan, Zhi Li, Xin Fu","doi":"10.1145/2482767.2482804","DOIUrl":"https://doi.org/10.1145/2482767.2482804","url":null,"abstract":"The general-purpose computing on graphics processing units (GPGPUs) are increasingly used to accelerate parallel applications. This makes reliability a growing concern in GPUs as they are originally designed for graphics processing with relaxed requirements for execution correctness. With CMOS processing technologies continuously scaling down to the nano-scale, on-chip soft error rate (SER) has been predicted to increase exponentially. GPGPUs with hundreds of cores integrated into a single chip are prone to manifest high SER. This paper aims to enhance the GPGPU reliability in light of soft errors. We leverage the GPGPU microarchitecture characteristics, and propose energy-efficient protection mechanisms for two typical SRAM-based structures (i.e. instruction buffer and registers) which suffer high susceptibility. We develop Similarity-AWare Protection (SAWP) scheme that leverages the instruction similarity to provide the near-full ECC protection to the instruction buffer with quite little area and power overhead. Based on the observation that shared memory usually exhibits low utilization, we propose SHAred memory to Register Protection (SHARP) scheme, it intelligently leverages shared memory to hold the ECCs of registers. Experimental results show that our techniques have the strong capability of substantially improving the structure vulnerability, and significantly reducing the power consumption compared to the full ECC protection mechanism.","PeriodicalId":430420,"journal":{"name":"ACM International Conference on Computing Frontiers","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114672034","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Trace construction using enhanced performance monitoring
M. Serrano. doi:10.1145/2482767.2482811

We present a hardware-assisted approach for constructing software traces in binary translation systems. The new approach leverages an enhanced performance monitoring unit (PMU) with a combination of hardware techniques: branch prediction information, branch trace collection, and a hardware signature representing the calling context. Overall, the combined approach significantly reduces the time and overhead of building traces compared to previous research that exploited a sampling PMU mechanism, while still capturing high-quality traces. The calling-context signature could also be used in other applications such as debugging, program understanding, security, and other optimizations.
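As an illustration of trace formation from branch records (the record format and trace-ending heuristics here are assumptions, not the paper's exact policy), a trace builder can stitch together the basic blocks between branches and close a trace at a backward taken branch or a size limit:

```python
MAX_BLOCKS = 8                 # assumed trace-length limit

def build_traces(branch_records, entry_pc=0x1000):
    """branch_records: iterable of (branch_pc, target_pc, taken)."""
    traces, trace, block_start = [], [], entry_pc
    for pc, target, taken in branch_records:
        trace.append((block_start, pc))      # basic block ending at branch
        block_start = target if taken else pc + 4
        if (taken and target <= pc) or len(trace) >= MAX_BLOCKS:
            traces.append(trace)             # backward taken branch: likely
            trace = []                       # a loop, so close the trace
    if trace:
        traces.append(trace)
    return traces

records = [(0x1010, 0x1040, True), (0x1060, 0x1010, True)]  # forward, back
print(build_traces(records))       # one trace containing two basic blocks
```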
{"title":"Trace construction using enhanced performance monitoring","authors":"M. Serrano","doi":"10.1145/2482767.2482811","DOIUrl":"https://doi.org/10.1145/2482767.2482811","url":null,"abstract":"We present a hardware assisted approach for constructing software traces in binary translation systems. The new approach leverages an enhanced performance monitoring hardware (PMU) with a combination of hardware techniques: branch prediction information, branch trace collection, and a hardware signature representing the calling context. Overall, the combined approach significantly reduces the time and overhead in building traces, while capturing high-quality traces. Our approach significantly reduces the time to build traces, compared to previous research which exploited a sampling PMU mechanism. The calling context signature could also be used in other applications such as debugging, program understanding, security and other optimizations.","PeriodicalId":430420,"journal":{"name":"ACM International Conference on Computing Frontiers","volume":"135 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116049961","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A shared-FPU architecture for ultra-low power MPSoCs
M. R. Kakoee, Igor Loi, L. Benini. doi:10.1145/2482767.2482772

In this work we propose a shared floating-point unit (FPU) architecture for ultra-low power (ULP) systems-on-chip operating at near-threshold voltage (NTV). Since high-performance FPUs are large and complex but their utilization is relatively low, adding one FPU per core in a ULP multicore is costly and power hungry. In our approach, we share a few FPUs among all the cores in the system. This increases the utilization of the FPUs, leading to an energy-efficient design. As part of our approach, we propose two different FPU allocation techniques: optimal and random. Experimental results demonstrate that, compared to a traditional private-FPU approach, our technique in a multicore system with 8 processors and 2 shared FPUs can increase performance/(area*power) by 5x for applications with 10% FP operations and by 2.5x for applications with 25% FP operations.
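A back-of-the-envelope model suggests why low FPU utilization makes sharing attractive. Assuming independent issue (a simplification that ignores multi-cycle FPU occupancy), contention among 8 cores for 2 FPUs at 10% FP density is rare:

```python
from math import comb

def p_contention(cores=8, fpus=2, p_fp=0.10):
    """Probability that more than `fpus` cores want an FPU in the same
    cycle, with each core issuing an FP op independently with prob p_fp."""
    return sum(comb(cores, k) * p_fp**k * (1 - p_fp)**(cores - k)
               for k in range(fpus + 1, cores + 1))

# Only ~3.8% of cycles see more requests than FPUs, while the area and
# power of 6 of the 8 private FPUs are saved.
print(f"{p_contention():.1%} of cycles see contention")
```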
{"title":"A shared-FPU architecture for ultra-low power MPSoCs","authors":"M. R. Kakoee, Igor Loi, L. Benini","doi":"10.1145/2482767.2482772","DOIUrl":"https://doi.org/10.1145/2482767.2482772","url":null,"abstract":"In this work we propose a shared floating point unit (FPU) architecture for ultra low power (ULP) system on chips operating at near threshold voltage (NTV). Since high-performance FP units (FPUs) are large and complex, but their utilization is relatively low, adding one FPU per each core in a ULP multicore is costly and power hungry. In our approach, we share a few FPUs among all the cores in the system. This increases the utilization of FPUs leading to an energy-efficient design. As a part of our approach, we propose two different FPU allocation techniques: optimal and random.\u0000 Experimental results demonstrate that compared to a traditional private-FPU approach, our technique in a multicore system with 8 processors and 2 shared FPUs can increase the performance/(area*power) by 5x for applications with 10% FP operations and by 2.5x for applications with 25% FP operations.","PeriodicalId":430420,"journal":{"name":"ACM International Conference on Computing Frontiers","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129802258","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}