Graph programming model: An efficient approach for sensor signal processing
Pub Date: 2012-09-01 | DOI: 10.1109/HPEC.2012.6408662
Steve Kirsch
The HPC community has struggled to find a parallel programming model or language that can efficiently expose algorithmic parallelism in a sequential program and automate the implementation of a highly efficient parallel program. A plethora of parallel programming languages have been developed, along with sophisticated compilers and runtimes, but none of these approaches has been successful enough to become a de facto standard. The Graph Programming Model has the capability and efficiency to become that ubiquitous standard for the signal processing domain.
{"title":"Graph programming model: An efficient approach for sensor signal processing","authors":"Steve Kirsch","doi":"10.1109/HPEC.2012.6408662","DOIUrl":"https://doi.org/10.1109/HPEC.2012.6408662","url":null,"abstract":"The HPC community has struggled to find a parallel programming model or language that can efficiently expose algorithmic parallelism in a sequential program and automate the implementation of a highly efficient parallel program. A plethora of parallel programming languages have been developed along with sophisticated compilers and runtimes, but none of these approaches have been successful enough to became a defacto standard. Graph Programming Model has the capability and efficiencies to become that ubiquitous standard for the signal processing domain.","PeriodicalId":193020,"journal":{"name":"2012 IEEE Conference on High Performance Extreme Computing","volume":"116 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125544233","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An update on SIPHER (Scalable Implementation of Primitives for Homomorphic EncRyption) — FPGA implementation using Simulink
Pub Date: 2012-09-01 | DOI: 10.1109/HPEC.2012.6408672
D. Cousins, K. Rohloff, Chris Peikert, R. Schantz
Accelerating the development of a practical Fully Homomorphic Encryption (FHE) scheme is the goal of the DARPA PROCEED program. For the past year, this program has focused on accelerating various aspects of the FHE concept toward practical implementation and use. FHE would be a game-changing technology, enabling secure, general computation on encrypted data, e.g., on untrusted off-site hardware. However, FHE will still require several orders of magnitude of improvement in computation before it is practical for widespread use. Recent theoretical breakthroughs demonstrated the existence of FHE schemes [1, 2], and to date much progress has been made in both algorithmic and implementation improvements. Specifically, our contribution to the PROCEED program has been the development of FPGA-based hardware primitives to accelerate computation on encrypted data using FHE based on lattice techniques [3]. Our project, SIPHER, has been using a state-of-the-art toolchain developed by MathWorks to implement VHDL code for FPGA circuits directly from Simulink models. Our baseline homomorphic encryption prototypes are developed directly in MATLAB, using the fixed-point toolbox to perform the required integer arithmetic. Constant improvements in algorithms require us to be able to implement them quickly in a high-level language such as MATLAB. We reported our initial results at HPEC 2011 [4]. In the past year, increases in algorithm complexity have introduced several new design requirements for our FPGA implementation. This report presents new Simulink primitives that had to be developed to meet these requirements.
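The abstract stops short of code, but the flavor of the arithmetic is easy to picture. Below is a minimal Python sketch (our illustration, not the SIPHER sources) of fixed-width, multi-limb modular addition, the kind of integer primitive a fixed-point prototype must express before it can be mapped onto FPGA adders; the 32-bit limb width and all names are assumptions.

```python
# Hypothetical sketch of fixed-width, multi-limb modular addition, the kind of
# primitive a fixed-point prototype of lattice-based FHE arithmetic must
# express. Not the SIPHER code; limb width and names are illustrative.

WORD_BITS = 32
MASK = (1 << WORD_BITS) - 1

def to_limbs(x: int, n: int) -> list[int]:
    """Split a non-negative integer into n little-endian fixed-width limbs."""
    return [(x >> (WORD_BITS * i)) & MASK for i in range(n)]

def from_limbs(limbs: list[int]) -> int:
    return sum(limb << (WORD_BITS * i) for i, limb in enumerate(limbs))

def add_mod(a: list[int], b: list[int], q: int) -> list[int]:
    """Limb-wise addition with carry propagation, then a conditional
    subtraction of the modulus; the dataflow maps onto adder/mux hardware."""
    n = len(a)
    out, carry = [], 0
    for i in range(n):
        s = a[i] + b[i] + carry
        out.append(s & MASK)
        carry = s >> WORD_BITS
    total = from_limbs(out) + (carry << (WORD_BITS * n))
    if total >= q:
        total -= q
    return to_limbs(total, n)
```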
{"title":"An update on SIPHER (Scalable Implementation of Primitives for Homomorphic EncRyption) — FPGA implementation using Simulink","authors":"D. Cousins, K. Rohloff, Chris Peikert, R. Schantz","doi":"10.1109/HPEC.2012.6408672","DOIUrl":"https://doi.org/10.1109/HPEC.2012.6408672","url":null,"abstract":"Accelerating the development of a practical Fully Homomorphic Encryption (FHE) scheme is the goal of the DARPA PROCEED program. For the past year, this program has had as its focus the acceleration of various aspects of the FHE concept toward practical implementation and use. FHE would be a game-changing technology to enable secure, general computation on encrypted data, e.g., on untrusted off-site hardware. However, FHE will still require several orders of magnitude improvement in computation before it will be practical for widespread use. Recent theoretical breakthroughs demonstrated the existence of FHE schemes [1, 2], and to date much progress has been made in both algorithmic and implementation improvements. Specifically our contribution to the Proceed program has been the development of FPGA based hardware primitives to accelerate the computation on encrypted data using FHE based on lattice techniques [3]. Our project, SIPHER, has been using a state of the art tool-chain developed by Mathworks to implement VHDL code for FPGA circuits directly from Simulink models. Our baseline Homomorphic Encryption prototypes are developed directly in Matlab using the fixed point toolbox to perform the required integer arithmetic. Constant improvements in algorithms require us to be able to quickly implement them in a high level language such as Matlab. We reported on our initial results at HPEC 2011 [4]. In the past year, increases in algorithm complexity have introduced several new design requirements for our FPGA implementation. This report presents new Simulink primitives that had to be developed to deal with these new requirements.","PeriodicalId":193020,"journal":{"name":"2012 IEEE Conference on High Performance Extreme Computing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114945728","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Scrubbing optimization via availability prediction (SOAP) for reconfigurable space computing
Pub Date: 2012-09-01 | DOI: 10.1109/HPEC.2012.6408673
Quinn Martin, A. George
Reconfigurable computing with FPGAs can be highly effective in terms of performance, adaptability, and power for accelerating space applications, but their configuration memory must be scrubbed to prevent the accumulation of single-event upsets. Many scrubbing techniques currently exist, each with different advantages, making it difficult for the system designer to choose the optimal scrubbing strategy for a given mission. This paper surveys the currently available scrubbing techniques and introduces the SOAP method for predicting system availability for various scrubbing strategies using Markov models. We then apply the method to compare hypothetical Virtex-5 and Virtex-6 systems for blind, CRC-32, and Frame ECC scrubbing strategies in LEO and HEO. We show that availability in excess of 5 nines can be obtained with modern, FPGA-based systems using scrubbing. Furthermore, we show the value of the SOAP method by observing that different scrubbing strategies are optimal for different types of missions.
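For a sense of how a Markov model turns upset and scrub rates into an availability number, here is a minimal two-state sketch in Python. It is not the SOAP tool: the generator matrix, the rates, and the single repair transition are illustrative assumptions.

```python
# Minimal sketch (not the SOAP tool): steady-state availability of a scrubbed
# FPGA modeled as a two-state continuous-time Markov chain.
import numpy as np

def steady_state_availability(upset_rate: float, scrub_rate: float) -> float:
    """States: 0 = operational, 1 = failed (awaiting scrub/repair).
    Solve pi @ Q = 0 with sum(pi) = 1 for the generator matrix Q."""
    Q = np.array([[-upset_rate, upset_rate],
                  [scrub_rate, -scrub_rate]])
    # Replace one balance equation with the normalization constraint.
    A = np.vstack([Q.T[:-1], np.ones(2)])
    b = np.array([0.0, 1.0])
    pi = np.linalg.solve(A, b)
    return pi[0]

# Illustrative rates: ~1 upset per day vs. a scrub that repairs in ~100 ms.
print(steady_state_availability(upset_rate=1 / 86400, scrub_rate=10.0))
```

For this two-state chain the solve reduces to the closed form mu / (lambda + mu); the value of a full Markov treatment, as in the paper, is that extra states can capture distinct scrubbing strategies and orbit-dependent upset rates.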
{"title":"Scrubbing optimization via availability prediction (SOAP) for reconfigurable space computing","authors":"Quinn Martin, A. George","doi":"10.1109/HPEC.2012.6408673","DOIUrl":"https://doi.org/10.1109/HPEC.2012.6408673","url":null,"abstract":"Reconfigurable computing with FPGAs can be highly effective in terms of performance, adaptability, and power for accelerating space applications, but their configuration memory must be scrubbed to prevent the accumulation of single-event upsets. Many scrubbing techniques currently exist, each with different advantages, making it difficult for the system designer to choose the optimal scrubbing strategy for a given mission. This paper surveys the currently available scrubbing techniques and introduces the SOAP method for predicting system availability for various scrubbing strategies using Markov models. We then apply the method to compare hypothetical Virtex-5 and Virtex-6 systems for blind, CRC-32, and Frame ECC scrubbing strategies in LEO and HEO. We show that availability in excess of 5 nines can be obtained with modern, FPGA-based systems using scrubbing. Furthermore, we show the value of the SOAP method by observing that different scrubbing strategies are optimal for different types of missions.","PeriodicalId":193020,"journal":{"name":"2012 IEEE Conference on High Performance Extreme Computing","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127752836","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Parallel search of k-nearest neighbors with synchronous operations
Pub Date: 2012-09-01 | DOI: 10.1109/HPEC.2012.6408667
N. Sismanis, N. Pitsianis, Xiaobai Sun
We present a new study of parallel algorithms for locating the k nearest neighbors (kNN) of each query in a high-dimensional (feature) space on a many-core processor or accelerator that favors synchronous operations, such as a graphics processing unit. Exploiting the intimate relationship between two primitive operations, select and sort, we introduce a cohort of truncated sort algorithms for parallel kNN search. The truncated bitonic sort (TBiS) in particular has desirable data locality, synchronous concurrency, and simple data and program structures. Its implementation on a graphics processing unit outperforms existing implementations of kNN search based on either sort or select operations. We provide algorithm analysis and experimental results.
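The truncation idea can be shown compactly: merging two sorted length-k lists with one compare-exchange stage of a bitonic merger keeps exactly the k smallest of the 2k candidates, so the full sort network can be cut short. Below is a minimal NumPy sketch of that top-k merge (our illustration of the idea, not the authors' GPU kernel).

```python
# Illustrative sketch of the truncated-sort idea behind TBiS: keep only the
# k best candidates, merging fixed-size sorted blocks so every step is a
# regular, data-parallel compare-exchange pattern.
import numpy as np

def knn_truncated_merge(dist: np.ndarray, k: int) -> np.ndarray:
    """Return the k smallest distances via block sort + truncated merge.
    dist: 1-D array of candidate distances for one query."""
    pad = (-len(dist)) % k
    d = np.concatenate([dist, np.full(pad, np.inf)])
    blocks = d.reshape(-1, k)
    best = np.sort(blocks[0])
    for blk in blocks[1:]:
        # One bitonic-merge stage truncated to length k: pair the ascending
        # 'best' with the descending block and keep element-wise minima; the
        # k survivors are exactly the k smallest of the 2k candidates.
        cand = np.sort(blk)[::-1]
        best = np.sort(np.minimum(best, cand))
    return best
```

On a GPU, each element-wise minimum and the short re-sort become synchronous compare-exchange networks, which is what gives this family its regular data access and concurrency.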
{"title":"Parallel search of k-nearest neighbors with synchronous operations","authors":"N. Sismanis, N. Pitsianis, Xiaobai Sun","doi":"10.1109/HPEC.2012.6408667","DOIUrl":"https://doi.org/10.1109/HPEC.2012.6408667","url":null,"abstract":"We present a new study of parallel algorithms for locating k-nearest neighbors (kNN) of each single query in a high dimensional (feature) space on a many-core processor or accelerator that favors synchronous operations, such as on a graphics processing unit. Exploiting the intimate relationships between two primitive operations, select and sort, we introduce a cohort of truncated sort algorithms for parallel kNN search. The truncated bitonic sort (TBiS) in particular has desirable data locality, synchronous concurrency and simple data and program structures. Its implementation on a graphics processing unit outperforms the other existing implementations for kNN search based on either sort or select operations. We provide algorithm analysis and experimental results.","PeriodicalId":193020,"journal":{"name":"2012 IEEE Conference on High Performance Extreme Computing","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114815318","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
STINGER: High performance data structure for streaming graphs
Pub Date: 2012-09-01 | DOI: 10.1109/HPEC.2012.6408680
David Ediger, R. McColl, E. J. Riedy, David A. Bader
The current research focus on "big data" problems highlights the scale and complexity of the analytics required and the high rate at which data may be changing. In this paper, we present our high-performance, scalable, and portable software, Spatio-Temporal Interaction Networks and Graphs Extensible Representation (STINGER), which includes a graph data structure that enables these applications. Key attributes of STINGER are fast insertions, deletions, and updates on semantic graphs with skewed degree distributions. We demonstrate a process of algorithmic and architectural optimizations that enable high performance on the Cray XMT family and Intel multicore servers. Our implementation of STINGER on the Cray XMT processes over 3 million updates per second on a scale-free graph with 537 million edges.
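As a rough picture of what fast streaming updates mean structurally, here is a toy Python sketch of a blocked adjacency list in the spirit of STINGER. The block size, field names, and single-threaded logic are simplifying assumptions; the released structure uses large cache-friendly blocks and fine-grained synchronization.

```python
# Toy sketch of the blocked-edge-list idea (not the released STINGER code):
# per-vertex chains of small blocks, so an insert or delete touches one block.
from dataclasses import dataclass, field

BLOCK_SIZE = 4  # illustrative; real blocks are sized for cache lines

@dataclass
class EdgeBlock:
    dests: list = field(default_factory=list)   # destination vertex ids
    next: "EdgeBlock | None" = None              # chain of blocks

class StreamingGraph:
    def __init__(self, nv: int):
        self.heads = [None] * nv

    def insert_edge(self, src: int, dst: int) -> None:
        free, blk = None, self.heads[src]
        while blk is not None:
            if dst in blk.dests:
                return                            # edge already present
            if free is None and len(blk.dests) < BLOCK_SIZE:
                free = blk                        # remember a hole to reuse
            blk = blk.next
        if free is not None:
            free.dests.append(dst)
        else:                                     # all blocks full: prepend one
            self.heads[src] = EdgeBlock(dests=[dst], next=self.heads[src])

    def delete_edge(self, src: int, dst: int) -> None:
        blk = self.heads[src]
        while blk is not None:
            if dst in blk.dests:
                blk.dests.remove(dst)             # leaves a hole for reuse
                return
            blk = blk.next

g = StreamingGraph(nv=8)
g.insert_edge(0, 3); g.insert_edge(0, 5); g.delete_edge(0, 3)
```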
{"title":"STINGER: High performance data structure for streaming graphs","authors":"David Ediger, R. McColl, E. J. Riedy, David A. Bader","doi":"10.1109/HPEC.2012.6408680","DOIUrl":"https://doi.org/10.1109/HPEC.2012.6408680","url":null,"abstract":"The current research focus on “big data” problems highlights the scale and complexity of analytics required and the high rate at which data may be changing. In this paper, we present our high performance, scalable and portable software, Spatio-Temporal Interaction Networks and Graphs Extensible Representation (STINGER), that includes a graph data structure that enables these applications. Key attributes of STINGER are fast insertions, deletions, and updates on semantic graphs with skewed degree distributions. We demonstrate a process of algorithmic and architectural optimizations that enable high performance on the Cray XMT family and Intel multicore servers. Our implementation of STINGER on the Cray XMT processes over 3 million updates per second on a scale-free graph with 537 million edges.","PeriodicalId":193020,"journal":{"name":"2012 IEEE Conference on High Performance Extreme Computing","volume":"78 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121930638","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Synthetic Aperture Radar on low power multi-core Digital Signal Processor
Pub Date: 2012-09-01 | DOI: 10.1109/HPEC.2012.6408665
Dan Wang, Murtaza Ali
Commercial off-the-shelf (COTS) components have recently gained popularity in Synthetic Aperture Radar (SAR) applications. The compute capabilities of these devices have advanced to a level where real-time processing of complex SAR algorithms has become feasible. In this paper, we focus on a low-power multi-core Digital Signal Processor (DSP) from Texas Instruments Inc. and evaluate its capability for SAR signal processing. The specific DSP studied here is the TMS320C6678, an eight-core device that provides a peak performance of 128 GFLOPS (single precision) at only 10 watts. We describe how the basic SAR operations can be implemented efficiently on such a device. Our results indicate that a baseline SAR range-Doppler algorithm takes around 0.25 seconds for a 16-megapixel (4K × 4K) image, achieving real-time performance.
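The range-Doppler baseline being timed is, at its core, two FFT-domain matched-filtering passes. The NumPy sketch below shows that schematic structure only; it is a textbook outline under simplifying assumptions, not TI's optimized DSP code, and range cell migration correction and filter construction are omitted.

```python
# Schematic range-Doppler chain: matched filtering in range, then in azimuth,
# both done as FFT -> multiply -> inverse FFT. Inputs are assumed already
# motion-compensated; references are precomputed frequency-domain filters.
import numpy as np

def range_doppler(raw: np.ndarray,
                  range_ref: np.ndarray,
                  azimuth_ref: np.ndarray) -> np.ndarray:
    """raw: 2-D complex array, rows = pulses (azimuth), cols = range samples.
    range_ref / azimuth_ref: frequency-domain matched-filter references."""
    # Range compression: FFT each pulse, multiply by the reference, IFFT.
    rc = np.fft.ifft(np.fft.fft(raw, axis=1) * range_ref, axis=1)
    # Azimuth compression in the range-Doppler domain (RCMC omitted here).
    ac = np.fft.ifft(np.fft.fft(rc, axis=0) * azimuth_ref[:, None], axis=0)
    return np.abs(ac)
```

The point of the sketch is that both compressions are FFT-multiply-IFFT patterns, which is why the workload maps naturally onto a DSP's FFT-friendly cores.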
A third generation many-core processor for secure embedded computing systems
Pub Date: 2012-09-01 | DOI: 10.1109/HPEC.2012.6408657
John Irza, Michael Doerr, Michael Solka
As compute-intensive products proliferate, there is an ever-growing need for security features to detect tampering, identify cloned or counterfeit hardware, and deter cybersecurity threats. This paper describes the security features of the third-generation 100-core HyperX™ processor, which addresses these needs. Programmable security barriers allow the processor to implement a red-black System-on-Chip solution. The implementation of Physically Unclonable Functions (PUFs), encryption/decryption engines, a secure boot controller, and anti-tamper features enables the engineer to realize a secure embedded computing solution in an ultra-low-power, many-core, C-programmable processor-memory network.
{"title":"A third generation many-core processor for secure embedded computing systems","authors":"John Irza, Michael Doerr, Michael Solka","doi":"10.1109/HPEC.2012.6408657","DOIUrl":"https://doi.org/10.1109/HPEC.2012.6408657","url":null,"abstract":"As compute-intensive products proliferate, there is an ever growing need to provide security features to detect tampering, identify cloned or counterfeit hardware, and deter cybersecurity threats. This paper describes the security features of the third generation 100-core HyperX™ processor which addresses these needs. Programmable security barriers allow the processor to implement a red-black System on Chip solution. The implementation of Physically Unclonable Functions (PUFs), encryption/decryption engines, a secure boot controller, and anti-tamper features enable the engineer to realize a secure embedded computing solution in an ultra-low power, many-core, C programmable processor-memory network.","PeriodicalId":193020,"journal":{"name":"2012 IEEE Conference on High Performance Extreme Computing","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133759362","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Accelerating fully homomorphic encryption using GPU
Pub Date: 2012-09-01 | DOI: 10.1109/HPEC.2012.6408660
Wei Wang, Yin Hu, Lianmu Chen, Xinming Huang, B. Sunar
As a major breakthrough, in 2009 Gentry introduced the first plausible construction of a fully homomorphic encryption (FHE) scheme. FHE allows the evaluation of arbitrary functions directly on encrypted data on untrusted servers. In 2010, Gentry and Halevi presented the first FHE implementation on an IBM x3500 server. However, this implementation remains impractical due to the high latency of encryption and recryption. The Gentry-Halevi (GH) FHE primitives utilize multi-million-bit modular multiplications and additions, which are time-consuming tasks for a general-purpose computer. In the GH-FHE implementation, the most computationally intensive arithmetic operation is modular multiplication. In this paper, the million-bit modular multiplication is computed in two steps. For large-number multiplication, Strassen's FFT-based algorithm is employed and accelerated on a graphics processing unit (GPU) through its massive parallelism. Subsequently, the Barrett modular reduction algorithm is applied to implement the modular reduction. As an experimental study, we implement the GH-FHE primitives for the small setting, with a dimension of 2048, on an NVIDIA C2050 GPU. The experimental results show speedup factors of 7.68, 7.4, and 6.59 for encryption, decryption, and recryption, respectively, when compared with the existing CPU implementation.
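The second step, Barrett reduction, replaces division by the modulus with shifts and multiplications against a precomputed constant. A small Python sketch of the standard algorithm follows; it is scalar and CPU-side for clarity, whereas the paper applies the same idea to multi-million-bit operands on the GPU.

```python
# Barrett modular reduction (standard algorithm, shown on small operands).

def barrett_setup(q: int) -> tuple[int, int]:
    """Precompute k = bit length of q and mu = floor(4^k / q)."""
    k = q.bit_length()
    mu = (1 << (2 * k)) // q
    return k, mu

def barrett_reduce(x: int, q: int, k: int, mu: int) -> int:
    """Reduce 0 <= x < q*q modulo q using only shifts, multiplies, subtracts."""
    t = (x >> (k - 1)) * mu >> (k + 1)   # estimate of floor(x / q)
    r = x - t * q                        # remainder estimate, off by < 2q
    while r >= q:                        # at most two corrective subtractions
        r -= q
    return r

q = (1 << 127) - 1
k, mu = barrett_setup(q)
a, b = 2**126 + 12345, 2**125 + 67890
assert barrett_reduce(a * b, q, k, mu) == (a * b) % q
```

Because the quotient estimate is off by at most two, the cost per reduction is two multiplications, two shifts, and a couple of subtractions, independent of how the multiplications themselves are performed (FFT-based, in the paper's setting).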
{"title":"Accelerating fully homomorphic encryption using GPU","authors":"Wei Wang, Yin Hu, Lianmu Chen, Xinming Huang, B. Sunar","doi":"10.1109/HPEC.2012.6408660","DOIUrl":"https://doi.org/10.1109/HPEC.2012.6408660","url":null,"abstract":"As a major breakthrough, in 2009 Gentry introduced the first plausible construction of a fully homomorphic encryption (FHE) scheme. FHE allows the evaluation of arbitrary functions directly on encrypted data on untwisted servers. In 2010, Gentry and Halevi presented the first FHE implementation on an IBM x3500 server. However, this implementation remains impractical due to the high latency of encryption and recryption. The Gentry-Halevi (GH) FHE primitives utilize multi-million-bit modular multiplications and additions which are time-consuming tasks for a general purpose computer. In the GH-FHE implementation, the most computationally intensive arithmetic operation is modular multiplication. In this paper, the million-bit modular multiplication is computed in two steps. For large number multiplication, Strassen's FFT based algorithm is employed and accelerated on a graphics processing unit (GPU) through its massive parallelism. Subsequently, Barrett modular reduction algorithm is applied to implement modular reduction. As an experimental study, we implement the GH-FHE primitives for the small setting with a dimension of 2048 on NVIDIA C2050 GPU. The experimental results show the speedup factors of 7.68, 7.4 and 6.59 for encryption, decryption and recrypt respectively, when compared with the existing CPU implementation.","PeriodicalId":193020,"journal":{"name":"2012 IEEE Conference on High Performance Extreme Computing","volume":"25 6","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114118080","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Driving big data with big compute
Pub Date: 2012-09-01 | DOI: 10.1109/HPEC.2012.6408678
C. Byun, W. Arcand, David Bestor, Bill Bergeron, M. Hubbell, J. Kepner, A. McCabe, P. Michaleas, J. Mullen, David O'Gwynn, Andrew Prout, A. Reuther, Antonio Rosa, Charles Yee
Big Data (as embodied by Hadoop clusters) and Big Compute (as embodied by MPI clusters) provide unique capabilities for storing and processing large volumes of data. Hadoop clusters make distributed computing readily accessible to the Java community, and MPI clusters provide high parallel efficiency for compute-intensive workloads. Bringing the big data and big compute communities together is an active area of research. The LLGrid team has developed and deployed a number of technologies that aim to provide the best of both worlds. LLGrid MapReduce allows the map/reduce parallel programming model to be used quickly and efficiently in any language on any compute cluster. D4M (Dynamic Distributed Dimensional Data Model) provides a high-level distributed-array interface to the Apache Accumulo database. The accessibility of these technologies is assessed by measuring the effort required to use them, which is typically a few lines of code. Performance is assessed by measuring the insert rate into the Accumulo database. Using these tools, a database insert rate of 4M inserts/second has been achieved on an 8-node cluster.
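The "few lines of code" claim is about the associative-array abstraction: user code states (row, column, value) triples and the library batches them into the database. The Python sketch below illustrates that pattern conceptually; it is not the D4M API (which is MATLAB-based), and every name in it is hypothetical.

```python
# Conceptual sketch of the associative-array insert pattern: data as sparse
# (row, col, val) triples, flushed in batches to keep the insert path wide.
class AssocArray:
    def __init__(self, flush_fn, batch_size: int = 10_000):
        self.triples = []           # pending (row, col, val) inserts
        self.flush_fn = flush_fn    # e.g., a wrapper around a DB batch writer
        self.batch_size = batch_size

    def put(self, row: str, col: str, val: str) -> None:
        self.triples.append((row, col, val))
        if len(self.triples) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        if self.triples:
            self.flush_fn(self.triples)   # one round trip per batch
            self.triples = []

# Usage: record word occurrences per document, then bulk-insert.
store = []
aa = AssocArray(flush_fn=store.extend)
for doc, word in [("doc1", "graph"), ("doc1", "hpc"), ("doc2", "graph")]:
    aa.put(row=doc, col=f"word|{word}", val="1")
aa.flush()
```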
{"title":"Driving big data with big compute","authors":"C. Byun, W. Arcand, David Bestor, Bill Bergeron, M. Hubbell, J. Kepner, A. McCabe, P. Michaleas, J. Mullen, David O'Gwynn, Andrew Prout, A. Reuther, Antonio Rosa, Charles Yee","doi":"10.1109/HPEC.2012.6408678","DOIUrl":"https://doi.org/10.1109/HPEC.2012.6408678","url":null,"abstract":"Big Data (as embodied by Hadoop clusters) and Big Compute (as embodied by MPI clusters) provide unique capabilities for storing and processing large volumes of data. Hadoop clusters make distributed computing readily accessible to the Java community and MPI clusters provide high parallel efficiency for compute intensive workloads. Bringing the big data and big compute communities together is an active area of research. The LLGrid team has developed and deployed a number of technologies that aim to provide the best of both worlds. LLGrid MapReduce allows the map/reduce parallel programming model to be used quickly and efficiently in any language on any compute cluster. D4M (Dynamic Distributed Dimensional Data Model) provided a high level distributed arrays interface to the Apache Accumulo database. The accessibility of these technologies is assessed by measuring the effort to use these tools and is typically a few lines of code. The performance is assessed by measuring the insert rate into the Accumulo database. Using these tools a database insert rate of 4M inserts/second has been achieved on an 8 node cluster.","PeriodicalId":193020,"journal":{"name":"2012 IEEE Conference on High Performance Extreme Computing","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122225700","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fast functional simulation with a dynamic language
Pub Date: 2012-09-01 | DOI: 10.1109/HPEC.2012.6408664
C. Steele, J. Bonn
Simulation of large computational systems-on-a-chip (SoCs) is increasingly challenging as the number and complexity of components are scaled up. With the ubiquity of programmable components in computational SoCs, fast functional instruction-set simulation (ISS) is increasingly important. Much ISS has been done with straightforward functional models of a non-pipelined fetch-decode-execute iteration written in a low-to-mid-level C-family static language, delivering mid-level efficiency. Some ISS programs, such as QEMU, perform dynamic binary translation to bring software emulation to more usable speeds. This relatively complex methodology has not been widely adopted for system modeling. We demonstrate a fresh approach to ISS that achieves performance comparable to a fast dynamic binary translator by exploiting recent advances in just-in-time (JIT) compilers for dynamic languages, such as JavaScript and Lua, together with a specific programming idiom inspired by pipelined processor design. We believe that this approach is relatively accessible to system designers familiar with C-family functional simulator coding styles, and may be generally useful for fast modeling of complex SoC components.
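One JIT-friendly idiom of this kind can be suggested in a few lines: decode each instruction word once into a closure, cache the closures, and let a tracing JIT specialize the hot dispatch loop. The toy Python sketch below is our illustration under a made-up 16-bit ISA, not the authors' code (which targeted JavaScript and Lua engines).

```python
# Predecode-to-closures idiom: decode once, dispatch through cached closures,
# so a tracing JIT (LuaJIT, a JS engine, or PyPy for this sketch) can
# specialize the loop. Toy 3-register ISA: op/dst/src1/src2 in 4-bit fields.
def decode(word: int):
    op, rd, ra, rb = ((word >> 12) & 0xF, (word >> 8) & 0xF,
                      (word >> 4) & 0xF, word & 0xF)
    if op == 0:   # ADD rd, ra, rb
        return lambda regs: regs.__setitem__(rd, (regs[ra] + regs[rb]) & 0xFFFF)
    if op == 1:   # SUB rd, ra, rb
        return lambda regs: regs.__setitem__(rd, (regs[ra] - regs[rb]) & 0xFFFF)
    raise ValueError(f"unknown opcode {op}")

def run(program: list[int], regs: list[int], steps: int) -> list[int]:
    decoded = [decode(w) for w in program]   # decode once, not every iteration
    pc = 0
    for _ in range(steps):
        decoded[pc](regs)                    # dispatch via the cached closure
        pc = (pc + 1) % len(decoded)
    return regs

# ADD r0, r1, r2 ; SUB r3, r0, r2  with r1 = 40, r2 = 2
regs = [0] * 16
regs[1], regs[2] = 40, 2
print(run([0x0012, 0x1302], regs, steps=2)[:4])   # -> [42, 40, 2, 40]
```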
{"title":"Fast functional simulation with a dynamic language","authors":"C. Steele, J. Bonn","doi":"10.1109/HPEC.2012.6408664","DOIUrl":"https://doi.org/10.1109/HPEC.2012.6408664","url":null,"abstract":"Simulation of large computational systems-on-a-chip (SoCs) is increasing challenging as the number and complexity of components is scaled up. With the ubiquity of programmable components in computational SoCs, fast functional instruction-set simulation (ISS) is increasingly important. Much ISS has been done with straightforward functional models of a non-pipelined fetch-decode-execute iteration written in a low-to-mid-level C-family static language, delivering mid-level efficiency. Some ISS programs, such as QEMU, perform dynamic binary translation to allow software emulation to reach more usable speeds. This relatively complex methodology has not been widely adopted for system modeling. We demonstrate a fresh approach to ISS that achieves performance comparable to a fast dynamic binary translator by exploiting recent advances in just-in-time (JIT) compilers for dynamic languages, such as JavaScript and Lua, together with a specific programming idiom inspired by pipelined processor design. We believe that this approach is relatively accessible to system designers familiar with C-family functional simulator coding styles, and may be generally useful for fast modeling of complex SoC components.","PeriodicalId":193020,"journal":{"name":"2012 IEEE Conference on High Performance Extreme Computing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130079906","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}