Power modeling and estimation have become defining aspects of modern embedded system design. In this context, DDR SDRAM memories contribute significantly to system power consumption but lack accurate and generic power models. The most popular SDRAM power model, provided by Micron, is found to be inaccurate or insufficient for several reasons. First, it does not consider the power consumed when transitioning to the power-down and self-refresh modes. Second, it employs the minimal timing constraints between commands from the SDRAM datasheets rather than the actual durations between the commands as issued by an SDRAM memory controller. Finally, without adaptations, it can only be applied to a memory controller that employs a close-page policy and accesses a single SDRAM bank at a time. These critical issues affect the accuracy and validity of the power values the model reports, and resolving them forms the focus of our work. In this paper, we propose an improved SDRAM power model that estimates power consumption during transitions to power-saving states, employs an SDRAM command trace to obtain the actual timings between the issued commands, and is generic: it applies to all DDRx SDRAMs, all memory controller policies, and all degrees of bank interleaving. We quantitatively compare the proposed model against the unmodified Micron model on power and energy for DDR3-800. We show differences of up to 60% in energy savings for the precharge power-down mode for a power-down duration of 14 cycles, and up to 80% for the self-refresh mode for a self-refresh duration of 560 cycles.
"Improved Power Modeling of DDR SDRAMs", by K. Chandrasekar, B. Akesson, K. Goossens. doi: 10.1109/DSD.2011.17. 2011 14th Euromicro Conference on Digital System Design, 2011-08-31.
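The key correction this abstract describes, charging each memory state for the actual inter-command duration taken from a command trace rather than the datasheet-minimum timing, can be sketched as follows. All states, currents, and voltages below are hypothetical placeholders, not the paper's model or any datasheet's values:

```python
# Illustrative sketch (not the authors' model): integrating SDRAM energy over
# the *actual* spacing between commands in a trace. Currents and voltage are
# invented placeholders, not datasheet IDD values.

VDD = 1.5          # supply voltage in volts (placeholder)
T_CK = 2.5e-9      # clock period for DDR3-800 (400 MHz memory clock)

# Placeholder per-state currents in amps; a real model uses datasheet IDDs.
IDD = {"ACT": 0.095, "PRE": 0.065, "PDN": 0.012, "SREF": 0.006}

def trace_energy(trace):
    """trace: list of (cycle, state) pairs, 'state' being the DRAM state
    entered at that cycle. Energy is accumulated over the actual gap between
    consecutive commands, not the datasheet-minimum constraint."""
    energy = 0.0
    for (cyc, state), (next_cyc, _) in zip(trace, trace[1:]):
        cycles_in_state = next_cyc - cyc          # actual duration from the trace
        energy += IDD[state] * VDD * cycles_in_state * T_CK
    return energy

# Toy trace: activate at cycle 0, precharge at 20, power-down from 30 to 44.
toy = [(0, "ACT"), (20, "PRE"), (30, "PDN"), (44, "PRE")]
energy_joules = trace_energy(toy)   # ~1.02e-8 J with the placeholder numbers
```

A controller that keeps commands further apart than the datasheet minimum thus yields a different (usually larger) energy figure than a model that assumes minimal spacing.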
We present a new low-level interfacing scheme for connecting custom accelerators to processors that tolerates the latencies that usually occur when accessing hardware accelerators from software. The scheme is based on the Self-adaptive Virtual Processor (SVP) architecture and on the micro-threading concept. Our presentation is based on a sample implementation of the SVP architecture in an extended version of the LEON3 processor called UTLEON3. The SVP concurrency paradigm makes data dependencies explicit in the dynamic tree of threads, which enables a system to execute threads concurrently on different processor cores. Previous SVP work presumed homogeneous cores, for example an array of micro-threaded processors sharing a dynamic pool of microthreads. In this work, we propose a heterogeneous system of general-purpose processor cores and custom hardware accelerators. The accelerators dynamically pick families of threads from the pool and execute them concurrently. We introduce the Thread Mapping Table (TMT), a hardware unit that couples the software and hardware implementations of the user computations. The TMT allows the coupling scheme to be realized seamlessly, without modifying the processor ISA. The advantages of the described scheme are that it decouples application programming from the specific details of the hardware accelerator architecture (a software create and a hardware create behave identically) and that it eliminates the influence of hardware access latencies. Our simulation and FPGA implementation results show that the additional hardware access latencies in the processor are tolerated by the SVP architecture.
"Microthreading as a Novel Method for Close Coupling of Custom Hardware Accelerators to SVP Processors", by J. Sykora, Leos Kafka, M. Danek, L. Kohout. doi: 10.1109/DSD.2011.73.
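At a very abstract level, the TMT's role described in the abstract, routing a thread-family create either to a registered hardware accelerator or to the software implementation with identical behaviour in both cases, can be modelled as a lookup. This is a purely illustrative software analogy of a hardware unit; every name below is invented:

```python
# Abstract software model of a thread-mapping-table style dispatch (purely
# illustrative; the real TMT is a hardware unit). A family create is routed
# to a hardware accelerator when one is registered for that function,
# otherwise it falls back to the software implementation.

class ThreadMappingTable:
    def __init__(self):
        self._hw = {}   # function name -> accelerator callable (hypothetical)

    def register_accelerator(self, name, accel):
        self._hw[name] = accel

    def create_family(self, name, sw_impl, args):
        """Identical behaviour whether the family runs in HW or SW."""
        impl = self._hw.get(name, sw_impl)
        return [impl(a) for a in args]   # one 'thread' per argument

tmt = ThreadMappingTable()
sw_square = lambda x: x * x
tmt.register_accelerator("square", lambda x: x * x)  # stand-in for a HW unit
result = tmt.create_family("square", sw_square, [1, 2, 3])
```

The point of the analogy: because dispatch happens behind the create, the caller's code is the same regardless of where the family executes, which mirrors the abstract's "identical behaviour of a software create and hardware create".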
Pierre-Henri Horrein, Christine Hennebert, F. Pétrot
This paper presents the Flexible Radio Kernel (FRK), a configuration and execution management environment for hybrid hardware/software flexible radio platforms. The aim of FRK is to manage platform reconfiguration for multi-mode, multi-standard operation at different levels of abstraction. A high-level framework is described to manage multiple MAC layers and to enable MAC cooperation algorithms for cognitive radio. A low-level environment is also available to manage platform reconfiguration for radio operations. The radio can be implemented using hardware or software elements. The configuration state is hidden from the high-level layers, offering pseudo-concurrency (time-sharing) properties. This study presents a global view of FRK, with details on specific parts of the environment, and includes a practical study with an algorithmic description.
"An Environment for (re)configuration and Execution Management of Flexible Radio Platforms". doi: 10.1109/DSD.2011.47.
Esterel is a synchronous language suited to describing reactive embedded systems. It combines fine-grained parallelism with precise timing control for the execution of threads. Because of this, Esterel programs have typically been compiled into sequential code in software implementations, as tight synchronization between a large number of threads cannot be efficiently managed by an operating system (OS). This has enabled concurrent Esterel programs to be executed directly on single-core processors. Recently, however, multi-core processors have been increasingly used to achieve better performance in embedded applications, and the conventional approach of generating sequential code from Esterel programs cannot take advantage of them. We overcome this limitation by compiling Esterel into a limited number of thread partitions (up to the number of available cores), avoiding the large overheads of implementing each Esterel thread separately within a conventional multithreading scheme. These partitions are then distributed onto separate cores using a static load-balancing heuristic. The Esterel threads within a partition may then be dynamically scheduled with or without an OS. To evaluate the viability of this approach, we present experimental results comparing the execution of a set of benchmarks using one to four cores on an Intel Core 2 Quad with Linux, and one to two cores on a Xilinx MicroBlaze without any OS. We have performed extensive benchmarking over large Esterel programs to show that the throughput achieved by parallel execution of Esterel is benchmark dependent.
"Compiling Esterel for Multi-core Execution", by S. Yuan, L. Yoong, P. Roop. doi: 10.1109/DSD.2011.97.
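The abstract does not specify its static load-balancing heuristic. A common choice for distributing thread partitions of estimated cost onto cores is greedy longest-processing-time-first assignment, sketched here under that assumption (not necessarily the paper's exact heuristic):

```python
import heapq

def partition_threads(costs, num_cores):
    """Greedy LPT heuristic: place each thread, in order of decreasing
    estimated cost, onto the currently least-loaded partition.
    'costs' maps thread name -> estimated execution cost."""
    heap = [(0, core, []) for core in range(num_cores)]  # (load, core_id, members)
    heapq.heapify(heap)
    for name, cost in sorted(costs.items(), key=lambda kv: -kv[1]):
        load, core, members = heapq.heappop(heap)        # least-loaded core
        members.append(name)
        heapq.heappush(heap, (load + cost, core, members))
    return {core: members for _, core, members in heap}

# Four threads with invented costs, balanced across two cores.
parts = partition_threads({"a": 5, "b": 3, "c": 3, "d": 1}, 2)
```

With these costs, both cores end up with a total load of 6, which is the kind of balance the heuristic aims for before the per-partition threads are scheduled dynamically.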
In this paper, a hardware implementation of the ZUC stream cipher is presented. ZUC is a stream cipher that forms the heart of the 3GPP confidentiality algorithm 128-EEA3 and the 3GPP integrity algorithm 128-EIA3, offering reliable security services in Long Term Evolution (LTE) networks. A detailed hardware implementation is presented in order to reach satisfactory performance in LTE systems. The design was coded in VHDL and implemented on a Xilinx Virtex-5 FPGA. Experimental results in terms of performance and hardware resources are presented.
"An FPGA Implementation of the ZUC Stream Cipher", by P. Kitsos, N. Sklavos, A. Skodras. doi: 10.1109/DSD.2011.109.
This paper introduces four path-based DVFS algorithms for embedded multimedia applications. The application model consists of multiprocessor-scheduled task graphs and input-class probability distributions. The design constraints are a soft delay deadline and a minimum completion ratio. The algorithms target four scenarios corresponding to systems with different DVFS and quality-of-service monitoring capabilities. In the first scenario, all inputs must be processed on time; the voltage/frequency level can be adjusted at the beginning of application execution and must be the same for all processors. In the second scenario, the voltage/frequency level of a processor can be adjusted individually when a task execution starts, and inputs of particular classes can be discarded without processing. In the third scenario, a processor's voltage can be adjusted to the class of the input received. The fourth scenario aims at compensating for online changes in the input-class distribution on a system with the same capabilities as the third scenario.
"Path-Based Dynamic Voltage and Frequency Scaling Algorithms for Multiprocessor Embedded Applications with Soft Delay Deadlines", by A. Tokarnia, Pedro C. F. Pepe, Leandro D. Pagotto. doi: 10.1109/DSD.2011.18.
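In the first scenario above, choosing one voltage/frequency level for all processors at the start of execution amounts to picking the lowest available level that still finishes the worst-case path within the deadline. A minimal sketch of that selection (the level values in the example are invented, and this is the generic idea rather than the paper's algorithm):

```python
def pick_level(levels, worst_case_cycles, deadline_s):
    """Return the lowest frequency level (Hz) that still completes
    'worst_case_cycles' of work within 'deadline_s' seconds, or None
    if even the fastest level misses the deadline."""
    for f in sorted(levels):                 # try slowest (lowest-power) first
        if worst_case_cycles / f <= deadline_s:
            return f
    return None

# Hypothetical levels: 200/400/800 MHz; 3M worst-case cycles; 10 ms deadline.
chosen = pick_level([200e6, 400e6, 800e6], 3e6, 0.01)
```

Running slower than the chosen level would miss the soft deadline on the worst-case path, while running faster wastes energy, which is why the lowest feasible level is selected.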
Alessandro Barenghi, G. Bertoni, F. D. Santis, F. Melzani
Side-channel attacks are a realistic threat to the security of real-world implementations of cryptographic algorithms. When evaluating the resistance of designs against power analysis attacks, power values obtained from circuit simulations in early design phases offer two distinct advantages: first, they provide fast feedback loops to designers; second, they can reduce the number of redesigns. This work investigates the accuracy of design-time power estimation tools in assessing the security level of a device against differential power attacks.
"On the Efficiency of Design Time Evaluation of the Resistance to Power Attacks". doi: 10.1109/DSD.2011.103.
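Assessing resistance to differential power attacks from simulated power values typically means mounting the attack on those values, for instance by correlating a Hamming-weight leakage model against the simulated traces. A minimal correlation-based sketch of that idea (not the paper's evaluation flow), using the PRESENT cipher's 4-bit S-box as a stand-in target:

```python
# Minimal correlation power analysis (CPA) sketch over simulated power values.
# Illustrative only; real evaluations attack a full cipher with many traces.
PRESENT_SBOX = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
                0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]  # PRESENT 4-bit S-box

def hamming_weight(v):
    return bin(v).count("1")

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def best_key_guess(plaintexts, power):
    """Rank 4-bit key guesses by |correlation| between the predicted leakage
    (Hamming weight of the S-box output) and the simulated power values."""
    scores = {}
    for k in range(16):
        model = [hamming_weight(PRESENT_SBOX[p ^ k]) for p in plaintexts]
        scores[k] = abs(pearson(model, power))
    return max(scores, key=scores.get)
```

If simulated power values are accurate enough that this correlation singles out the true key, the design-time evaluation predicts vulnerability; if the estimation blurs the leakage, the evaluation may be optimistic, which is the accuracy question the paper examines.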
M. Maghsoudloo, H. Zarandi, S. Pour-Mozafari, N. Khoshavi
This paper presents a software-based error detection technique that monitors the control flow of programs in multithreaded architectures. The technique is based on two key ideas: 1) modifying the structure of the traditional control-flow graphs used by control-flow checking methods so that they can be applied to multi-core and multi-threaded architectures, which increases the applicability of control-flow error detectors in current architectures; and 2) adjusting the locations of the additional checking assertions in a given program to increase the ability to detect possible control-flow errors while significantly reducing overheads. The experimental results, taking into account both detection coverage and overheads, demonstrate that on average about 94% of control-flow errors can be detected by the proposed technique, making it more efficient than previous work.
"Soft Error Detection Technique in Multi-threaded Architectures Using Control-Flow Monitoring". doi: 10.1109/DSD.2011.104.
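Control-flow checking methods of the kind this abstract builds on commonly assign an identifier to each basic block and insert assertions that verify every runtime transition is a legal edge of the control-flow graph. A generic sketch of that mechanism (not the paper's exact assertion scheme; the CFG below is invented):

```python
# Generic signature-based control-flow checking sketch. Each basic block has
# an ID; the inserted check verifies each runtime transition is a legal edge.
CFG_EDGES = {                     # hypothetical CFG: block -> legal successors
    "entry": {"loop"},
    "loop": {"loop", "exit"},
    "exit": set(),
}

class ControlFlowChecker:
    def __init__(self, edges, start):
        self.edges = edges
        self.current = start

    def enter_block(self, block):
        """The checking assertion inserted at the top of each basic block:
        a transition not present in the CFG signals a control-flow error."""
        if block not in self.edges.get(self.current, set()):
            raise RuntimeError(f"control-flow error: {self.current} -> {block}")
        self.current = block

checker = ControlFlowChecker(CFG_EDGES, "entry")
checker.enter_block("loop")   # legal edge
checker.enter_block("loop")   # legal back-edge
checker.enter_block("exit")   # legal edge; an illegal jump would raise
```

Placing such checks well, covering likely error paths without instrumenting every block, is the overhead-versus-coverage trade-off the paper's second idea addresses.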
Traditional trace-driven memory system simulation is a very time-consuming process, and the advent of multicores only exacerbates the problem. We propose a framework for accelerating trace-driven multicore cache simulations by utilizing the capabilities of modern many-core GPUs. A straightforward way towards this goal is to rely on the inherent parallelism in cache simulations: communicating cache sets can be simulated independently of, and concurrently with, other sets. Based on this, we map collections of communicating cache sets (each belonging to a different target cache) onto the same GPU block so that the simulated coherence traffic is local traffic within the GPU. However, this is not enough, due to the great imbalance in activity across cache sets: some sets receive a flurry of activity while others do not. Our solution is to load-balance the simulated sets (based on activity) onto the computing element (host CPU or GPU) that can manage them most efficiently. We propose a heterogeneous computing approach in which the host CPU simulates the few but most active sets, while the GPU is responsible for the many more but less active sets.
Our experimental findings using the SPLASH-2 suite demonstrate that our cache simulator, based on CPU-GPU cooperation, achieves on average a 5.88x speedup over alternative implementations running on the CPU, and these speedups scale well with the size of the simulated system.
"Multicore Cache Simulations Using Heterogeneous Computing on General Purpose and Graphics Processors", by G. Keramidas, Nikolaos Strikos, S. Kaxiras. doi: 10.1109/DSD.2011.38.
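The CPU/GPU split the abstract describes, a few highly active sets on the host CPU and the long tail of less active sets on the GPU, can be sketched as a simple activity-ranked partition. The cut-off fraction below is an invented tuning knob, not a value from the paper:

```python
def split_sets_by_activity(access_counts, cpu_fraction=0.05):
    """Assign the most active cache sets to the host CPU and the long tail of
    less active sets to the GPU. 'access_counts' maps set index -> number of
    accesses observed in the trace; 'cpu_fraction' is a hypothetical knob."""
    ranked = sorted(access_counts, key=access_counts.get, reverse=True)
    n_cpu = max(1, int(len(ranked) * cpu_fraction))
    return set(ranked[:n_cpu]), set(ranked[n_cpu:])   # (cpu_sets, gpu_sets)

# 100 sets whose activity happens to equal their index (toy data):
cpu_sets, gpu_sets = split_sets_by_activity({i: i for i in range(100)})
```

The design intuition: the CPU handles the hot sets where serialization and control flow dominate, while the GPU amortizes its launch and memory costs over the many cool sets it can simulate in bulk.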
Abdul Naeem, A. Jantsch, Xiaowen Chen, Zhonghai Lu
This paper studies the realization and scalability of release and protected release consistency models in Network-on-Chip (NoC) based Distributed Shared Memory (DSM) multi-core systems. The protected release consistency (PRC) model is proposed as an extension of the release consistency (RC) model and provides further relaxation of the shared memory operations. The realization schemes of the RC and PRC models use a transaction counter in each node of the NoC-based multi-core (McNoC) system. Further, we study the scalability of the RC and PRC models and evaluate their performance on the McNoC platform. A configurable NoC-based platform with a 2D mesh topology and a deflection routing algorithm is used in the tests. We experiment with both synthetic and application workloads. The performance of the RC and PRC models is compared using sequential consistency (SC) as the baseline. The experiments show that the average code execution time for the PRC model in an 8x8 network (64 cores) is reduced by 30.5% over SC and by 6.5% over the RC model. The average data execution time in the 8x8 network for the PRC model is reduced by almost 37% over SC and by 8.8% over RC. The area increase of PRC over RC is about 880 gates in the network interface (1.7%).
"Realization and Scalability of Release and Protected Release Consistency Models in NoC Based Systems". doi: 10.1109/DSD.2011.11.
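The per-node transaction counter mentioned in the abstract can be modelled abstractly: issued shared-memory transactions increment the counter, completions decrement it, and a release is only allowed to proceed once the counter reaches zero, i.e. once all prior accesses are globally performed. A minimal sketch of that mechanism (illustrative, not the paper's hardware):

```python
# Abstract model of a per-node transaction counter for release consistency.
# Issues increment, completions decrement; a release synchronization is only
# permitted when no shared-memory transaction is still outstanding.

class TransactionCounter:
    def __init__(self):
        self.outstanding = 0

    def issue(self):
        self.outstanding += 1

    def complete(self):
        assert self.outstanding > 0, "completion without matching issue"
        self.outstanding -= 1

    def may_release(self):
        return self.outstanding == 0

tc = TransactionCounter()
tc.issue(); tc.issue()            # two shared writes in flight
tc.complete()
blocked = not tc.may_release()    # release must wait: one still outstanding
tc.complete()
ready = tc.may_release()          # all acknowledged: release can be issued
```

The relaxation PRC adds over RC then amounts to being less conservative about which operations must be counted before a release, while SC would effectively stall on every access.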