Power modeling and estimation have become defining aspects of modern embedded system design. In this context, DDR SDRAM memories contribute significantly to system power consumption but lack accurate and generic power models. The most popular SDRAM power model, provided by Micron, is found to be inaccurate or insufficient for several reasons. First, it does not consider the power consumed when transitioning to the power-down and self-refresh modes. Second, it employs the minimal timing constraints between commands from the SDRAM datasheets rather than the actual durations between the commands as issued by an SDRAM memory controller. Finally, without adaptations, it can only be applied to a memory controller that employs a close-page policy and accesses a single SDRAM bank at a time. These critical issues affect the accuracy and validity of the power values the model reports, and resolving them forms the focus of our work. In this paper, we propose an improved SDRAM power model that estimates power consumption during transitions to power-saving states, employs an SDRAM command trace to obtain the actual timings between the issued commands, and is generic: it applies to all DDRx SDRAMs, all memory controller policies, and all degrees of bank interleaving. We quantitatively compare the proposed model against the unmodified Micron model on power and energy for DDR3-800. We show differences of up to 60% in energy savings for the precharge power-down mode for a power-down duration of 14 cycles, and up to 80% for the self-refresh mode for a self-refresh duration of 560 cycles.
"Improved Power Modeling of DDR SDRAMs", by K. Chandrasekar, B. Akesson, K. Goossens. doi: 10.1109/DSD.2011.17. 2011 14th Euromicro Conference on Digital System Design, 2011-08-31.
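The key correction this abstract describes, charging each memory state for the actual inter-command duration taken from a command trace rather than the datasheet-minimum timing, can be sketched as follows. All states, currents, and voltages below are hypothetical placeholders, not the paper's model or any datasheet's values:

```python
# Illustrative sketch (not the authors' model): integrating SDRAM energy over
# the *actual* spacing between commands in a trace. Currents and voltage are
# invented placeholders, not datasheet IDD values.

VDD = 1.5          # supply voltage in volts (placeholder)
T_CK = 2.5e-9      # clock period for DDR3-800 (400 MHz memory clock)

# Placeholder per-state currents in amps; a real model uses datasheet IDDs.
IDD = {"ACT": 0.095, "PRE": 0.065, "PDN": 0.012, "SREF": 0.006}

def trace_energy(trace):
    """trace: list of (cycle, state) pairs, 'state' being the DRAM state
    entered at that cycle. Energy is accumulated over the actual gap between
    consecutive commands, not the datasheet-minimum constraint."""
    energy = 0.0
    for (cyc, state), (next_cyc, _) in zip(trace, trace[1:]):
        cycles_in_state = next_cyc - cyc          # actual duration from the trace
        energy += IDD[state] * VDD * cycles_in_state * T_CK
    return energy

# Toy trace: activate at cycle 0, precharge at 20, power-down from 30 to 44.
toy = [(0, "ACT"), (20, "PRE"), (30, "PDN"), (44, "PRE")]
energy_joules = trace_energy(toy)   # ~1.02e-8 J with the placeholder numbers
```

A controller that keeps commands further apart than the datasheet minimum thus yields a different (usually larger) energy figure than a model that assumes minimal spacing.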
We present a new low-level interfacing scheme for connecting custom accelerators to processors that tolerates the latencies that usually occur when accessing hardware accelerators from software. The scheme is based on the Self-adaptive Virtual Processor (SVP) architecture and on the micro-threading concept. Our presentation is based on a sample implementation of the SVP architecture in an extended version of the LEON3 processor called UTLEON3. The SVP concurrency paradigm makes data dependencies explicit in the dynamic tree of threads, which enables a system to execute threads concurrently on different processor cores. Previous SVP work presumed homogeneous cores, for example an array of micro-threaded processors sharing a dynamic pool of microthreads. In this work, we propose a heterogeneous system of general-purpose processor cores and custom hardware accelerators. The accelerators dynamically pick families of threads from the pool and execute them concurrently. We introduce the Thread Mapping Table (TMT), a hardware unit that couples the software and hardware implementations of the user computations. The TMT allows the coupling scheme to be realized seamlessly, without modifying the processor ISA. The advantages of the described scheme are that it decouples application programming from the specific details of the hardware accelerator architecture (a software create and a hardware create behave identically) and that it eliminates the influence of hardware access latencies. Our simulation and FPGA implementation results show that the additional hardware access latencies in the processor are tolerated by the SVP architecture.
"Microthreading as a Novel Method for Close Coupling of Custom Hardware Accelerators to SVP Processors", by J. Sykora, Leos Kafka, M. Danek, L. Kohout. doi: 10.1109/DSD.2011.73.
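At a very abstract level, the TMT's role described in the abstract, routing a thread-family create either to a registered hardware accelerator or to the software implementation with identical behaviour in both cases, can be modelled as a lookup. This is a purely illustrative software analogy of a hardware unit; every name below is invented:

```python
# Abstract software model of a thread-mapping-table style dispatch (purely
# illustrative; the real TMT is a hardware unit). A family create is routed
# to a hardware accelerator when one is registered for that function,
# otherwise it falls back to the software implementation.

class ThreadMappingTable:
    def __init__(self):
        self._hw = {}   # function name -> accelerator callable (hypothetical)

    def register_accelerator(self, name, accel):
        self._hw[name] = accel

    def create_family(self, name, sw_impl, args):
        """Identical behaviour whether the family runs in HW or SW."""
        impl = self._hw.get(name, sw_impl)
        return [impl(a) for a in args]   # one 'thread' per argument

tmt = ThreadMappingTable()
sw_square = lambda x: x * x
tmt.register_accelerator("square", lambda x: x * x)  # stand-in for a HW unit
result = tmt.create_family("square", sw_square, [1, 2, 3])
```

The point of the analogy: because dispatch happens behind the create, the caller's code is the same regardless of where the family executes, which mirrors the abstract's "identical behaviour of a software create and hardware create".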
Pierre-Henri Horrein, Christine Hennebert, F. Pétrot
This paper presents the Flexible Radio Kernel (FRK), a configuration and execution management environment for hybrid hardware/software flexible radio platforms. The aim of FRK is to manage platform reconfiguration for multi-mode, multi-standard operation at different levels of abstraction. A high-level framework is described to manage multiple MAC layers and to enable MAC cooperation algorithms for cognitive radio. A low-level environment is also available to manage platform reconfiguration for radio operations. The radio can be implemented using hardware or software elements. The configuration state is hidden from the high-level layers, offering pseudo-concurrency (time-sharing) properties. This study presents a global view of FRK, with details on specific parts of the environment, and includes a practical study with an algorithmic description.
"An Environment for (re)configuration and Execution Management of Flexible Radio Platforms". doi: 10.1109/DSD.2011.47.
Esterel is a synchronous language suited to describing reactive embedded systems. It combines fine-grained parallelism with precise timing control for the execution of threads. Because of this, Esterel programs have typically been compiled into sequential code in software implementations, as tight synchronization between a large number of threads cannot be efficiently managed by an operating system (OS). This has enabled concurrent Esterel programs to be executed directly on single-core processors. Recently, however, multi-core processors have been increasingly used to achieve better performance in embedded applications, and the conventional approach of generating sequential code from Esterel programs cannot take advantage of them. We overcome this limitation by compiling Esterel into a limited number of thread partitions (up to the number of available cores), avoiding the large overheads of implementing each Esterel thread separately within a conventional multithreading scheme. These partitions are then distributed onto separate cores using a static load-balancing heuristic. The Esterel threads within a partition may then be dynamically scheduled with or without an OS. To evaluate the viability of this approach, we present experimental results comparing the execution of a set of benchmarks using one to four cores on an Intel Core 2 Quad with Linux, and one to two cores on a Xilinx MicroBlaze without any OS. We have performed extensive benchmarking over large Esterel programs to show that the throughput achieved by parallel execution of Esterel is benchmark dependent.
"Compiling Esterel for Multi-core Execution", by S. Yuan, L. Yoong, P. Roop. doi: 10.1109/DSD.2011.97.
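The abstract does not specify its static load-balancing heuristic. A common choice for distributing thread partitions of estimated cost onto cores is greedy longest-processing-time-first assignment, sketched here under that assumption (not necessarily the paper's exact heuristic):

```python
import heapq

def partition_threads(costs, num_cores):
    """Greedy LPT heuristic: place each thread, in order of decreasing
    estimated cost, onto the currently least-loaded partition.
    'costs' maps thread name -> estimated execution cost."""
    heap = [(0, core, []) for core in range(num_cores)]  # (load, core_id, members)
    heapq.heapify(heap)
    for name, cost in sorted(costs.items(), key=lambda kv: -kv[1]):
        load, core, members = heapq.heappop(heap)        # least-loaded core
        members.append(name)
        heapq.heappush(heap, (load + cost, core, members))
    return {core: members for _, core, members in heap}

# Four threads with invented costs, balanced across two cores.
parts = partition_threads({"a": 5, "b": 3, "c": 3, "d": 1}, 2)
```

With these costs, both cores end up with a total load of 6, which is the kind of balance the heuristic aims for before the per-partition threads are scheduled dynamically.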
In this paper, a hardware implementation of the ZUC stream cipher is presented. ZUC is a stream cipher that forms the heart of the 3GPP confidentiality algorithm 128-EEA3 and the 3GPP integrity algorithm 128-EIA3, offering reliable security services in Long Term Evolution (LTE) networks. A detailed hardware implementation is presented in order to reach satisfactory performance in LTE systems. The design was coded in VHDL and implemented on a Xilinx Virtex-5 FPGA. Experimental results in terms of performance and hardware resources are presented.
"An FPGA Implementation of the ZUC Stream Cipher", by P. Kitsos, N. Sklavos, A. Skodras. doi: 10.1109/DSD.2011.109.
This paper introduces four path-based DVFS algorithms for embedded multimedia applications. The application model consists of multiprocessor-scheduled task graphs and input-class probability distributions. The design constraints are a soft delay deadline and a minimum completion ratio. The algorithms target four scenarios corresponding to systems with different DVFS and quality-of-service monitoring capabilities. In the first scenario, all inputs must be processed on time; the voltage/frequency level can be adjusted at the beginning of application execution and must be the same for all processors. In the second scenario, the voltage/frequency level of a processor can be adjusted individually when a task execution starts, and inputs of particular classes can be discarded without processing. In the third scenario, a processor's voltage can be adjusted to the class of the input received. The fourth scenario aims at compensating for online changes in the input-class distribution on a system with the same capabilities as the third scenario.
"Path-Based Dynamic Voltage and Frequency Scaling Algorithms for Multiprocessor Embedded Applications with Soft Delay Deadlines", by A. Tokarnia, Pedro C. F. Pepe, Leandro D. Pagotto. doi: 10.1109/DSD.2011.18.
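In the first scenario above, choosing one voltage/frequency level for all processors at the start of execution amounts to picking the lowest available level that still finishes the worst-case path within the deadline. A minimal sketch of that selection (the level values in the example are invented, and this is the generic idea rather than the paper's algorithm):

```python
def pick_level(levels, worst_case_cycles, deadline_s):
    """Return the lowest frequency level (Hz) that still completes
    'worst_case_cycles' of work within 'deadline_s' seconds, or None
    if even the fastest level misses the deadline."""
    for f in sorted(levels):                 # try slowest (lowest-power) first
        if worst_case_cycles / f <= deadline_s:
            return f
    return None

# Hypothetical levels: 200/400/800 MHz; 3M worst-case cycles; 10 ms deadline.
chosen = pick_level([200e6, 400e6, 800e6], 3e6, 0.01)
```

Running slower than the chosen level would miss the soft deadline on the worst-case path, while running faster wastes energy, which is why the lowest feasible level is selected.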
Alessandro Barenghi, G. Bertoni, F. D. Santis, F. Melzani
Side-channel attacks are a realistic threat to the security of real-world implementations of cryptographic algorithms. When evaluating the resistance of designs against power analysis attacks, power values obtained from circuit simulations in early design phases offer two distinct advantages: first, they provide fast feedback loops to designers; second, they can reduce the number of redesigns. This work investigates the accuracy of design-time power estimation tools in assessing the security level of a device against differential power attacks.
"On the Efficiency of Design Time Evaluation of the Resistance to Power Attacks". doi: 10.1109/DSD.2011.103.
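Assessing resistance to differential power attacks from simulated power values typically means mounting the attack on those values, for instance by correlating a Hamming-weight leakage model against the simulated traces. A minimal correlation-based sketch of that idea (not the paper's evaluation flow), using the PRESENT cipher's 4-bit S-box as a stand-in target:

```python
# Minimal correlation power analysis (CPA) sketch over simulated power values.
# Illustrative only; real evaluations attack a full cipher with many traces.
PRESENT_SBOX = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
                0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]  # PRESENT 4-bit S-box

def hamming_weight(v):
    return bin(v).count("1")

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def best_key_guess(plaintexts, power):
    """Rank 4-bit key guesses by |correlation| between the predicted leakage
    (Hamming weight of the S-box output) and the simulated power values."""
    scores = {}
    for k in range(16):
        model = [hamming_weight(PRESENT_SBOX[p ^ k]) for p in plaintexts]
        scores[k] = abs(pearson(model, power))
    return max(scores, key=scores.get)
```

If simulated power values are accurate enough that this correlation singles out the true key, the design-time evaluation predicts vulnerability; if the estimation blurs the leakage, the evaluation may be optimistic, which is the accuracy question the paper examines.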
M. Maghsoudloo, H. Zarandi, S. Pour-Mozafari, N. Khoshavi
This paper presents a software-based error detection technique that monitors the control flow of programs in multithreaded architectures. The technique is based on two key ideas: 1) modifying the structure of the traditional control-flow graphs used by control-flow checking methods so that they can be applied to multi-core and multi-threaded architectures, which increases the applicability of control-flow error detectors in current architectures; and 2) adjusting the locations of the additional checking assertions in a given program to increase the ability to detect possible control-flow errors while significantly reducing overheads. The experimental results, taking into account both detection coverage and overheads, demonstrate that on average about 94% of control-flow errors can be detected by the proposed technique, making it more efficient than previous work.
"Soft Error Detection Technique in Multi-threaded Architectures Using Control-Flow Monitoring". doi: 10.1109/DSD.2011.104.
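Control-flow checking methods of the kind this abstract builds on commonly assign an identifier to each basic block and insert assertions that verify every runtime transition is a legal edge of the control-flow graph. A generic sketch of that mechanism (not the paper's exact assertion scheme; the CFG below is invented):

```python
# Generic signature-based control-flow checking sketch. Each basic block has
# an ID; the inserted check verifies each runtime transition is a legal edge.
CFG_EDGES = {                     # hypothetical CFG: block -> legal successors
    "entry": {"loop"},
    "loop": {"loop", "exit"},
    "exit": set(),
}

class ControlFlowChecker:
    def __init__(self, edges, start):
        self.edges = edges
        self.current = start

    def enter_block(self, block):
        """The checking assertion inserted at the top of each basic block:
        a transition not present in the CFG signals a control-flow error."""
        if block not in self.edges.get(self.current, set()):
            raise RuntimeError(f"control-flow error: {self.current} -> {block}")
        self.current = block

checker = ControlFlowChecker(CFG_EDGES, "entry")
checker.enter_block("loop")   # legal edge
checker.enter_block("loop")   # legal back-edge
checker.enter_block("exit")   # legal edge; an illegal jump would raise
```

Placing such checks well, covering likely error paths without instrumenting every block, is the overhead-versus-coverage trade-off the paper's second idea addresses.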
Traditional trace-driven memory system simulation is a very time-consuming process, and the advent of multicores only exacerbates the problem. We propose a framework for accelerating trace-driven multicore cache simulations by utilizing the capabilities of modern many-core GPUs. A straightforward way towards this goal is to rely on the inherent parallelism in cache simulations: communicating cache sets can be simulated independently of, and concurrently with, other sets. Based on this, we map collections of communicating cache sets (each belonging to a different target cache) onto the same GPU block so that the simulated coherence traffic is local traffic within the GPU. However, this is not enough, due to the great imbalance in activity across cache sets: some sets receive a flurry of activity while others do not. Our solution is to load-balance the simulated sets (based on activity) onto the computing element (host CPU or GPU) that can manage them most efficiently. We propose a heterogeneous computing approach in which the host CPU simulates the few but most active sets, while the GPU is responsible for the many more but less active sets.
Our experimental findings using the SPLASH-2 suite demonstrate that our cache simulator, based on CPU-GPU cooperation, achieves on average a 5.88x speedup over alternative implementations running on the CPU, and these speedups scale well with the size of the simulated system.
"Multicore Cache Simulations Using Heterogeneous Computing on General Purpose and Graphics Processors", by G. Keramidas, Nikolaos Strikos, S. Kaxiras. doi: 10.1109/DSD.2011.38.
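The CPU/GPU split the abstract describes, a few highly active sets on the host CPU and the long tail of less active sets on the GPU, can be sketched as a simple activity-ranked partition. The cut-off fraction below is an invented tuning knob, not a value from the paper:

```python
def split_sets_by_activity(access_counts, cpu_fraction=0.05):
    """Assign the most active cache sets to the host CPU and the long tail of
    less active sets to the GPU. 'access_counts' maps set index -> number of
    accesses observed in the trace; 'cpu_fraction' is a hypothetical knob."""
    ranked = sorted(access_counts, key=access_counts.get, reverse=True)
    n_cpu = max(1, int(len(ranked) * cpu_fraction))
    return set(ranked[:n_cpu]), set(ranked[n_cpu:])   # (cpu_sets, gpu_sets)

# 100 sets whose activity happens to equal their index (toy data):
cpu_sets, gpu_sets = split_sets_by_activity({i: i for i in range(100)})
```

The design intuition: the CPU handles the hot sets where serialization and control flow dominate, while the GPU amortizes its launch and memory costs over the many cool sets it can simulate in bulk.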
Abdul Naeem, A. Jantsch, Xiaowen Chen, Zhonghai Lu
This paper studies the realization and scalability of release and protected release consistency models in Network-on-Chip (NoC) based Distributed Shared Memory (DSM) multi-core systems. The protected release consistency (PRC) model is proposed as an extension of the release consistency (RC) model and provides further relaxation of the shared memory operations. The realization schemes of the RC and PRC models use a transaction counter in each node of the NoC-based multi-core (McNoC) system. Further, we study the scalability of the RC and PRC models and evaluate their performance on the McNoC platform. A configurable NoC-based platform with a 2D mesh topology and a deflection routing algorithm is used in the tests. We experiment with both synthetic and application workloads. The performance of the RC and PRC models is compared using sequential consistency (SC) as the baseline. The experiments show that the average code execution time for the PRC model in an 8x8 network (64 cores) is reduced by 30.5% over SC and by 6.5% over the RC model. The average data execution time in the 8x8 network for the PRC model is reduced by almost 37% over SC and by 8.8% over RC. The area increase of PRC over RC is about 880 gates in the network interface (1.7%).
"Realization and Scalability of Release and Protected Release Consistency Models in NoC Based Systems". doi: 10.1109/DSD.2011.11.
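The per-node transaction counter mentioned in the abstract can be modelled abstractly: issued shared-memory transactions increment the counter, completions decrement it, and a release is only allowed to proceed once the counter reaches zero, i.e. once all prior accesses are globally performed. A minimal sketch of that mechanism (illustrative, not the paper's hardware):

```python
# Abstract model of a per-node transaction counter for release consistency.
# Issues increment, completions decrement; a release synchronization is only
# permitted when no shared-memory transaction is still outstanding.

class TransactionCounter:
    def __init__(self):
        self.outstanding = 0

    def issue(self):
        self.outstanding += 1

    def complete(self):
        assert self.outstanding > 0, "completion without matching issue"
        self.outstanding -= 1

    def may_release(self):
        return self.outstanding == 0

tc = TransactionCounter()
tc.issue(); tc.issue()            # two shared writes in flight
tc.complete()
blocked = not tc.may_release()    # release must wait: one still outstanding
tc.complete()
ready = tc.may_release()          # all acknowledged: release can be issued
```

The relaxation PRC adds over RC then amounts to being less conservative about which operations must be counted before a release, while SC would effectively stall on every access.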