
2012 International Conference on Embedded Computer Systems (SAMOS): Latest Papers

A tightly-coupled multi-core cluster with shared-memory HW accelerators
Pub Date : 2012-07-16 DOI: 10.1109/SAMOS.2012.6404162
M. Dehyadegari, A. Marongiu, M. R. Kakoee, L. Benini, S. Mohammadi, N. Yazdani
Tightly coupling hardware accelerators with processors is a well-known approach for boosting the efficiency of MPSoC platforms. The key design challenges in this area are: (i) streamlining accelerator definition and instantiation and (ii) developing architectural templates and run-time techniques for minimizing the cost of communication and synchronization between processors and accelerators. In this paper we present an architecture featuring tightly-coupled processors and hardware processing units (HWPU), with zero-copy communication. We also provide a simple programming API, which simplifies the process of offloading jobs to HWPUs.
Citations: 15
Adaptive dynamic memory allocators by estimating application workloads
Pub Date : 2012-07-16 DOI: 10.1109/SAMOS.2012.6404182
Ioannis Koutras, A. Bartzas, D. Soudris
Modern applications are becoming more complex and dynamic, and they try to utilize the resources available on computing platforms efficiently. Efficient memory utilization is a key challenge for application developers, especially since memory is a scarce resource and often becomes a system bottleneck. Developers can therefore resort to dynamic memory management, i.e., dynamic memory allocation and de-allocation, to utilize memory resources efficiently. This paper presents a high-performance adaptive memory allocator. A memory allocator helps applications manage more efficiently the memory space that the operating system grants them. In our approach, we tune the memory allocator at runtime by predicting the amount of memory to be requested. Experimental results were obtained using applications from the PARSEC benchmark suite and dmmlib, a memory allocator framework written in C. The results show that adaptive memory allocators can mitigate fragmentation problems, leading to more efficient memory usage.
Citations: 3
Memory bounds for the distributed execution of a hierarchical Synchronous Data-Flow graph
Pub Date : 2012-07-16 DOI: 10.1109/SAMOS.2012.6404170
K. Desnos, M. Pelcat, J. Nezan, Slaheddine Aridhi
This paper presents an application analysis technique to define the bounds of the shared memory requirements of a Multiprocessor System-on-Chip (MPSoC) in the early stages of development. This technique is part of a rapid prototyping process and is based on the analysis of a hierarchical Synchronous Data-Flow (SDF) graph description of the system application. The analysis does not require any knowledge of the system architecture, the mapping, or the scheduling of the system application tasks. The initial step of the method consists of applying a set of transformations to the SDF graph so as to reveal its memory characteristics. These transformations produce a weighted graph that represents the different memory objects of the application as well as the memory allocation constraints due to their relationships. The memory bounds are then derived from this weighted graph using analogous graph theory problems, in particular the Maximum-Weight Clique (MWC) problem. State-of-the-art algorithms to solve these problems are presented, and a heuristic approach is proposed to provide a near-optimal solution of the MWC problem. A performance evaluation of the heuristic approach, based on hierarchical SDF graphs of realistic applications, is presented. This evaluation shows the efficiency of the proposed heuristic approach in finding near-optimal solutions.
Citations: 8
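The memory-bounds entry above derives its bounds from a weighted graph via the Maximum-Weight Clique problem. The following is a minimal sketch of a generic greedy MWC heuristic, not the heuristic proposed in the paper. It assumes that vertices are memory objects weighted by size and that an edge joins two objects that cannot share memory, so the weight of any clique is a lower bound on the required memory while the sum of all weights is a trivial upper bound. The function name, the graph, and the sizes are hypothetical.

```python
def greedy_max_weight_clique(weights, edges):
    """Return (clique, total_weight) found by a simple greedy heuristic."""
    # Build an adjacency map from the undirected edge list.
    adj = {v: set() for v in weights}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    best_clique, best_weight = [], 0
    # Try every vertex as a seed, then grow the clique greedily.
    for seed in weights:
        clique = [seed]
        candidates = set(adj[seed])
        while candidates:
            # Pick the heaviest vertex adjacent to every clique member.
            v = max(candidates, key=lambda x: (weights[x], x))
            clique.append(v)
            # Keep only vertices that are also adjacent to the new member.
            candidates &= adj[v]
        total = sum(weights[v] for v in clique)
        if total > best_weight:
            best_clique, best_weight = sorted(clique), total
    return best_clique, best_weight


# Hypothetical memory objects (sizes in bytes) and exclusion edges.
weights = {"a": 40, "b": 30, "c": 20, "d": 50}
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
clique, lower_bound = greedy_max_weight_clique(weights, edges)
upper_bound = sum(weights.values())
print(clique, lower_bound, upper_bound)
```

A greedy search like this is fast but gives no optimality guarantee; on larger graphs it may return a clique lighter than the true maximum, in which case the lower bound is merely looser, not wrong.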
OSR-Lite: Fast and deadlock-free NoC reconfiguration framework
Pub Date : 2012-07-16 DOI: 10.1109/SAMOS.2012.6404161
Alessandro Strano, D. Bertozzi, F. Triviño, J. L. Sánchez, F. J. Alfaro, J. Flich
Current and future on-chip networks will feature an enhanced degree of reconfigurability. Power management and virtualization strategies, as well as the need to survive the progressive onset of wear-out faults, are the root causes of this. In all these cases, a non-intrusive and efficient reconfiguration method is needed to allow the network to function uninterruptedly over the course of the reconfiguration process while remaining deadlock-free. This paper is inspired by the overlapped static reconfiguration (OSR) protocol developed for off-chip networks. However, in its native form its implementation in NoCs is out of reach. Therefore, we provide a careful engineering of the NoC switch architecture and of the system-level infrastructure to support a cost-effective, complete and transparent reconfiguration process. Performance during the reconfiguration process is not affected, and implementation costs (critical path and area overhead) are shown to be fully affordable for a constrained system. Fewer than 250 cycles are needed for the reconfiguration process of an 8×8 2D mesh, with marginal impact on system performance.
Citations: 35
Instrumentation techniques for cyber-physical systems using the targeted dataflow interchange format
Pub Date : 2012-07-16 DOI: 10.1109/SAMOS.2012.6404201
S. Bhattacharyya
Dataflow methods are widely used for the design and implementation of signal processing functionality in cyber-physical systems. Systematically integrating instrumentation methods into dataflow-based design processes is important to facilitate trade-off assessment and tuning of alternative scheduling strategies. Such instrumentation-driven scheduler development is particularly important for dynamically structured signal processing computations. In this talk, we will present methods developed in the targeted dataflow interchange format (TDIF) environment for rigorously supporting instrumentation throughout the scheduling process. TDIF, a software tool for design and implementation of signal processing systems, emphasizes processes for retargetable design, analysis, and optimization of hardware and software. We will present an internal representation used within TDIF called the instrumented generalized schedule tree (IGST), and demonstrate the utility of IGSTs for constructing, representing, and manipulating dataflow graph schedules in connection with diverse forms of instrumentation functionality, including monitoring associated with memory usage, performance and energy consumption. This talk is based on joint work with Chung-Ching Shen, Hsiang-Huang Wu, Nimish Sane, and William Plishker.
Citations: 0
Automatic FPGA synthesis of memory intensive C-based kernels
Pub Date : 2012-07-16 DOI: 10.1109/SAMOS.2012.6404167
Matthew Milford, J. McAllister
Realising high performance image and signal processing applications on modern FPGAs presents a challenging implementation problem due to the large data frames streaming through these systems. Specifically, to meet the high bandwidth and data storage demands of these applications, complex hierarchical memory architectures must be manually specified at the Register Transfer Level (RTL). Automated approaches which convert high-level operation descriptions, for instance in the form of C programs, to an FPGA architecture are unable to automatically realise such architectures. This paper presents a solution to this problem: a compiler that automatically derives such memory architectures from a C program. By transforming the input C program to a unique dataflow modelling dialect, known as Valved Dataflow (VDF), a mapping and synthesis approach developed for this dialect can be exploited to automatically create high performance image and video processing architectures. Memory intensive C kernels for Motion Estimation (CIF Frames at 30 fps), Matrix Multiplication (128×128 @ 500 iter/sec) and Sobel Edge Detection (720p @ 30 fps), which are unrealisable by current state-of-the-art C-based synthesis tools, are automatically derived from a C description of the algorithm.
Citations: 3
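To give a sense of the data rates behind the three kernels named in the FPGA synthesis entry above, the short calculation below converts the quoted figures into per-second workloads. It is an illustrative sketch only: it assumes the standard frame sizes of 352×288 for CIF and 1280×720 for 720p, one processed pixel per frame position, and a naive matrix multiplication costing n³ multiply-accumulate operations. None of these assumptions is stated in the abstract, and the variable names are hypothetical.

```python
def pixels_per_second(width, height, fps):
    """Pixel throughput required to sustain a given frame rate."""
    return width * height * fps


# Assumed standard frame sizes: CIF is 352x288, 720p is 1280x720.
cif_rate = pixels_per_second(352, 288, 30)
hd720_rate = pixels_per_second(1280, 720, 30)

# Assumed naive n x n matrix multiplication: n**3 multiply-accumulates.
n = 128
matmul_macs_per_second = n ** 3 * 500

print(f"Motion estimation (CIF @ 30 fps): {cif_rate:,} pixels/s")
print(f"Sobel (720p @ 30 fps):            {hd720_rate:,} pixels/s")
print(f"Matrix multiply (128x128 @ 500):  {matmul_macs_per_second:,} MAC/s")
```

Under these assumptions the Sobel kernel has to consume roughly nine times as many pixels per second as the CIF motion estimation kernel, which is consistent with the abstract's emphasis on memory bandwidth.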
Counting stream registers: An efficient and effective snoop filter architecture
Pub Date : 2012-07-16 DOI: 10.1109/SAMOS.2012.6404165
Aanjhan Ranganathan, Ali Galip Bayrak, Theo Kluter, P. Brisk, E. Charbon, P. Ienne
We introduce a counting stream register snoop filter, which improves the performance of existing snoop filters based on stream registers. Over time, this class of snoop filters loses the ability to filter memory addresses that have been loaded, and then evicted, from the caches that are filtered; they include cache wrap detection logic, which resets the filter whenever the contents of the cache have been completely replaced. The counting stream register snoop filter introduced here replaces the cache wrap detection logic with a direct-mapped update unit and augments each stream register with a counter, which acts as a validity checker; loading new data into the cache increments the counter, while replacements, snoopy invalidations, and evictions decrement it. A cache wrap is detected whenever the counter reaches zero. Our experimental evaluation shows that the counting stream register snoop filter architecture improves the accuracy compared to traditional stream register snoop filters for representative embedded workloads.
Citations: 4
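The counter mechanism described in the snoop filter entry above can be modelled in a few lines. The sketch below is a simplified software model of a single stream register, not the hardware architecture from the paper. It assumes the conventional base/mask scheme, in which mask bits are cleared wherever tracked addresses differ and a snoop is filtered when it mismatches the base on any bit that is still set in the mask. The class and method names are hypothetical, and the direct-mapped update unit that selects which register to decrement is omitted.

```python
class CountingStreamRegister:
    """Simplified single stream register with a validity counter."""

    def __init__(self, addr_bits=32):
        self.full_mask = (1 << addr_bits) - 1
        self.reset()

    def reset(self):
        # An empty register tracks no lines and filters every snoop.
        self.base = 0
        self.mask = self.full_mask
        self.count = 0

    def on_load(self, line_addr):
        if self.count == 0:
            # First tracked line: remember it exactly.
            self.base = line_addr
            self.mask = self.full_mask
        else:
            # Clear mask bits where the new address differs from the base.
            self.mask &= ~(self.base ^ line_addr) & self.full_mask
        self.count += 1

    def on_remove(self):
        # Called for a replacement, snoop invalidation, or eviction.
        if self.count == 0:
            return
        self.count -= 1
        if self.count == 0:
            # No tracked line remains, so the register can start afresh.
            self.reset()

    def may_be_cached(self, snoop_addr):
        """False means the snoop can safely be filtered out."""
        if self.count == 0:
            return False
        return ((snoop_addr ^ self.base) & self.mask) == 0


reg = CountingStreamRegister()
reg.on_load(0x1000)
reg.on_load(0x1040)
hit_tracked = reg.may_be_cached(0x1040)
hit_other = reg.may_be_cached(0x2000)
reg.on_remove()
reg.on_remove()
hit_after_reset = reg.may_be_cached(0x1000)
print(hit_tracked, hit_other, hit_after_reset)
```

The point of the counter is visible in the last call: once every tracked line has left the cache, the register regains full filtering ability without waiting for a complete cache wrap.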
TaBit: A framework for task graph to bitstream generation
Pub Date : 2012-07-16 DOI: 10.1109/SAMOS.2012.6404175
A. Bonetto, Andrea Cazzaniga, Gianluca Durelli, C. Pilato, D. Sciuto, M. Santambrogio
Nowadays, the usual embedded design flow makes use of different tools to perform the several steps required to obtain a running application on a reconfigurable platform. The integration among these tools is usually not fully automated, forcing the developer to take care of these intermediate steps. This process slows down the application development and delays its time to market. In this work we present the TaBit framework, intended for FPGA designers, that is able to guide the developer from the original partitioned application, described as a task graph, down to its deployment onto the target device. Moreover, this framework defines a set of interfaces that allows the developer to integrate custom scheduling and floor placing techniques. The framework takes care of the integration between the different steps and, based on the designer inputs, it is able to automatically generate a software Scheduling Engine and the set of bitstreams ready to be deployed onto the target device.
Citations: 1
An FPGA-based prototyping method for verification, characterization and optimization of LDPC error correction systems
Pub Date : 2012-07-01 DOI: 10.1109/SAMOS.2012.6404188
P. Sakellariou, I. Tsatsaragkos, N. Kanistras, A. Mahdi, Vassilis Paliouras
This paper introduces a methodology for prototyping forward error correction (FEC) architectures, oriented to system verification and characterization. A complete design flow is described, which satisfies the requirements for error-free hardware design and acceleration of FEC simulations. FPGA devices give the designer the ability to observe rare events, owing to the tremendous speed-up of FEC operations. A Matlab-based system assists in investigating the impact of very rare decoding failure events on FEC system performance and in finding solutions aimed at parameter optimization and BER performance improvement of LDPC codes in the error floor region. Furthermore, the development of an embedded system, which offers remote access to the system under test and automation of the verification process, is explored. The prototyping approach presented here exploits the high processing speed of FPGA-based emulators and the observability and usability of software-based models.
Citations: 7
BADCO: Behavioral Application-Dependent Superscalar Core model
Pub Date : 2012 DOI: 10.1109/SAMOS.2012.6404158
Ricardo A. Velásquez, P. Michaud, André Seznec
Microarchitecture research and development rely heavily on simulators. The ideal simulator would be simple and easy to develop, precise, accurate, and very fast. No such simulator exists, so microarchitects use different kinds of simulators at different stages of processor development, depending on whether accuracy or simulation speed matters more. Approximate microarchitecture models, which trade accuracy for simulation speed, are very useful for research and design space exploration, provided the loss of accuracy remains acceptable. Behavioral superscalar core modeling is one way to make this trade when the focus of the study is not the core itself. In this approach, a superscalar core is viewed as a black box that emits requests to the uncore at certain times. A behavioral core model can be connected to a cycle-accurate uncore model. Behavioral core models are built from cycle-accurate simulations; once the time to build the model is amortized, substantial simulation speedups can be obtained. We describe and study a new method for defining behavioral models of modern superscalar cores. The proposed Behavioral Application-Dependent Superscalar Core model, BADCO, predicts the execution time of a thread running on a superscalar core with an error of less than 10% in most cases. We show that BADCO is qualitatively accurate: it predicts how performance changes when the uncore is changed. The simulation speedups we obtained are typically between one and two orders of magnitude.
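The black-box view described in the abstract can be sketched in a few lines. This is a deliberately crude illustration, not BADCO's actual method: it assumes a fully blocking core that stalls on every request, whereas a real superscalar core overlaps requests. The names `Request`, `gap`, `run`, `fast_uncore`, and `slow_uncore` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Request:
    gap: int      # core-side cycles spent before this request is emitted (assumed field)
    address: int  # address sent to the uncore

def run(trace: List[Request], uncore_latency: Callable[[int], int]) -> int:
    """Replay a request trace against an uncore model and return total cycles."""
    cycle = 0
    for req in trace:
        cycle += req.gap                      # core computes without touching the uncore
        cycle += uncore_latency(req.address)  # core waits for the uncore's answer
    return cycle

# A tiny trace, standing in for one extracted from a cycle-accurate simulation.
trace = [Request(10, 0x100), Request(5, 0x140), Request(20, 0x100)]

fast_uncore = lambda addr: 30   # assumed fixed latency, in cycles
slow_uncore = lambda addr: 200  # assumed slower memory system

print(run(trace, fast_uncore))  # 125
print(run(trace, slow_uncore))  # 635
```

The same core-side trace is reused for both uncore configurations, which is the property that lets the cost of building the model be amortized over many uncore experiments.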
Citations: 5