
Latest papers from the 2010 13th Euromicro Conference on Digital System Design: Architectures, Methods and Tools

Application Dependent FPGA Testing Method
M. Rozkovec, Jiri Jenícek, O. Novák
Application-dependent FPGA testing can reduce time and memory requirements compared with tests that exercise the complete FPGA structure. This paper describes an FPGA testing methodology that does not require reconfiguration of the tested hardware and thus preserves the conditions that caused the erroneous behavior of the FPGA during operation. We show that the tested part of the FPGA can be efficiently tested by deterministic test patterns even when we have no precise information about the internal FPGA structure. Storing uncompressed deterministic test patterns on the FPGA consumes too much hardware. For this reason we propose to compress the deterministic test patterns with the help of COMPAS, a compression system that uses scan chains for pattern decompression. COMPAS is well suited to current FPGAs, as they can store the scan chain content in LUT-based shift registers. The COMPAS test compression system is based on test pattern overlapping; we propose an improved version of it. Applying overlapped test patterns requires additional shift registers that save the test patterns while test responses are recorded into the internal scan chains. The neighborhood of the tested part of the FPGA can be dynamically reconfigured into shift registers and an ORA. The shift registers contain the compressed test sequence and allow fast test pattern decompression. Experimental results given in the paper demonstrate the efficiency of the proposed FPGA testing method.
DOI: https://doi.org/10.1109/DSD.2010.65 (published 2010-09-01)
Citations: 23
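The abstract above says the compression relies on test pattern overlapping but does not spell out the procedure. As a rough illustration only, the following Python sketch chains fixed-width patterns by reusing the longest suffix/prefix overlap, then recovers each pattern as a window of the resulting bit sequence (much as a scan chain would expose it after a few shifts). The function names and the greedy, fixed-order strategy are assumptions made for this sketch; this is not the COMPAS algorithm, and don't-care bits are ignored.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def compress(patterns):
    """Chain patterns in the given order, sharing overlapping bits.

    Returns the merged bit sequence and the start offset of every pattern.
    (Hypothetical helper for illustration; greedy and order-dependent.)
    """
    seq = patterns[0]
    offsets = [0]
    for p in patterns[1:]:
        k = overlap(seq, p)
        offsets.append(len(seq) - k)  # pattern starts inside the shared tail
        seq += p[k:]                  # only the non-overlapping bits are added
    return seq, offsets


def decompress(seq, offsets, width):
    """Recover each pattern as a width-bit window of the sequence."""
    return [seq[o:o + width] for o in offsets]


if __name__ == "__main__":
    patterns = ["1100", "0011", "1110"]
    seq, offsets = compress(patterns)
    print(seq, offsets)  # 8 bits stored instead of 12
    print(decompress(seq, offsets, 4) == patterns)
```

In this toy case each new pattern costs two shifted-in bits rather than four, which is the kind of saving overlapping aims for; how many bits actually overlap depends entirely on the pattern set and its ordering.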
Re-NUCA: Boosting CMP Performance Through Block Replication
P. Foglia, C. Prete, M. Solinas, Giovanna Monni
Chip Multiprocessor (CMP) systems have become the reference architecture for designing microprocessors, thanks to improvements in semiconductor nanotechnology that have continuously provided a growing number of faster and smaller transistors per chip. Interest in CMPs has grown because classical techniques for boosting performance, e.g. increasing the clock frequency and the amount of work performed in each clock cycle, can no longer deliver significant improvements due to energy constraints and wire-delay effects. CMP systems generally adopt a large last-level cache (LLC) (typically L2 or L3) shared among all cores, and private L1 caches. As the miss resolution time for private caches depends on the response time of the LLC, which is wire-delay dominated, performance is affected by wire delay. NUCA caches have been proposed for single- and multi-core systems as a mechanism for tolerating wire-delay effects on overall performance. In this paper, we introduce a novel NUCA architecture, called Re-NUCA, specifically suited for (but not limited to) CMPs in which cores are placed at different sides of the shared cache. The idea is to allow shared blocks to be replicated inside the shared cache, in order to avoid the limits on performance improvement that arise in classical D-NUCA caches due to the conflict hit problem. Our results show that Re-NUCA outperforms D-NUCA by more than 5% on average, and for applications that suffer strongly from the conflict hit problem we observe performance improvements of up to 15%.
DOI: https://doi.org/10.1109/DSD.2010.41 (published 2010-09-01)
Citations: 12
Low Power FPGA Implementations of 256-bit Luffa Hash Function
P. Kitsos, N. Sklavos, A. Skodras
This paper presents low-power techniques for an FPGA implementation of the hash function called Luffa. This hash function is under consideration for adoption as a standard. Two major gate-level techniques are introduced in order to reduce power consumption, namely pipelining (with some variants) and the use of embedded RAM blocks instead of general-purpose logic elements. With the proposed techniques, power consumption is reduced by a factor of 1.2 to 8.7 compared with an implementation that uses no low-power technique.
DOI: https://doi.org/10.1109/DSD.2010.19 (published 2010-09-01)
Citations: 1
Evaluating a Transmission Power Self-Optimization Technique for WSN in EMI Environments
F. Lavratti, A. R. Pinto, L. Bolzani, Fabian Vargas, C. Montez, F. Hernandez, E. Gatti, C. Silva
Wireless Sensor Networks (WSNs) can be used to monitor hazardous and inaccessible areas. A WSN is composed of several nodes, each provided with its own separate power supply, e.g. a battery. When working in hard-to-access places, it is preferable to use the minimum transmission power in order to prolong the WSN's lifetime as much as possible. However, we have to keep in mind that the reliability of the transmitted data is a crucial requirement. Therefore, power optimization and reliability have become the most important concerns when dealing with modern WSN-based systems. In this context, we propose to evaluate the effectiveness of a Transmission Power Self-Optimization (TPSO) technique for WSNs in an Electromagnetic Interference (EMI) environment. The TPSO technique consists of an algorithm able to guarantee an equally high Quality of Service (QoS), concentrating on the WSN's Efficiency (Ef), while optimizing the transmission power necessary for data communication. Thus, the main idea behind our approach is to reach a trade-off between Ef and energy consumption in an environment with inherent noise.
DOI: https://doi.org/10.1109/DSD.2010.116 (published 2010-09-01)
Citations: 8
An Approximate Maximum Common Subgraph Algorithm for Large Digital Circuits
J. Rutgers, P. T. Wolkotte, P. Hölzenspies, J. Kuper, G. Smit
This paper presents an approximate Maximum Common Subgraph (MCS) algorithm, specifically for directed, cyclic graphs representing digital circuits. Because of the application domain, the graphs have convenient properties: they are very sparse, have many different labels, and most vertices have only one predecessor. The algorithm iterates over all vertices once and uses heuristics to find the MCS. Its computational complexity is linear in the size of the graph. Experiments show that very large common subgraphs were found in graphs of up to 200,000 vertices within a few minutes when a quarter or less of the graphs differ. The variation in run time and in the quality of the result is low.
DOI: https://doi.org/10.1109/DSD.2010.29 (published 2010-09-01)
Citations: 9
A C-to-RTL Flow as an Energy Efficient Alternative to Embedded Processors in Digital Systems
Sameer D. Sahasrabuddhe, S. Subramanian, Kunal P. Ghosh, K. Arya, M. Desai
We present a high-level synthesis flow for mapping an algorithm description (in C) to a provably equivalent register transfer level (RTL) description of hardware. This flow uses an intermediate representation that is an orthogonal factorization of the program behavior into control, data and memory aspects, and is suitable for the description of large systems. We show that optimizations such as arbiter-less resource sharing can be efficiently computed on this representation. We apply the flow to a wide range of examples, from stream ciphers to database and linear algebra applications. The resulting RTL is then put through a standard ASIC tool chain (synthesis followed by automatic place-and-route), and the performance and power dissipation of the resulting layout are computed. We observe that the energy consumption (per completed task) of each resulting circuit is considerably lower than that of an equivalent executable running on a low-power processor, indicating that this C-to-RTL flow offers an energy-efficient alternative to the use of embedded processors in mapping algorithms to digital VLSI systems.
DOI: https://doi.org/10.1109/DSD.2010.52 (published 2010-09-01)
Citations: 6
A Programming Model and a NoC-Based Architecture for Streaming Applications
Yun Wu, D. Houzet, Sylvain Huet
The ever-increasing density of integration makes the NoC a relevant communication design paradigm even for FPGAs. But NoCs, like buses and crossbars, are always designed without consideration of applications and programming models. Dealing with parallelism is still challenging. One way is to take the parallel programming model and the application field into account in the design of the NoC, to reduce the semantic gap between application and implementation. In this paper we present a NoC and a design flow that target the implementation of streaming applications, e.g. image and video processing. The NoC topology is described as a matrix of routers (possibly a sparse matrix) mapped onto a matrix of FPGAs for prototyping, which introduces a hierarchical dimension. In addition, the NoC has been developed in conjunction with a streaming programming model expressed in a subset of the SystemC language. This allows the NoC to be optimized by implementing the mechanisms of the programming model's communication and synchronization primitives directly in hardware: such a router connected to 4 processing elements occupies about 2000 CLBs of a Xilinx FPGA, which is comparable to the size of a single processor. The design flow automates the implementation of an application expressed in a SystemC subset on a NoC-based architecture.
DOI: https://doi.org/10.1109/DSD.2010.66 (published 2010-09-01)
Citations: 8
Unified Digit Serial Systolic Montgomery Multiplication Architecture for Special Classes of Polynomials over GF(2^m)
S. Talapatra, H. Rahaman, Samir K. Saha
This paper presents a unified digit-serial systolic multiplication architecture for all-one polynomials (AOP) and trinomials over GF(2^m) for efficient implementation of the Montgomery Multiplication (MM) algorithm, suitable for cryptosystems. This is the first reported unified digit-serial systolic digit-level pipelined MM architecture for AOP and trinomials over GF(2). Analysis shows that the latency and circuit complexity of the proposed architecture are significantly lower than those of earlier designs for the same class of polynomials. The proposed multiplier has a latency of 2N clock cycles, where N = ⌈m/L⌉, m is the word size and L is the digit size.
DOI: https://doi.org/10.1109/DSD.2010.59 (published 2010-09-01)
Citations: 22
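To make the latency figure in the abstract above concrete, here is a small Python sketch, offered under stated assumptions rather than as the paper's design. It takes N as the ceiling of m/L (an assumption about the abstract's notation) and pairs that with a plain bit-serial Montgomery multiplication over GF(2^m), which computes a·b·x^(-m) mod f(x). The function names, the integer bit-mask encoding of polynomials, and the bit-serial (rather than digit-serial systolic) formulation are all choices made for illustration.

```python
def latency_cycles(m: int, digit_size: int) -> int:
    """Clock-cycle latency 2N with N = ceil(m / L), as read from the abstract."""
    n = -(-m // digit_size)  # integer ceiling division
    return 2 * n


def mont_mul(a: int, b: int, f: int, m: int) -> int:
    """Bit-serial Montgomery product a * b * x^(-m) mod f over GF(2).

    Polynomials are bit masks (bit i is the coefficient of x^i). f must have
    degree m and a constant term of 1; a and b must have degree below m.
    """
    c = 0
    for i in range(m):
        if (a >> i) & 1:
            c ^= b       # add b when the i-th coefficient of a is set
        if c & 1:
            c ^= f       # clear the constant term so x divides c exactly
        c >>= 1          # divide by x
    return c


def gf2_mulmod(a: int, b: int, f: int, m: int) -> int:
    """Ordinary shift-and-add product a * b mod f, used only as a cross-check."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> m) & 1:
            a ^= f       # reduce as soon as the degree reaches m
    return r


if __name__ == "__main__":
    print(latency_cycles(163, 8))
    m, f = 4, 0b10011            # trinomial x^4 + x + 1
    x_to_m = f ^ (1 << m)        # x^m mod f
    a, b = 0b0110, 0b1011
    # Multiplying the Montgomery product by x^m should give the plain product.
    print(gf2_mulmod(mont_mul(a, b, f, m), x_to_m, f, m) == gf2_mulmod(a, b, f, m))
```

A digit-serial design processes L coefficients of one operand per step instead of one, which is where the ceil(m/L) term comes from; the bit-serial loop here corresponds to L = 1.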
A Case for Hardware Task Management Support for the StarSS Programming Model
C. Meenderinck, B. Juurlink
StarSS is a parallel programming model that eases the task of the programmer. He or she has to identify the tasks that can potentially be executed in parallel and the inputs and outputs of these tasks, while the runtime system takes care of the difficult issues of determining inter-task dependencies, synchronization, load balancing, scheduling to optimize data locality, etc. Given these issues, however, the runtime system might become a bottleneck that limits the scalability of the system. The contribution of this paper is twofold. First, we analyze the scalability of the current software runtime system for several synthetic benchmarks with different dependency patterns and task sizes. We show that for fine-grained tasks the system does not scale beyond five cores. Furthermore, we identify the main scalability bottlenecks of the runtime system. Second, we present the design of Nexus, a hardware support system for StarSS applications that greatly reduces the task management overhead.
DOI: https://doi.org/10.1109/DSD.2010.63 (published 2010-09-01)
Citations: 21
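The abstract above describes a runtime that derives inter-task dependencies from the inputs and outputs the programmer declares. The Python sketch below shows one generic way such bookkeeping could work, tracking the last writer and the readers of each datum to detect read-after-write, write-after-read and write-after-write hazards. The `Task` class and `build_dependencies` function are hypothetical names invented for this sketch; it is not the StarSS runtime or the Nexus hardware, whose internals the abstract does not describe.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """A task with programmer-declared inputs and outputs (hypothetical type)."""
    name: str
    inputs: set = field(default_factory=set)
    outputs: set = field(default_factory=set)


def build_dependencies(tasks):
    """Map each task name to the earlier tasks it must wait for.

    Tasks are visited in program order, as a sequential spawner would see them.
    """
    last_writer = {}  # datum -> name of the most recent task that writes it
    readers = {}      # datum -> names of tasks that read it since that write
    deps = {}
    for t in tasks:
        d = set()
        for x in t.inputs:
            if x in last_writer:
                d.add(last_writer[x])        # read-after-write
        for x in t.outputs:
            if x in last_writer:
                d.add(last_writer[x])        # write-after-write
            d |= readers.get(x, set())       # write-after-read
        d.discard(t.name)
        deps[t.name] = d
        for x in t.inputs:
            readers.setdefault(x, set()).add(t.name)
        for x in t.outputs:
            last_writer[x] = t.name
            readers[x] = set()               # a new value has no readers yet
    return deps


if __name__ == "__main__":
    tasks = [
        Task("A", outputs={"a"}),
        Task("B", inputs={"a"}, outputs={"b"}),
        Task("C", inputs={"a"}, outputs={"c"}),
        Task("D", inputs={"b", "c"}, outputs={"a"}),
    ]
    for name, waits_for in build_dependencies(tasks).items():
        print(name, sorted(waits_for))
```

In this example B and C depend only on A and could run in parallel, while D must wait for all three. Even this simple bookkeeping touches shared tables for every task, which gives some intuition for why per-task overhead matters most when tasks are fine-grained.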
A Design Process for Hardware/Software System Co-design and its Application to Designing a Reconfigurable FPGA 硬件/软件系统协同设计过程及其在可重构FPGA设计中的应用
F. Moreno, I. López, R. Sanz
This paper is going to address the topic of hardware/software systems co-design. The paper will develop two points of view. First, it provides a system-theoretical layout on the problem of designing hardware-software systems. This layout will enable the designer to proceed systematically in optimizing the tradeoff between the desired functionality, available resources and operating conditions. Second, the paper will describe an application of some of the theoretical principles to the design of an embedded automotive system built on a low-cost FPGA.
DOI: https://doi.org/10.1109/DSD.2010.43 (published 2010-09-01)
Citations: 5