Robert Hartl, Andreas-Juergen Rohatschek, W. Stechele, A. Herkersdorf
Single-Event-Upsets in synchronous register-based designs are a severe problem for safety-critical applications. Exact and detailed error rate estimations are needed to determine a system’s level of reliability. Available methods for estimation consider only special effects, use special reliability models or are computationally intensive. We present an innovative method that is able to calculate the architectural vulnerability factor (AVF) of any RT-level circuit description by applying time-reversed stimulus values. This method, which we call Backwards Analysis, considers all major masking effects (logic masking, information lifetime, timing derating, transitive masking) in a single algorithm and delivers results in several levels of detail from average AVF through sensitivity waveforms. The results show the critical parts and states of a design, which could be used for reliability assessment and selective hardening of the circuit to reach a target failure rate.
{"title":"Architectural Vulnerability Factor Estimation with Backwards Analysis","authors":"Robert Hartl, Andreas-Juergen Rohatschek, W. Stechele, A. Herkersdorf","doi":"10.1109/DSD.2010.104","DOIUrl":"https://doi.org/10.1109/DSD.2010.104","journal":{"name":"2010 13th Euromicro Conference on Digital System Design: Architectures, Methods and Tools"},"publicationDate":"2010-09-01","platform":"Semanticscholar","paperid":"131135621"}
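Logic masking, one of the masking effects named in the abstract above, can be illustrated with a brute-force bit-flip experiment. The following is a minimal sketch and not the Backwards Analysis method itself: the circuit `toy_circuit` and the helper `flip_sensitivity` are hypothetical names introduced here, and the experiment ignores information lifetime, timing derating and transitive masking.

```python
from itertools import product

def toy_circuit(a: int, b: int, c: int) -> int:
    """Hypothetical combinational circuit: out = (a AND b) OR c."""
    return (a & b) | c

def flip_sensitivity(circuit, n_inputs: int) -> list[float]:
    """For each input, return the fraction of stimuli for which flipping
    that single bit changes the output (the flip is not logically masked)."""
    stimuli = list(product((0, 1), repeat=n_inputs))
    fractions = []
    for i in range(n_inputs):
        visible = 0
        for s in stimuli:
            flipped = list(s)
            flipped[i] ^= 1  # inject a single bit flip on input i
            if circuit(*s) != circuit(*flipped):
                visible += 1
        fractions.append(visible / len(stimuli))
    return fractions

sens = flip_sensitivity(toy_circuit, 3)
print(sens)  # per-input fraction of flips that reach the output
```

Exhaustive injection of this kind grows exponentially with the number of inputs, which is the sort of computational cost the abstract attributes to existing estimation methods.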
Lara G. Villanueva, G. Callicó, F. Tobajas, S. López, V. Armas, J. López, R. Sarmiento
Nowadays, images are employed in several areas of medicine for early diagnosis. In this sense, the industry provides accurate models to obtain, for example, X-ray and cardiology images of high resolution. However, other images, such as those related to pathological anatomy, are in many situations of poor quality, which complicates the diagnostic process. This work focuses on enhancing the quality of this type of image through a system based on super-resolution techniques. The results show that the proposed methodology can help medical specialists in the diagnosis of several pathologies.
{"title":"Medical Diagnosis Improvement Through Image Quality Enhancement Based on Super-Resolution","authors":"Lara G. Villanueva, G. Callicó, F. Tobajas, S. López, V. Armas, J. López, R. Sarmiento","doi":"10.1109/DSD.2010.35","DOIUrl":"https://doi.org/10.1109/DSD.2010.35","journal":{"name":"2010 13th Euromicro Conference on Digital System Design: Architectures, Methods and Tools"},"publicationDate":"2010-09-01","platform":"Semanticscholar","paperid":"134383122"}
Dmitri Mironov, R. Ubar, S. Devadze, J. Raik, A. Jutman
Logic simulation is a critical component of the design tool flow in modern hardware development efforts. In this paper, a new algorithm for parallel logic simulation is proposed, based on a new model of Structurally Synthesized Multiple Input BDDs (SSMIBDDs). SSMIBDDs allow further model size reduction and therefore faster logic simulation than their predecessor, the SSBDD model. The paper presents a method for synthesizing SSMIBDDs from a given gate network and the main principles of parallel logic simulation with SSMIBDDs. Experimental data demonstrate an average 2.9-fold improvement in logic simulation speed, owing to the reduced number of nodes in SSMIBDDs. Like SSBDDs, the new model preserves structural information about the circuit, which is needed for fault processing. The reduced complexity of SSMIBDDs leads to more powerful fault collapsing and, as a result, to more efficient fault simulation and fault injection for evaluating the dependability of fault-tolerant circuits.
{"title":"Structurally Synthesized Multiple Input BDDs for Speeding Up Logic-Level Simulation of Digital Circuits","authors":"Dmitri Mironov, R. Ubar, S. Devadze, J. Raik, A. Jutman","doi":"10.1109/DSD.2010.27","DOIUrl":"https://doi.org/10.1109/DSD.2010.27","journal":{"name":"2010 13th Euromicro Conference on Digital System Design: Architectures, Methods and Tools"},"publicationDate":"2010-09-01","platform":"Semanticscholar","paperid":"130562342"}
Network-on-chip (NoC) is being proposed as a scalable and reusable communication platform for future embedded systems. The performance of a NoC largely depends on the underlying routing algorithm, which must be deadlock-free and efficient. When adaptive routing returns a set of acceptable output channels, a selection strategy picks one of them, so the selection strategy affects the efficiency of adaptive routing. In this paper, a novel selection strategy is proposed that avoids congested areas using a fuzzy-based routing decision and can be used with any adaptive routing algorithm. Its objective is to choose a channel with more free input-buffer slots and lower power consumption. The routing path is established by minimizing a cost that is calculated by a fuzzy controller and takes into account the power consumption and the free input-buffer slots of the cores. Performance is evaluated with a flit-accurate simulator under different traffic scenarios. Experimental results show that the proposed selection strategy, applied to the Odd-Even routing algorithm, can effectively improve average delay and power consumption, meeting the power balance requirement and avoiding hotspots with low hardware overhead.
{"title":"Power Distribution in NoCs Through a Fuzzy Based Selection Strategy for Adaptive Routing","authors":"Nastaran Salehi, A. Khademzadeh, A. Dana","doi":"10.1109/DSD.2010.24","DOIUrl":"https://doi.org/10.1109/DSD.2010.24","journal":{"name":"2010 13th Euromicro Conference on Digital System Design: Architectures, Methods and Tools"},"publicationDate":"2010-09-01","platform":"Semanticscholar","paperid":"123402309"}
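The role of a selection strategy, picking one channel from the set that adaptive routing allows, can be sketched as a cost minimization. This is only an illustration under stated assumptions: a plain weighted sum stands in for the fuzzy controller of the paper, and the names `Channel`, `BUFFER_DEPTH`, `cost`, `select` and the weights are hypothetical.

```python
from dataclasses import dataclass

BUFFER_DEPTH = 4  # assumed input-buffer depth, in flits

@dataclass
class Channel:
    name: str
    free_slots: int  # free slots in the downstream input buffer
    power: float     # assumed normalized power consumption, 0..1

def cost(ch: Channel, w_buffer: float = 0.5, w_power: float = 0.5) -> float:
    """Weighted-sum stand-in for the fuzzy cost: a fuller buffer and
    higher power consumption both make a channel less attractive."""
    occupancy = 1.0 - ch.free_slots / BUFFER_DEPTH
    return w_buffer * occupancy + w_power * ch.power

def select(candidates: list[Channel]) -> Channel:
    """Pick the admissible output channel with the lowest cost."""
    return min(candidates, key=cost)

# Two output channels returned by a (hypothetical) adaptive routing function.
candidates = [Channel("north", free_slots=1, power=0.2),
              Channel("east", free_slots=3, power=0.4)]
chosen = select(candidates)
print(chosen.name)
```

Here the less congested channel wins even though its power figure is higher; changing the weights shifts that trade-off.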
In most existing approaches, the reorganization of the test vector sequence and the reordering of scan chain registers to reduce power consumption are solved separately and treated as independent procedures. This paper shows that the two processes are correlated and that there are strong reasons to combine them into one procedure in which they run concurrently. Based on this idea, it is demonstrated that the search spaces of both procedures can be combined into a single search space in order to achieve better results during the optimization process. The optimization over the united search space was tested on ISCAS85, ISCAS89 and ITC99 benchmark circuits implemented by means of CMOS primitives from AMI technological libraries. The results show that lower power consumption can be achieved if the correlation is taken into account, i.e., if the search space is united rather than divided into separate spaces. Finally, results achieved by genetic-algorithm-based optimization are presented, discussed and compared with the results of existing methods.
{"title":"The Use of Genetic Algorithm to Derive Correlation Between Test Vector and Scan Register Sequences and Reduce Power Consumption","authors":"Z. Kotásek, Jaroslav Skarvada, Josef Strnadel","doi":"10.1109/DSD.2010.37","DOIUrl":"https://doi.org/10.1109/DSD.2010.37","journal":{"name":"2010 13th Euromicro Conference on Digital System Design: Architectures, Methods and Tools"},"publicationDate":"2010-09-01","platform":"Semanticscholar","paperid":"124912052"}
The communication latency of a Network-on-Chip (NoC) is one of the factors that significantly impact application performance on Systems-on-Chip. To reduce NoC latency, we propose a low-latency router architecture that uses virtual output queuing (VOQ) to shorten the processing time of a packet transfer. By taking advantage of VOQ in buffering, the pipeline of a packet transfer can be reduced to two stages: switch allocation and switch traversal. By speculatively performing these stages in parallel, the router can complete a packet transfer in only one clock cycle. In addition, a multiple-VOQ architecture, in which each input port maintains more than one queue for each output channel, is proposed to improve router throughput. We have implemented the proposed router on an FPGA and evaluated it in terms of communication latency, throughput and hardware amount. The experimental results show that in a 4x4 two-dimensional mesh network, the proposed router reduces communication latency by 25% and area cost by 67.3% compared to the look-ahead speculative virtual channel router.
{"title":"A Low Cost Single-Cycle Router Based on Virtual Output Queuing for On-chip Networks","authors":"S. Nguyen, S. Oyanagi","doi":"10.1109/DSD.2010.15","DOIUrl":"https://doi.org/10.1109/DSD.2010.15","journal":{"name":"2010 13th Euromicro Conference on Digital System Design: Architectures, Methods and Tools"},"publicationDate":"2010-09-01","platform":"Semanticscholar","paperid":"125013589"}
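Virtual output queuing, on which the router above is built, keeps one queue per output port at every input port, so a packet waiting for a busy output does not block packets bound for other outputs. The sketch below shows only this general data structure; `VOQInputPort` and `allocate` are hypothetical names, and the single-cycle speculative pipeline of the paper is not modeled.

```python
from collections import deque

class VOQInputPort:
    """One input port with a separate queue for every output port."""

    def __init__(self, n_outputs: int):
        self.queues = [deque() for _ in range(n_outputs)]

    def enqueue(self, packet, out_port: int) -> None:
        self.queues[out_port].append(packet)

    def requests(self) -> list[int]:
        """Output ports for which this input currently holds a packet."""
        return [o for o, q in enumerate(self.queues) if q]

    def dequeue(self, out_port: int):
        return self.queues[out_port].popleft()

def allocate(port: VOQInputPort, busy_outputs: set[int]):
    """Toy switch allocation: grant the first requested output that is free."""
    for o in port.requests():
        if o not in busy_outputs:
            return o
    return None

port = VOQInputPort(n_outputs=4)
port.enqueue("A", out_port=0)  # output 0 is busy in this example
port.enqueue("B", out_port=1)
granted = allocate(port, busy_outputs={0})
sent = port.dequeue(granted)
print(granted, sent)
```

With a single FIFO per input, packet A at the head would have stalled B; with per-output queues B is forwarded while A keeps waiting.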
Christiaan Baaij, M. Kooijman, J. Kuper, Arjan Boeijink, Marco E. T. Gerards
CλaSH is a functional hardware description language that borrows both its syntax and semantics from the functional programming language Haskell. Polymorphism and higher-order functions provide a level of abstraction and generality that allows a circuit designer to describe circuits in a more natural way than is possible with the language elements found in traditional hardware description languages. Circuit descriptions can be translated to synthesizable VHDL using the prototype CλaSH compiler. As the circuit descriptions, simulation code, and test input are also valid Haskell, complete simulations can be done by a Haskell compiler or interpreter, allowing high-speed simulation and analysis.
{"title":"CλaSH: Structural Descriptions of Synchronous Hardware Using Haskell","authors":"Christiaan Baaij, M. Kooijman, J. Kuper, Arjan Boeijink, Marco E. T. Gerards","doi":"10.1109/DSD.2010.21","DOIUrl":"https://doi.org/10.1109/DSD.2010.21","journal":{"name":"2010 13th Euromicro Conference on Digital System Design: Architectures, Methods and Tools"},"publicationDate":"2010-09-01","platform":"Semanticscholar","paperid":"128858835"}
Present-day Systems-on-Chip use a Network-on-Chip for communication between the cores, which requires proper flow control schemes for efficient utilization of network resources. We propose a flow control scheme that combines piggybacking and credit flit transmission with in-channel signaling to provide the free buffer count of a router's input ports to the neighboring routers. An alternating bit protocol is used for transmitting and receiving data and credit flits. Our scheme uses neither additional flit cycles nor extra signaling lines for credit flit transmission. We used Noxim, a SystemC-based simulator, to evaluate the performance of our scheme. Compared to dedicated signaling flow control, throughput in the proposed scheme remains the same, whereas average delay increases by a maximum of five flit cycles (13 percent) for transpose traffic and a minimum of one flit cycle (0.3 percent) for hotspot traffic. Also, a router designed to implement our scheme requires 12.69 percent fewer signaling lines.
{"title":"In-channel Flow Control Scheme for Network-on-Chip","authors":"Vrishali Vijay Nimbalkar, Kuruvilla Varghese","doi":"10.1109/DSD.2010.73","DOIUrl":"https://doi.org/10.1109/DSD.2010.73","journal":{"name":"2010 13th Euromicro Conference on Digital System Design: Architectures, Methods and Tools"},"publicationDate":"2010-09-01","platform":"Semanticscholar","paperid":"124531037"}
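The free buffer count that the scheme above conveys to neighboring routers is the basis of credit-based flow control in general: a sender may transmit only while it holds credits, and regains one each time the receiver frees a buffer slot. The following sketch shows that generic mechanism only; the classes `Sender` and `Receiver` are hypothetical, and the piggybacking, in-channel signaling and alternating bit protocol of the paper are not modeled.

```python
from collections import deque

class Receiver:
    """Downstream input port with a bounded flit buffer."""

    def __init__(self, depth: int):
        self.depth = depth
        self.buffer = deque()

    def accept(self, flit) -> None:
        # With correct credit accounting this check never fails.
        assert len(self.buffer) < self.depth, "buffer overflow"
        self.buffer.append(flit)

    def drain(self) -> int:
        """Consume one flit; return the number of credits freed (0 or 1)."""
        if self.buffer:
            self.buffer.popleft()
            return 1
        return 0

class Sender:
    """Upstream router output that tracks the receiver's free slots."""

    def __init__(self, credits: int):
        self.credits = credits

    def try_send(self, flit, rx: Receiver) -> bool:
        if self.credits == 0:
            return False  # no known free slot downstream, so stall
        rx.accept(flit)
        self.credits -= 1
        return True

    def add_credits(self, n: int) -> None:
        self.credits += n

rx = Receiver(depth=2)
tx = Sender(credits=2)
results = [tx.try_send(f, rx) for f in ("f0", "f1", "f2")]
tx.add_credits(rx.drain())        # one slot freed, one credit returned
retry = tx.try_send("f2", rx)
print(results, retry)
```

How the credit travels back (a dedicated wire, a separate credit flit, or a field piggybacked on a data flit) is the design dimension the abstract addresses.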
This paper describes and evaluates a method to generate partial FPGA configurations at run-time. The proposed technique is aimed at adaptive embedded systems that employ run-time reconfiguration to achieve high flexibility and performance. The approach is based on the availability of a library of partial bitstreams for a set of basic components. New partial configurations for circuits defined by netlists of basic components are created by merging a default bitstream of the target area, the relocated configurations of the components, and the configurations of the switch matrices used for building the connections between the components. An implementation targeting the Virtex-II Pro platform FPGA is described. It runs on the embedded 300 MHz PowerPC CPU present in the FPGA. The proof-of-concept implementation was used to create partial configurations at run-time for 20 circuits with up to 21 components and 288 connections. The complete configuration creation process took between 7 s and 97 s.
{"title":"Creation of Partial FPGA Configurations at Run-Time","authors":"M. Silva, J. Ferreira","doi":"10.1109/DSD.2010.14","DOIUrl":"https://doi.org/10.1109/DSD.2010.14","journal":{"name":"2010 13th Euromicro Conference on Digital System Design: Architectures, Methods and Tools"},"publicationDate":"2010-09-01","platform":"Semanticscholar","paperid":"131949126"}
As technology advances, the number of cores in Chip Multi Processor systems (CMPs) and Multi Processor Systems-on-Chips (MPSoCs) keeps increasing. Current test chips and products reach tens of cores, and they are expected to reach hundreds of cores in the near future. Such complexity demands an efficient network-on-chip (NoC). The common choice for building such networks is the 2D mesh topology (as it matches the regular tile-based design) and the Dimension-Order Routing (DOR) algorithm (because of its simplicity). The network in such systems must provide sustained throughput and ultra-low latencies. One of the key components in the network is the router, which therefore plays a major role when designing for such performance levels. In this paper we propose a new pipelined router design focused on reducing router latency. As a first step, we identify the router components that take up most of the critical path and thus limit the router frequency. In particular, the arbiter is the component that limits the performance of the router. Based on this fact, we simplify the arbiter logic by using multiple smaller arbiters. The set of requests handled by the initial arbiter is distributed over the smaller arbiters, which operate in parallel. With this design procedure, and with a proper internal router organization, different router architectures are derived. All of them enable the use of smaller arbiters in parallel by replicating ports and assuming the use of the DOR algorithm. The net result of these changes is a faster router. Preliminary results demonstrate a router latency reduction ranging from 10% to 21%, at the cost of an increase in router area. Network latency is reduced by 11% to 15%.
{"title":"A Latency-Efficient Router Architecture for CMP Systems","authors":"Antoni Roca, J. Flich, F. Silla, J. Duato","doi":"10.1109/DSD.2010.42","DOIUrl":"https://doi.org/10.1109/DSD.2010.42","journal":{"name":"2010 13th Euromicro Conference on Digital System Design: Architectures, Methods and Tools"},"publicationDate":"2010-09-01","platform":"Semanticscholar","paperid":"124384374"}
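Dimension-Order Routing on a 2D mesh, which the router designs above assume, can be written in a few lines: a packet first travels along the X dimension until its column matches the destination, then along Y. This is a minimal sketch of that well-known algorithm with hypothetical function names `dor_next_hop` and `dor_path`; the router microarchitecture of the paper is not modeled.

```python
Coord = tuple[int, int]  # (x, y) position of a tile in the mesh

def dor_next_hop(cur: Coord, dst: Coord) -> Coord:
    """Return the next tile on the XY dimension-order route from cur to dst."""
    (x, y), (dx, dy) = cur, dst
    if x != dx:                       # correct the X offset first
        return (x + (1 if dx > x else -1), y)
    if y != dy:                       # then correct the Y offset
        return (x, y + (1 if dy > y else -1))
    return cur                        # already at the destination

def dor_path(src: Coord, dst: Coord) -> list[Coord]:
    """Full list of tiles visited, including source and destination."""
    path = [src]
    while path[-1] != dst:
        path.append(dor_next_hop(path[-1], dst))
    return path

route = dor_path((0, 0), (2, 1))
print(route)
```

Because a packet never turns from the Y dimension back into X, only some input-to-output port combinations can occur in a router, which is a property that smaller per-port arbiters can exploit.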