TCP/IP protocol processing latency has been an important issue in high-speed networks. In this paper, we present an architectural study of TCP/IP protocol. We port the TCP/IP protocol stack from the 4.4 FreeBSD to the SimpleScalar simulation environment. The architectural characteristics, such as instruction level parallelism and cache behavior, are studied through simulation. We also compare the characteristics of TCP/IP protocol to that of SPECint benchmark programs. It turns out that the former is quite different from the latter due to the unique processing structure. Furthermore, in order to improve the effectiveness of instruction cache, frequent instruction pairs are analyzed, and corresponding architectural optimizations are made to the instruction set architecture. The performance is evaluated in the simulator. We find that a 23% improvement can be achieved by taking advantage of the optimization. The instruction set optimizations proposed in this paper will be helpful for the design of new programmable protocol processors in future.
{"title":"Architectural analysis and instruction-set optimization design of network protocol processors","authors":"Haiyong Xie, Li Zhao, L. Bhuyan","doi":"10.1145/944645.944703","DOIUrl":"https://doi.org/10.1145/944645.944703","url":null,"abstract":"TCP/IP protocol processing latency has been an important issue in high-speed networks. In this paper, we present an architectural study of TCP/IP protocol. We port the TCP/IP protocol stack from the 4.4 FreeBSD to the SimpleScalar simulation environment. The architectural characteristics, such as instruction level parallelism and cache behavior, are studied through simulation. We also compare the characteristics of TCP/IP protocol to that of SPECint benchmark programs. It turns out that the former is quite different from the latter due to the unique processing structure. Furthermore, in order to improve the effectiveness of instruction cache, frequent instruction pairs are analyzed, and corresponding architectural optimizations are made to the instruction set architecture. The performance is evaluated in the simulator. We find that a 23% improvement can be achieved by taking advantage of the optimization. The instruction set optimizations proposed in this paper will be helpful for the design of new programmable protocol processors in future.","PeriodicalId":174422,"journal":{"name":"First IEEE/ACM/IFIP International Conference on Hardware/ Software Codesign and Systems Synthesis (IEEE Cat. No.03TH8721)","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129449387","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper presents a software implementation of a very fast parallel Reed-Solomon decoder on the second generation of MorphoSys reconfigurable computation platform, which is targeting on streamed applications such as multimedia and DSP. Numerous modifications of the first-generation of the architecture have made a scalable computation and communication intensive architecture capable of extracting parallelisms of fine grain in instruction level. Many algorithms and the whole digital video broadcasting base-band receiver as well, have been mapped onto the second architecture with impressive performance. The mapping of a Reed-Solomon decoder proposed in the paper highly parallelizes all of its sub-algorithms, including Syndrome Computation, Berlekamp Algorithm, Chein Search, and Error Value Computation, in a SIMD fashion. The mapping is tested on a cycle-accurate simulator, "Mulate", and the performance is encouragingly better than other architectures. The decoding speed of the RS (255,239,16) decoder using two different methods of GF multiplication can be 1.319 Gbps and 2.534 Gbps, respectively. Furthermore, since there is no functionality specifically tailored to Reed-Solomon decoder, the result has demonstrated the capability of MorphoSys architecture to extracting instruction level parallelism from streamed applications.
{"title":"A fast parallel Reed-Solomon decoder on a reconfigurable architecture","authors":"A. Koohi, N. Bagherzadeh, Chengzhi Pan","doi":"10.1145/944645.944660","DOIUrl":"https://doi.org/10.1145/944645.944660","url":null,"abstract":"This paper presents a software implementation of a very fast parallel Reed-Solomon decoder on the second generation of MorphoSys reconfigurable computation platform, which is targeting on streamed applications such as multimedia and DSP. Numerous modifications of the first-generation of the architecture have made a scalable computation and communication intensive architecture capable of extracting parallelisms of fine grain in instruction level. Many algorithms and the whole digital video broadcasting base-band receiver as well, have been mapped onto the second architecture with impressive performance. The mapping of a Reed-Solomon decoder proposed in the paper highly parallelizes all of its sub-algorithms, including Syndrome Computation, Berlekamp Algorithm, Chein Search, and Error Value Computation, in a SIMD fashion. The mapping is tested on a cycle-accurate simulator, \"Mulate\", and the performance is encouragingly better than other architectures. The decoding speed of the RS (255,239,16) decoder using two different methods of GF multiplication can be 1.319 Gbps and 2.534 Gbps, respectively. Furthermore, since there is no functionality specifically tailored to Reed-Solomon decoder, the result has demonstrated the capability of MorphoSys architecture to extracting instruction level parallelism from streamed applications.","PeriodicalId":174422,"journal":{"name":"First IEEE/ACM/IFIP International Conference on Hardware/ Software Codesign and Systems Synthesis (IEEE Cat. No.03TH8721)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130836335","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Tim Kogel, Malte Doerper, Andreas Wieferink, R. Leupers, G. Ascheid, H. Meyr, S. Goossens
Ever increasing complexity and heterogeneity of SoC platforms require on-chip communication schemes beyond the currently omnipresent shared bus architectures. To prevent time consuming design changes late in the design flow, we propose the early exploration of the on-chip communication architecture to meet performance and cost requirements. Based on SystemC 2.0.1 we have defined a modular exploration framework, which is able to capture the effect on performance for different on-chip networks like dedicated point-to-point, shared bus, and crossbar topologies. Monitoring of performance parameters like utilization, latency and throughput drives the mapping of the intermodule traffic to an efficient communication architecture. The effectiveness of our approach is demonstrated by the exemplary design of a high performance Network Processing Unit (NPU), which is compared against a commercial NPU device.
{"title":"A modular simulation framework for architectural exploration of on-chip interconnection networks","authors":"Tim Kogel, Malte Doerper, Andreas Wieferink, R. Leupers, G. Ascheid, H. Meyr, S. Goossens","doi":"10.1145/944645.944648","DOIUrl":"https://doi.org/10.1145/944645.944648","url":null,"abstract":"Ever increasing complexity and heterogeneity of SoC platforms require on-chip communication schemes beyond the currently omnipresent shared bus architectures. To prevent time consuming design changes late in the design flow, we propose the early exploration of the on-chip communication architecture to meet performance and cost requirements. Based on SystemC 2.0.1 we have defined a modular exploration framework, which is able to capture the effect on performance for different on-chip networks like dedicated point-to-point, shared bus, and crossbar topologies. Monitoring of performance parameters like utilization, latency and throughput drives the mapping of the intermodule traffic to an efficient communication architecture. The effectiveness of our approach is demonstrated by the exemplary design of a high performance Network Processing Unit (NPU), which is compared against a commercial NPU device.","PeriodicalId":174422,"journal":{"name":"First IEEE/ACM/IFIP International Conference on Hardware/ Software Codesign and Systems Synthesis (IEEE Cat. No.03TH8721)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131021087","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multitasked real-time systems often employ caches to boost performance. However the unpredictable dynamic behavior of caches makes schedulability analysis of such systems difficult. In particular, the effect of caches needs to be considered for estimating the inter-task interference. As the memory blocks of different tasks can map to the same cache blocks, preemption of a task may introduce additional cache misses. The time penalty introduced by these misses is called the cache-related preemption delay (CRPD). In this paper, we provide a program path analysis technique to estimate CRPD. Our technique performs path analysis of both the preempted and the preempting tasks. Furthermore, we improve the accuracy of the analysis by estimating the possible states of the entire cache at each possible preemption point rather than estimating the states of each cache block independently. To avoid incurring high space requirements, the cache states can be maintained symbolically as a binary decision diagram. Experimental results indicate that we obtain tight CRPD estimates for realistic benchmarks.
{"title":"Accurate estimation of cache-related preemption delay","authors":"H. S. Negi, T. Mitra, Abhik Roychoudhury","doi":"10.1145/944645.944698","DOIUrl":"https://doi.org/10.1145/944645.944698","url":null,"abstract":"Multitasked real-time systems often employ caches to boost performance. However the unpredictable dynamic behavior of caches makes schedulability analysis of such systems difficult. In particular, the effect of caches needs to be considered for estimating the inter-task interference. As the memory blocks of different tasks can map to the same cache blocks, preemption of a task may introduce additional cache misses. The time penalty introduced by these misses is called the cache-related preemption delay (CRPD). In this paper, we provide a program path analysis technique to estimate CRPD. Our technique performs path analysis of both the preempted and the preempting tasks. Furthermore, we improve the accuracy of the analysis by estimating the possible states of the entire cache at each possible preemption point rather than estimating the states of each cache block independently. To avoid incurring high space requirements, the cache states can be maintained symbolically as a binary decision diagram. Experimental results indicate that we obtain tight CRPD estimates for realistic benchmarks.","PeriodicalId":174422,"journal":{"name":"First IEEE/ACM/IFIP International Conference on Hardware/ Software Codesign and Systems Synthesis (IEEE Cat. No.03TH8721)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128480981","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The growing complexity of embedded applications and pressure on time-to-market has resulted in the increasing use of embedded real-time operating systems. Unfortunately, RTOSes can introduce a significant performance degradation. The paper presents the Real-Time Task Manager (RTM) - a processor extension that minimizes the performance drawbacks associated with RTOSes. The RTM accomplishes this by supporting, in hardware, a few of the common RTOS operations that are performance bottlenecks: task scheduling, time management, and event management. By exploiting the inherent parallelism of these operations, the RTM completes them in constant time, thereby significantly reducing RTOS overhead. It decreases both the processor time used by the RTOS and the maximum response time by an order of magnitude.
{"title":"Hardware support for real-time operating systems","authors":"P. Kohout, B. Ganesh, B. Jacob","doi":"10.1145/944645.944656","DOIUrl":"https://doi.org/10.1145/944645.944656","url":null,"abstract":"The growing complexity of embedded applications and pressure on time-to-market has resulted in the increasing use of embedded real-time operating systems. Unfortunately, RTOSes can introduce a significant performance degradation. The paper presents the Real-Time Task Manager (RTM) - a processor extension that minimizes the performance drawbacks associated with RTOSes. The RTM accomplishes this by supporting, in hardware, a few of the common RTOS operations that are performance bottlenecks: task scheduling, time management, and event management. By exploiting the inherent parallelism of these operations, the RTM completes them in constant time, thereby significantly reducing RTOS overhead. It decreases both the processor time used by the RTOS and the maximum response time by an order of magnitude.","PeriodicalId":174422,"journal":{"name":"First IEEE/ACM/IFIP International Conference on Hardware/ Software Codesign and Systems Synthesis (IEEE Cat. No.03TH8721)","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132795773","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The analysis of the amount of human resources required to complete a project is felt as a critical issue in any company of the electronics industry. In particular, early estimation of the effort involved in a development process is a key requirement for any cost-driven system-level design decision. In this paper, we present a methodology to predict the final size of a VHDL project on the basis of a high-level description, obtaining a significant indication about the development effort. The methodology is the composition of a number of specialized models, tailored to estimate the size of specific component types. Models were trained and tested on two disjoint and large sets of real VHDL projects. Quality-of-result indicators show that the methodology is both accurate and robust.
{"title":"Early estimation of the size of VHDL projects","authors":"W. Fornaciari, F. Salice, D. Scarpazza","doi":"10.1145/944645.944699","DOIUrl":"https://doi.org/10.1145/944645.944699","url":null,"abstract":"The analysis of the amount of human resources required to complete a project is felt as a critical issue in any company of the electronics industry. In particular, early estimation of the effort involved in a development process is a key requirement for any cost-driven system-level design decision. In this paper, we present a methodology to predict the final size of a VHDL project on the basis of a high-level description, obtaining a significant indication about the development effort. The methodology is the composition of a number of specialized models, tailored to estimate the size of specific component types. Models were trained and tested on two disjoint and large sets of real VHDL projects. Quality-of-result indicators show that the methodology is both accurate and robust.","PeriodicalId":174422,"journal":{"name":"First IEEE/ACM/IFIP International Conference on Hardware/ Software Codesign and Systems Synthesis (IEEE Cat. No.03TH8721)","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131369432","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this article we present an approach to object-oriented hardware design and synthesis based on SystemC. We give an introduction to an extended SystemC synthesis subset which we propose, and, in particular, its object-oriented features. We will also briefly outline our basic synthesis concepts for object-oriented hardware specifications. Finally, we present some examples for the application of the extended synthesis subset, which are directly processable by a first synthesis tool prototype which we have developed for this purpose.
{"title":"Extending the SystemC synthesis subset by object-oriented features","authors":"E. Grimpe, F. Oppenheimer","doi":"10.1145/944645.944652","DOIUrl":"https://doi.org/10.1145/944645.944652","url":null,"abstract":"In this article we present an approach to object-oriented hardware design and synthesis based on SystemC. We give an introduction to an extended SystemC synthesis subset which we propose, and, in particular, its object-oriented features. We will also briefly outline our basic synthesis concepts for object-oriented hardware specifications. Finally, we present some examples for the application of the extended synthesis subset, which are directly processable by a first synthesis tool prototype which we have developed for this purpose.","PeriodicalId":174422,"journal":{"name":"First IEEE/ACM/IFIP International Conference on Hardware/ Software Codesign and Systems Synthesis (IEEE Cat. No.03TH8721)","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133166942","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Distributed embedded systems implemented with mixed, event-triggered task sets, which communicate over bus protocols consisting of both static and dynamic phases, are emerging as the new standard in application areas such as automotive electronics. In a previous paper, we developed a holistic timing analysis and scheduling approach for this category of systems. Based on this result, new design problems are solved, which we identified as characteristic for such hybrid systems: partitioning of the system functionality into time-triggered and event-triggered domains and the optimization of parameters corresponding to the communication protocol. We addressed both problems in the context of a heuristic which performs mapping and scheduling of the system functionality. We demonstrate the efficiency of the proposed technique with extensive experiments.
{"title":"Design optimization of mixed time/event-triggered distributed embedded systems","authors":"T. Pop, P. Eles, Zebo Peng","doi":"10.1145/944645.944672","DOIUrl":"https://doi.org/10.1145/944645.944672","url":null,"abstract":"Distributed embedded systems implemented with mixed, event-triggered task sets, which communicate over bus protocols consisting of both static and dynamic phases, are emerging as the new standard in application areas such as automotive electronics. In a previous paper, we developed a holistic timing analysis and scheduling approach for this category of systems. Based on this result, new design problems are solved, which we identified as characteristic for such hybrid systems: partitioning of the system functionality into time-triggered and event-triggered domains and the optimization of parameters corresponding to the communication protocol. We addressed both problems in the context of a heuristic which performs mapping and scheduling of the system functionality. We demonstrate the efficiency of the proposed technique with extensive experiments.","PeriodicalId":174422,"journal":{"name":"First IEEE/ACM/IFIP International Conference on Hardware/ Software Codesign and Systems Synthesis (IEEE Cat. No.03TH8721)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129261263","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Future wireless Internet enabled services will be increasingly powerful supporting many more applications including one of the most crucial, security. Although SoCs offer more resistance to bus probing attacks, power/EM attacks on cores and network snooping attacks by malicious code are relevant. This paper presents a methodology for security on NoC at both the network level (or transport layer) and at the core level (or application layer) is proposed. For the first time a low cost security wrapper design is presented, which prevents unencrypted keys from leaving the cores and NoC. This is crucial to prevent untrusted software on or off the NoC from gaining access to keys. At the core level (application layer) power analysis attacks are examined for the first time for parallel and adiabatic architectural cores. With the emergence of secure IP cores in the market, a security methodology for designing NoCs is crucial for supporting future wireless Internet enabled devices.
{"title":"Security wrappers and power analysis for SoC technology","authors":"C. Gebotys, Y. Zhang","doi":"10.1145/944645.944689","DOIUrl":"https://doi.org/10.1145/944645.944689","url":null,"abstract":"Future wireless Internet enabled services will be increasingly powerful supporting many more applications including one of the most crucial, security. Although SoCs offer more resistance to bus probing attacks, power/EM attacks on cores and network snooping attacks by malicious code are relevant. This paper presents a methodology for security on NoC at both the network level (or transport layer) and at the core level (or application layer) is proposed. For the first time a low cost security wrapper design is presented, which prevents unencrypted keys from leaving the cores and NoC. This is crucial to prevent untrusted software on or off the NoC from gaining access to keys. At the core level (application layer) power analysis attacks are examined for the first time for parallel and adiabatic architectural cores. With the emergence of secure IP cores in the market, a security methodology for designing NoCs is crucial for supporting future wireless Internet enabled devices.","PeriodicalId":174422,"journal":{"name":"First IEEE/ACM/IFIP International Conference on Hardware/ Software Codesign and Systems Synthesis (IEEE Cat. No.03TH8721)","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116146167","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The ForSyDe methodology has been developed for system level design. Starting with a formal specification model that captures the functionality of the system at a high abstraction level, it provides formal design transformation methods for a transparent refinement process of the specification model into an implementation model that is optimized for synthesis. A transformation may be semantic preserving or a design decision. The latter modifies the semantics of the system level description and changes the meaning of the model. The main contribution of this paper is the incorporation of model checking to verify that refined system blocks satisfy the design specification. We illustrate the translation of the ForSyDe code to the SMV language and the verification of local design decisions with a case study of a ForSyDe equalizer model.
{"title":"Verification of design decisions in ForSyDe","authors":"Tarvo Raudvere, I. Sander, A. Singh, A. Jantsch","doi":"10.1145/944645.944692","DOIUrl":"https://doi.org/10.1145/944645.944692","url":null,"abstract":"The ForSyDe methodology has been developed for system level design. Starting with a formal specification model that captures the functionality of the system at a high abstraction level, it provides formal design transformation methods for a transparent refinement process of the specification model into an implementation model that is optimized for synthesis. A transformation may be semantic preserving or a design decision. The latter modifies the semantics of the system level description and changes the meaning of the model. The main contribution of this paper is the incorporation of model checking to verify that refined system blocks satisfy the design specification. We illustrate the translation of the ForSyDe code to the SMV language and the verification of local design decisions with a case study of a ForSyDe equalizer model.","PeriodicalId":174422,"journal":{"name":"First IEEE/ACM/IFIP International Conference on Hardware/ Software Codesign and Systems Synthesis (IEEE Cat. No.03TH8721)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116920819","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}