Stack oriented data cache filtering
Rodrigo González-Alberquilla, Fernando Castro, L. Piñuel, F. Tirado
The L1 data cache is one of the most frequently accessed structures in the processor. Because of this and its moderate size, it is a major consumer of power. To reduce its power consumption, this paper proposes a small filter structure that exploits the special features of references to the stack region. This filter, which acts as a top, non-inclusive level of the data memory hierarchy, consists of a register set that keeps the data stored in the neighborhood of the top of the stack. Our simulation results show that with a small Stack Filter (SF) of only a few registers, 15% to 30% data cache power savings can be achieved on average, with a negligible performance penalty.
International Conference on Hardware/Software Codesign and System Synthesis, 2009. DOI: 10.1145/1629435.1629472
Portable SystemC-on-a-chip
Scott Sirowy, Bailey Miller, F. Vahid
SystemC allows description of a digital system using traditional programming features as well as the spatial connectivity features common in hardware description languages. We describe an approach for in-system emulation of circuits described in SystemC. SystemC emulation provides a number of benefits over synthesis, including fast compilation, shorter design time, and lower tool cost. The approach involves a new SystemC bytecode format that executes on an emulation engine running on the microprocessor and/or FPGA of a development platform. Portability is enhanced via a USB flash-drive approach to loading the bytecode onto the platform. Performance is improved using emulation accelerators on an FPGA. We describe our SystemC-to-bytecode compiler, bytecode format, emulation engine, and emulation accelerators. We illustrate the approach on a variety of examples, showing easy porting of a single application across various platforms and emulation speed on an FPGA that is comparable to SystemC execution on a PC.
International Conference on Hardware/Software Codesign and System Synthesis, 2009. DOI: 10.1145/1629435.1629439
SARA: StreAm register allocation
P. Raghavan, F. Catthoor
Low-power design criteria for embedded systems have led to many innovative architectures. One of the core architectural changes introduced in the recent past is streaming registers. These architectures have been shown to be both power efficient and performance efficient. However, code has to be mapped onto them efficiently to make maximal use of their potential. This paper introduces a novel technique for compiling C code onto streaming registers. The proposed technique uses not only the temporal locality in arrays but also the spatial locality to map code onto streaming registers. The proposed StreAm Register Allocation (SARA) technique is shown to provide good mapping efficiency and to scale to realistic applications.
International Conference on Hardware/Software Codesign and System Synthesis, 2009. DOI: 10.1145/1629435.1629442
Building heterogeneous reconfigurable systems with a hardware microkernel
J. Agron, D. Andrews
Field Programmable Gate Arrays (FPGAs) have long held the promise of allowing designers to create systems with performance levels close to custom circuits but with software-like productivity for reconfiguring the gates. Unfortunately, achieving this promise has been elusive. Modern platform FPGAs are now large enough to support complete heterogeneous Multiprocessor Systems-on-Chip (MPSoCs); however, standardized design flows and programming models for such platforms do not yet exist. To achieve truly software-like levels of productivity, the design flow and development environment for heterogeneous MPSoCs must resemble those of standard homogeneous systems. In this paper we present a new design flow and run-time system that enables developers to program a heterogeneous MPSoC using standard POSIX-compatible programming abstractions. The ability to use a standard programming model is achieved by using a hardware-based microkernel to provide OS services to all heterogeneous components. This approach makes programming heterogeneous MPSoCs transparent, and it can increase programmer productivity by replacing synthesis of custom components with faster compilation of heterogeneous executables. The hardware microkernel provides OS services in an ISA-neutral manner, which allows for seamless synchronization and communication among heterogeneous threads.
International Conference on Hardware/Software Codesign and System Synthesis, 2009. DOI: 10.1145/1629435.1629489
On-the-fly hardware acceleration for protocol stack processing in next generation mobile devices
David Szczesny, S. Hessel, Felix Bruns, A. Bilgic
In this paper we present a new on-the-fly hardware acceleration approach, based on a smart Direct Memory Access (sDMA) controller, for the layer 2 (L2) downlink protocol stack processing in Long Term Evolution (LTE) and beyond mobile devices. We use virtual prototyping to simulate an ARM1176 processor based hardware platform together with the executed software comprising an LTE protocol stack model. The sDMA controller, with different hardware accelerator units for the time-critical algorithms in the protocol stack, is implemented and integrated in the hardware platform. We validate our new hardware/software partitioning concept for the LTE L2 by measuring the average execution time per transport block in the protocol stack at different activated on-the-fly hardware acceleration stages in the sDMA controller. At LTE data rates of 100 Mbit/s, we achieve a speedup of 24% compared to a pure software implementation by enabling the sDMA hardware support for header processing in the protocol stack. Furthermore, activation of the complete on-the-fly hardware acceleration in the sDMA controller, including on-the-fly deciphering, leads to a speedup of more than 50%. Finally, at transmission conditions with higher computational demands and data rates up to 320 Mbit/s, we obtain acceleration ratios of almost 80%. Our investigations show that the new sDMA on-the-fly hardware acceleration approach, in combination with a single-core processor, offers the required computational power for next generation mobile devices.
International Conference on Hardware/Software Codesign and System Synthesis, 2009. DOI: 10.1145/1629435.1629457
TotalProf: a fast and accurate retargetable source code profiler
L. Gao, Jia Huang, J. Ceng, R. Leupers, G. Ascheid, H. Meyr
Profilers play an important role in software/hardware design, optimization, and verification. Various approaches have been proposed to implement profilers. The most widespread approach in the embedded domain is Instruction Set Simulation (ISS) based profiling, which provides uncompromised accuracy but limited execution speed. Source code profilers, on the contrary, are fast but less accurate. This paper introduces TotalProf, a fast and accurate source code cross-profiler that estimates the performance of an application from three aspects: First, code optimization and a novel virtual compiler backend are employed to mimic the target compilation process. Second, an optimistic static scheduler is introduced to estimate the behavior of the target processor's datapath. Last but not least, dynamic events, such as cache misses, bus contention and branch prediction failures, are simulated at runtime. With an abstract architecture description, the tool can easily be retargeted, in a performance-characteristics-oriented way, to estimate different processor architectures, including DSPs and VLIW machines. Multiple instances of TotalProf can be integrated with SystemC to support heterogeneous Multi-Processor System-on-Chip (MPSoC) profiling. With an error of only about 5 to 15% on the major performance metrics, such as cycle count, memory accesses and cache misses, an execution speed of more than one giga-instruction per second (GIPS) is achieved.
International Conference on Hardware/Software Codesign and System Synthesis, 2009. DOI: 10.1145/1629435.1629477
Configuration and control of SystemC models using TLM middleware
C. Schröder, Wolfgang Klingauf, Robert Günzel, M. Burton, Eric Roesler
With the emergence of ESL design methodologies, frameworks are being developed to enable engineers to easily configure and control models under simulation. Each of these frameworks has proven good for its specific use case, but they are incompatible. ESL engineers must be able to leverage models and tools from different sources in order to be successful. With today's diversity of configuration mechanisms, however, engineers spend too much time writing adapters between models that have been developed using different tools. We therefore see a need to make the various existing configuration mechanisms cooperate. We present a solution based on a SystemC middleware. The middleware uses a generic transaction-passing mechanism based on TLM-2 concepts and provides interoperability between the different configuration interfaces in a heterogeneous design. The paper analyses configuration in general, explains the technical considerations behind our middleware, and shows how it makes state-of-the-art configuration frameworks interoperable.
International Conference on Hardware/Software Codesign and System Synthesis, 2009. DOI: 10.1145/1629435.1629447
Efficient dynamic voltage/frequency scaling through algorithmic loop transformation
M. Ghodrat, T. Givargis
We present a novel loop transformation technique, particularly well suited for optimizing compilers in the embedded domain, where an increase in compilation time is acceptable in exchange for a significant reduction in energy consumption. Our technique transforms loops containing nested conditional blocks. Specifically, the transformation takes advantage of the fact that the Boolean value of a conditional expression, determining the true/false paths, can be statically analyzed, and this information, combined with loop dependency information, can be used to break up the original loop, containing conditional expressions, into a number of smaller loops without conditional expressions. Subsequently, each of the smaller loops can be executed at the lowest voltage/frequency setting, yielding an overall energy reduction. Our experiments with loop kernels from mpeg4, mpeg-decoder, mpeg-encoder, mp3, qsdpcm and gimp show an impressive energy reduction of 26.56% (average) and 66% (best case) when running on a StrongARM embedded processor. The energy reduction was obtained at no additional performance penalty.
International Conference on Hardware/Software Codesign and System Synthesis, 2009. DOI: 10.1145/1629435.1629464
Exploring hybrid photonic networks-on-chip for emerging chip multiprocessors
Shirish Bahirat, S. Pasricha
Increasing application complexity and improvements in process technology have today enabled chip multiprocessors (CMPs) with tens to hundreds of cores on a chip. Networks-on-Chip (NoCs) have emerged as scalable communication fabrics that can support high bandwidths for these massively parallel systems. However, traditional electrical NoC implementations still need to overcome the challenges of high data transfer latencies and large power consumption. On-chip photonic interconnects have recently been proposed as an alternative to address these challenges, with high performance-per-watt characteristics for intra-chip communication. In this paper, we explore using photonic interconnects on a chip to enhance traditional electrical NoCs. Our proposed hybrid photonic NoC utilizes a photonic ring waveguide to enhance a traditional 2D electrical mesh NoC. Experimental results indicate a strong motivation for considering the proposed hybrid photonic NoC for future CMPs: as much as a 13× reduction in power consumption and improved throughput and access latencies, compared to traditional electrical 2D mesh and torus NoC architectures.
International Conference on Hardware/Software Codesign and System Synthesis, 2009. DOI: 10.1145/1629435.1629453
Memory-efficient distribution of regular expressions for fast deep packet inspection
J. Rohrer, K. Atasu, J. V. Lunteren, C. Hagleitner
Current trends in network security force network intrusion detection systems (NIDS) to scan network traffic at wire speed beyond 10 Gbps against increasingly complex patterns, often specified using regular expressions. As a result, dedicated regular-expression accelerators have recently received considerable attention. The storage efficiency of the compiled patterns is a key factor in the overall performance and critically depends on the distribution of the patterns across a limited number of parallel pattern-matching engines. In this work, we first present a formal definition and complexity analysis of the pattern distribution problem and then introduce optimal and heuristic methods to solve it. Our experiments with five sets of regular expressions from both public and proprietary NIDS result in up to 8.8x better storage efficiency than the state of the art. The average improvement is 2.3x.
International Conference on Hardware/Software Codesign and System Synthesis, 2009. DOI: 10.1145/1629435.1629456