Pub Date : 2007-10-01DOI: 10.1109/ICCD.2007.4601951
Chun-Chen Liu, Haikun Zhu, Chung-Kuan Cheng
This paper develops a novel high-speed inter-chip serial signaling scheme with leakage shunt resistors and termination resistors between the signal trace and ground. For a given abstract transmission-line topology based on data from the IBM high-end AS/400 system [1][2], we place termination resistors at the receiver end and tune the shunt and termination resistor values to obtain an optimal distortion-less transmission line. Analytical formulas are derived to predict the worst-case jitter and eye opening based on the bitonic step-response assumption [3]. Our scheme is discussed alongside two comparison cases.
{"title":"Passive compensation for high performance inter-chip communication","authors":"Chun-Chen Liu, Haikun Zhu, Chung-Kuan Cheng","doi":"10.1109/ICCD.2007.4601951","DOIUrl":"https://doi.org/10.1109/ICCD.2007.4601951","url":null,"abstract":"This paper develops a novel high-speed inter-chip serial signaling scheme with leakage shunt resistors and termination resistors between the signal trace and the ground. For given abstract topology transmission line based on the data for IBM high-end AS/400 system[1] [2], we put termination resistors at the end of receiver and adjust the shunt and termination resistors value to get the optimal distortion-less transmission line. Analytical formulas are derived to predict the worst case jitter and eye-opening based on bitonic step Response Assumption[3]. Our schemes and the other two comparison cases are discussed.","PeriodicalId":6306,"journal":{"name":"2007 25th International Conference on Computer Design","volume":"105 1","pages":"547-552"},"PeriodicalIF":0.0,"publicationDate":"2007-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75666590","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2007-10-01DOI: 10.1109/ICCD.2007.4601949
N. Lotze, M. Ortmanns, Y. Manoli
This paper investigates self-timed asynchronous design techniques for subthreshold digital circuits. In this voltage range, extremely high voltage-dependent delay uncertainties arise, which make synchronous circuits rather inefficient or their reliability doubtful. Delay-line-controlled circuits face these difficulties with self-timed operation, at the cost of the timing margins necessary for proper operation. In this paper we discuss these timing overheads and present our approach to analyzing them and reducing them to a minimum through circuit techniques that allow completion detection. Transistor-level simulation results for an entirely delay-adaptable counter under variable supply down to 200 mV are presented. Additionally, an analytical comparison and simulation of the timing and energy consumption of more complex subthreshold asynchronous circuits is shown. The outcome is that a combination of delay-line-based circuits with circuits using completion detection is promising for applications with extremely low supply voltages.
{"title":"A Study on self-timed asynchronous subthreshold logic","authors":"N. Lotze, M. Ortmanns, Y. Manoli","doi":"10.1109/ICCD.2007.4601949","DOIUrl":"https://doi.org/10.1109/ICCD.2007.4601949","url":null,"abstract":"This paper investigates self-timed asynchronous design techniques for subthreshold digital circuits. In this voltage range extremely high voltage-dependent delay uncertainties arise which make the use of synchronous circuits rather inefficient or their reliability doubtful. Delay-line controlled circuits face these difficulties with self-timed operation with the disadvantage of necessary timing margins for proper operation. In this paper we discuss these necessary timing overheads and present our approach to their analysis and reduction to a minimum value by the use of circuit techniques allowing completion detection. Transistor-level simulation results for an entirely delay-adaptable counter under variable supply down to 200 mV are presented. Additionally an analytical comparison and simulation of timing and energy consumption of more complex subthreshold asynchronous circuits is shown. The outcome is that a combination of delay-line based circuits with circuits using completion detection is promising for applications where the supply voltages are at extremely low levels.","PeriodicalId":6306,"journal":{"name":"2007 25th International Conference on Computer Design","volume":"1 1","pages":"533-540"},"PeriodicalIF":0.0,"publicationDate":"2007-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78504005","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2007-10-01DOI: 10.1109/ICCD.2007.4601911
Yongxiang Liu, Yuchun Ma, E. Kursun, Glenn D. Reinman, J. Cong
Most previous 3D IC research has focused on "stacking" traditional 2D silicon layers, so the interconnect reduction is limited to inter-block delays. In this paper, we propose techniques that enable efficient exploration of the 3D design space where each logical block can span more than one silicon layer. Although further power and performance improvements are achievable through fine-grain 3D integration, the necessary modeling and tool infrastructure has been mostly missing. We develop a cube-packing engine that can simultaneously optimize physical and architectural design for effective utilization of 3D in terms of performance, area, and temperature. Our experimental results using a design driver show a 36% performance improvement (in BIPS) over 2D and 14% over 3D with single-layer blocks. Additionally, multi-layer blocks can provide up to a 30% reduction in power dissipation compared to the single-layer alternatives. The peak temperature of the design is kept within limits as a result of thermal-aware floorplanning and thermal-via insertion techniques.
{"title":"Fine grain 3D integration for microarchitecture design through cube packing exploration","authors":"Yongxiang Liu, Yuchun Ma, E. Kursun, Glenn D. Reinman, J. Cong","doi":"10.1109/ICCD.2007.4601911","DOIUrl":"https://doi.org/10.1109/ICCD.2007.4601911","url":null,"abstract":"Most previous 3D IC research focused on \"stacking\" traditional 2D silicon layers, so the interconnect reduction is limited to interblock delays. In this paper, we propose techniques that enable efficient exploration of the 3D design space where each logical block can span more than one silicon layers. Although further power and performance improvement is achievable through fine grain 3D integration, the necessary modeling and tool infrastructure has been mostly missing. We develop a cube packing engine which can simultaneously optimize physical and architectural design for effective utilization of 3D in terms of performance, area and temperature. Our experimental results using a design driver show 36% performance improvement (in BIPS) over 2D and 14% over 3D with single layer blocks. Additionally multi-layer blocks can provide up to 30% reduction in power dissipation compared to the single-layer alternatives. Peak temperature of the design is kept within limits as a result of thermal-aware floorplanning and thermal via insertion techniques.","PeriodicalId":6306,"journal":{"name":"2007 25th International Conference on Computer Design","volume":"65 1","pages":"259-266"},"PeriodicalIF":0.0,"publicationDate":"2007-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85851636","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2007-10-01DOI: 10.1109/ICCD.2007.4601934
Jugash Chandarlapati, Mainak Chaudhuri
The emerging trend of a larger number of cores or processors on a single chip in server, desktop, and mobile notebook platforms necessarily demands a larger amount of on-chip last-level cache. However, larger caches threaten to dramatically increase leakage power as the industry moves into deeper sub-micron technology. In this paper, with the aim of reducing leakage energy, we introduce LEMap (low-energy map), a novel virtual address translation scheme that controls the set of physical pages mapped to each bank of a large multi-banked non-uniform-access L2 cache shared across all the cores. A combination of profiling, a simple off-line clustering algorithm, and a new flavor of Irix-style application-directed page-placement system call maps the virtual pages that are accessed in the L2 cache roughly together onto the same region of the cache. LEMap thus makes the access windows of the pages mapped to a region roughly identical and increases the average idle time of a region. As a result, powering down a region after the last access to the clusters of the corresponding virtual pages saves much more L2 cache energy than a conventional virtual address translation scheme that is oblivious to access patterns. Our execution-driven simulation of an eight-core chip multiprocessor with a 16 MB shared L2 cache in a 65 nm process, on eight shared-memory parallel applications drawn from the SPLASH-2, SPEC OMP, and DIS suites, shows that LEMap, on average, saves 7% of total energy, 50% of L2 cache energy, and 52% of L2 cache power while suffering a 3% performance loss compared to a baseline system that employs drowsy cells as well as region power-down without access clustering.
{"title":"LEMap: Controlling leakage in large chip-multiprocessor caches via profile-guided virtual address translation","authors":"Jugash Chandarlapati, Mainak Chaudhuri","doi":"10.1109/ICCD.2007.4601934","DOIUrl":"https://doi.org/10.1109/ICCD.2007.4601934","url":null,"abstract":"The emerging trend of larger number of cores or processors on a single chip in the server, desktop, and mobile notebook platforms necessarily demands larger amount of on-chip last level cache. However, larger caches threaten to dramatically increase the leakage power as the industry moves into deeper sub-micron technology. In this paper, with the aim of reducing leakage energy we introduce LEMap (low energy map), a novel virtual address translation scheme to control the set of physical pages mapped to each bank of a large multi-banked non-uniform access L2 cache shared across all the cores. Combination of profiling, a simple off-line clustering algorithm, and a new flavor of Irix-style application-directed page placement system call maps the virtual pages that are accessed in the L2 cache roughly together onto the same region of the cache. Thus LEMap makes the access windows of the pages mapped to a region roughly identical and increases the average idle time of a region. As a result, powering down a region after the last access to the clusters of the corresponding virtual pages saves a much bigger amount of L2 cache energy compared to a usual virtual address translation scheme that is oblivious to access patterns. 
Our execution-driven simulation of an eight-core chip-multiprocessor with a 16 MB shared L2 cache using a 65 nm process on eight shared memory parallel applications drawn from SPLASH-2, SPEC OMP, and DIS suites shows that LEMap, on average, saves 7% of total energy, 50% of L2 cache energy, and 52% of L2 cache power while suffering from a 3% loss in performance compared to a baseline system that employs drowsy cells as well as region power-down without access clustering.","PeriodicalId":6306,"journal":{"name":"2007 25th International Conference on Computer Design","volume":"s1-15 1","pages":"423-430"},"PeriodicalIF":0.0,"publicationDate":"2007-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85971662","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
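The idea of aligning page access windows so a region can power down early can be sketched with a toy greedy clustering (our illustration under simplified assumptions, not the paper's profiling-driven algorithm): sort pages by the end of their access window, so pages that "die" together share a region.

```python
def cluster_pages_by_window(windows, pages_per_region):
    """Group virtual pages whose access windows end close together into the
    same cache region. Each page is an (first_access, last_access) pair of
    timestamps; a region can be powered down after the latest last-access
    among the pages it holds."""
    order = sorted(windows, key=lambda w: w[1])        # sort by last access
    regions = [order[i:i + pages_per_region]
               for i in range(0, len(order), pages_per_region)]
    # power-down instant for each region = latest last-access it contains
    return [(r, max(last for _, last in r)) for r in regions]

# Four pages, two per region: the two short-lived pages land together and
# their region powers down at t=12 instead of being pinned until t=95.
windows = [(0, 10), (2, 12), (50, 90), (55, 95)]
for pages, power_off in cluster_pages_by_window(windows, 2):
    print(pages, "-> region power-down at", power_off)
```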
Pub Date : 2007-10-01DOI: 10.1109/ICCD.2007.4601945
V. Salapura, J. Brunheroto, F. Redígolo, A. Gara
Compared to conventional SRAM, embedded DRAM (eDRAM) offers power, bandwidth, and density advantages for large on-chip cache memories. However, eDRAM suffers from comparatively slower access times than conventional SRAM arrays. To hide eDRAM access latencies, the Blue Gene/L supercomputer implements small private prefetch caches. We present an exploration of design trade-offs for the eDRAM prefetch D-cache. We use full-system simulation to consider operating-system impact. We validate our modeling environment by comparing our simulation results to measurements on actual Blue Gene systems. Actual execution times also include any system effects not modeled in our performance simulator, and they confirm the selection of simulation parameters included in the model. Our experiments show that even small prefetch caches with wide lines efficiently capture spatial locality in many applications. Our 2 KB private prefetch caches reduce execution time by 10% on average, effectively hiding the latency of the eDRAM-based memory system.
{"title":"Exploiting eDRAM bandwidth with data prefetching: simulation and measurements","authors":"V. Salapura, J. Brunheroto, F. Redígolo, A. Gara","doi":"10.1109/ICCD.2007.4601945","DOIUrl":"https://doi.org/10.1109/ICCD.2007.4601945","url":null,"abstract":"Compared to conventional SRAM, embedded DRAM (eDRAM) offers power, bandwidth and density advantages for large on-chip cache memories. However, eDRAM suffers from comparatively slower access times than conventional SRAM arrays. To hide eDRAM access latencies, the Blue Gene/L(&) supercomputer implements small private prefetch caches. We present an exploration of design trade-offs for the prefetch D-cache for eDRAM. We use full system simulation to consider operating system impact. We validate our modeling environment by comparing our simulation results to measurements on actual Blue Gene systems. Actual execution times also include any system effects not modeled in our performance simulator, and confirm the selection of simulation parameters included in the model. Our experiments show that even small prefetch caches with wide lines efficiently capture spatial locality in many applications. Our 2kB private prefetch caches reduce execution time by 10% on average, effectively hiding the latency of the eDRAM-based memory system.","PeriodicalId":6306,"journal":{"name":"2007 25th International Conference on Computer Design","volume":"85 1","pages":"504-511"},"PeriodicalIF":0.0,"publicationDate":"2007-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83000033","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2007-10-01DOI: 10.1109/ICCD.2007.4601900
Lili Zhou, C. Wakayama, Robin Panda, N. Jangkrajarng, B. Hu, C. Shi
A 1024-bit, 1/2-rate fully parallel low-density parity-check (LDPC) code decoder has been designed and implemented in a three-dimensional (3D) 0.18 µm fully depleted silicon-on-insulator (FDSOI) CMOS technology based on wafer bonding. The 3D-IC decoder was implemented with about 8M transistors, placed on three tiers, each with one active layer and three metal layers, using 6.9 mm by 7.0 mm of die area. It was simulated to have a 2 Gbps throughput while consuming only 260 mW. This first large-scale 3D application-specific integrated circuit (ASIC) with fine-grain (5 µm) vertical interconnects is made possible by jointly developing a complete automated 3D design flow from a commercial 2D design flow combined with the needed 3D design tools. The 3D implementation is estimated to offer more than a 10× power-delay-area product improvement over its corresponding 2D implementation. The work demonstrates the benefits of fine-grain 3D integration for interconnect-heavy very-large-scale digital ASIC implementation.
{"title":"Implementing a 2-Gbs 1024-bit ½-rate low-density parity-check code decoder in three-dimensional integrated circuits","authors":"Lili Zhou, C. Wakayama, Robin Panda, N. Jangkrajarng, B. Hu, C. Shi","doi":"10.1109/ICCD.2007.4601900","DOIUrl":"https://doi.org/10.1109/ICCD.2007.4601900","url":null,"abstract":"A 1024-bit, 1/2-rate fully parallel low-density parity-check (LDPC) code decoder has been designed and implemented using a three-dimensional (3D) 0.18 mum fully depleted silicon-on-insulator (FDSOI) CMOS technology based on wafer bonding. The 3D-IC decoder was implemented with about 8M transistors, placed on three tiers, each with one active layer and three metal layers, using 6.9 mm by 7.0 mm of die area. It was simulated to have a 2 Gbps throughput, and consume only 260 mW. This first large-scale 3D application-specific integrated circuit (ASIC) with fine-grain (5mum) vertical interconnects is made possible by jointly developing a complete automated 3D design flow from a commercial 2-D design flow combined with the needed 3D-design tools. The 3D implementation is estimated to offer more than 10 xpower-delay-area product improvement over its corresponding 2D implementation. The work demonstrated the benefits of fine-grain 3D integration for interconnect-heavy very-large-scale digital ASIC implementation.","PeriodicalId":6306,"journal":{"name":"2007 25th International Conference on Computer Design","volume":"5 1","pages":"194-201"},"PeriodicalIF":0.0,"publicationDate":"2007-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84048649","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2007-10-01DOI: 10.1109/ICCD.2007.4601897
Yoshiyuki Kaeriyama, Daichi Zaitsu, Ken-ichi Suzuki, Hiroaki Kobayashi, N. Ohba
Ray tracing is a computer graphics technique for generating photo-realistic images. Although it can generate precise and realistic images, it requires a large amount of computation. The intersection test between rays and objects is one of the dominant factors in ray tracing speed. We propose a new parallel processing architecture, named RPLS, for accelerating the ray tracing computation. RPLS boosts the speed of the intersection test by using a new algorithm based on ray casting through planes, while its data streaming architecture offers highly efficient data provision in a multi-core environment. We estimate the performance of a future SoC implementation of RPLS by software simulation and show a 600-times speedup over a conventional CPU implementation.
{"title":"Multi-core data streaming architecture for ray tracing","authors":"Yoshiyuki Kaeriyama, Daichi Zaitsu, Ken-ichi Suzuki, Hiroaki Kobayashi, N. Ohba","doi":"10.1109/ICCD.2007.4601897","DOIUrl":"https://doi.org/10.1109/ICCD.2007.4601897","url":null,"abstract":"Ray tracing is a computer graphics technique to generate photo-realistic images. All though it can generate precise and realistic images, it requires a large amount of computation. The intersection test between rays and objects is one of the dominant factors of the ray tracing speed. We propose a new parallel processing architecture, named R PL S, for accelerating the ray tracing computation. R PL S boosts the speed of the intersection test by using a new algorithm based on ray-casting through planes, data streaming architecture offers highly efficient data provision in a multi-core environment. We estimate the performance of a future SoC implementation of R PL S by software simulation, and show 600 times speedup over a conventional CPU implementation.","PeriodicalId":6306,"journal":{"name":"2007 25th International Conference on Computer Design","volume":"380 1","pages":"171-178"},"PeriodicalIF":0.0,"publicationDate":"2007-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76452703","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2007-10-01DOI: 10.1109/ICCD.2007.4601958
B. Sethuraman, R. Vemuri
In this research, we analyze the power variations present in a router with a varying number of ports in a Network-on-Chip. The work is divided into two sections, presenting the merits and shortcomings of a multi-port router from the standpoint of power consumption. First, we evaluate the power variations during transfers between various port pairs in a multi-port router. The power gains achieved through careful port selection during the mapping phase of NoC design are shown. Second, through exhaustive experimentation, we discuss the IR-drop-related issues that arise when using large multi-port routers.
{"title":"Power variations of multi-port routers in an application-specific NoC design : A case study","authors":"B. Sethuraman, R. Vemuri","doi":"10.1109/ICCD.2007.4601958","DOIUrl":"https://doi.org/10.1109/ICCD.2007.4601958","url":null,"abstract":"In this research, we analyze the power variations present in a router having varied number of ports, in a Networks- on-Chip. The work is divided into two sections, projecting the merits and shortcomings of a multi-port router from the aspect of power consumption. First, we evaluate the power variations present during the transfers between various port pairs in a multi-port router. The power gains achieved through careful port selection during the mapping phase of the NoC design are shown. Secondly, through exhaustive experimentation, we discuss the IR-drop related issues that arise when using large multi-port routers.","PeriodicalId":6306,"journal":{"name":"2007 25th International Conference on Computer Design","volume":"29 1","pages":"595-600"},"PeriodicalIF":0.0,"publicationDate":"2007-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87035078","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2007-10-01DOI: 10.1109/ICCD.2007.4601917
M. Schulte, Dimitri Tan, C. Lemonds
Floating-point division is an important operation in scientific computing and multimedia applications. This paper presents and compares two division algorithms for an x86 microprocessor that utilizes a rectangular multiplier optimized for multimedia applications. The proposed division algorithms are based on Goldschmidt's division algorithm and provide correctly rounded results for IEEE 754 single, double, and extended-precision floating-point numbers. Compared to a previous Goldschmidt division algorithm, the fastest proposed algorithm requires 25% to 37% fewer cycles while utilizing a multiplier that is roughly 2.5 times smaller.
{"title":"Floating-point division algorithms for an x86 microprocessor with a rectangular multiplier","authors":"M. Schulte, Dimitri Tan, C. Lemonds","doi":"10.1109/ICCD.2007.4601917","DOIUrl":"https://doi.org/10.1109/ICCD.2007.4601917","url":null,"abstract":"Floating-point division is an important operation in scientific computing and multimedia applications. This paper presents and compares two division algorithms for an times86 microprocessor, which utilizes a rectangular multiplier that is optimized for multimedia applications. The proposed division algorithms are based on Goldschmidt's division algorithm and provide correctly rounded results for IEEE 754 single, double, and extended precision floating-point numbers. Compared to a previous Goldschmidt division algorithm, the fastest proposed algorithm requires 25% to 37% fewer cycles, while utilizing a multiplier that is roughly 2.5 times smaller.","PeriodicalId":6306,"journal":{"name":"2007 25th International Conference on Computer Design","volume":"39 1","pages":"304-310"},"PeriodicalIF":0.0,"publicationDate":"2007-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88027798","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2007-10-01DOI: 10.1109/ICCD.2007.4601898
David Meisner, S. Reda
In single-processor architectures, computationally intensive functions are typically accelerated using hardware accelerators, which exploit the concurrency in the function code to achieve a significant speedup over software. Increased design constraints from power density and signal delay have shifted processor architectures in general toward multi-core designs. The migration to multi-core designs introduces the possibility of sharing hardware accelerators between cores. In this paper, we propose the concept of a hardware library: a pool of accelerated functions that are accessible by multiple cores. We find that sharing provides significant reductions in the area, logic usage, and leakage power required for hardware acceleration. Contention for these units may exist in certain cases; however, the savings in chip area are appealing to many applications, particularly in the embedded domain. We study the performance implications of our proposal using various multi-core arrangements, with actual implementations in FPGA fabrics. FPGAs are particularly appealing due to their cost effectiveness, and the attained area savings enable designers to easily add functionality without significant chip revision. Our results show that it is possible to save up to 37% of a chip's available logic and interconnect resources with a negligible impact (< 3%) on performance.
{"title":"Hardware libraries: An architecture for economic acceleration in soft multi-core environments","authors":"David Meisner, S. Reda","doi":"10.1109/ICCD.2007.4601898","DOIUrl":"https://doi.org/10.1109/ICCD.2007.4601898","url":null,"abstract":"In single processor architectures, computationally- intensive functions are typically accelerated using hardware accelerators, which exploit the concurrency in the function code to achieve a significant speedup over software. The increased design constraints from power density and signal delay have shifted processor architectures in general towards multi-core designs. The migration to multi-core designs introduces the possibility of sharing hardware accelerators between cores. In this paper, we propose the concept of a hardware library, which is a pool of accelerated functions that are accessible by multiple cores. We find that sharing provides significant reductions in the area, logic usage and leakage power required for hardware acceleration. Contention for these units may exist in certain cases; however, the savings in terms of chip area are more appealing to many applications, particularly the embedded domain. We study the performance implications for our proposal using various multi-core arrangements, with actual implementations in FPGA fabrics. FPGAs are particularly appealing due to their cost effectiveness and the attained area savings enable designers to easily add functionality without significant chip revision. 
Our results show that is possible to save up to 37% of a chip's available logic and interconnect resources at a negligible impact (< 3%) to the performance.","PeriodicalId":6306,"journal":{"name":"2007 25th International Conference on Computer Design","volume":"21 1","pages":"179-186"},"PeriodicalIF":0.0,"publicationDate":"2007-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88051510","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}