
2007 25th International Conference on Computer Design: Latest Articles

Passive compensation for high performance inter-chip communication
Pub Date : 2007-10-01 DOI: 10.1109/ICCD.2007.4601951
Chun-Chen Liu, Haikun Zhu, Chung-Kuan Cheng
This paper develops a novel high-speed inter-chip serial signaling scheme with leakage shunt resistors and termination resistors between the signal trace and the ground. For a given abstract transmission-line topology based on data for the IBM high-end AS/400 system [1][2], we place termination resistors at the receiver end and adjust the shunt and termination resistor values to obtain an optimal distortionless transmission line. Analytical formulas are derived to predict the worst-case jitter and eye opening based on the bitonic step response assumption [3]. Our schemes and two other comparison cases are discussed.
Pages: 547-552
Citations: 5
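The abstract above does not state its formulas. As background only, the classical Heaviside condition for a distortionless line is sketched below, on the assumption that this is the sense in which "distortionless" is used. The symbols R, L, G, C (per-unit-length series resistance, series inductance, shunt conductance, and shunt capacitance) are introduced here for illustration and do not come from the paper.

```latex
% Propagation constant of a uniform transmission line
\gamma(\omega) = \sqrt{(R + j\omega L)(G + j\omega C)}

% Heaviside (distortionless) condition
\frac{R}{L} = \frac{G}{C}

% Under this condition the attenuation is frequency independent
% and the phase is linear in frequency:
\gamma(\omega) = \sqrt{RG} + j\omega\sqrt{LC},
\qquad
Z_0 = \sqrt{\frac{L}{C}} = \sqrt{\frac{R}{G}}
```

Under this reading, adding shunt resistors between the trace and ground raises G toward RC/L. Every frequency component then sees the same attenuation and delay, in exchange for some signal amplitude.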
A Study on self-timed asynchronous subthreshold logic
Pub Date : 2007-10-01 DOI: 10.1109/ICCD.2007.4601949
N. Lotze, M. Ortmanns, Y. Manoli
This paper investigates self-timed asynchronous design techniques for subthreshold digital circuits. In this voltage range, extremely high voltage-dependent delay uncertainties arise, which make synchronous circuits rather inefficient or their reliability doubtful. Delay-line controlled circuits address these difficulties through self-timed operation, at the cost of the timing margins needed for proper operation. In this paper we discuss these timing overheads and present our approach to analyzing them and reducing them to a minimum by using circuit techniques that allow completion detection. Transistor-level simulation results for an entirely delay-adaptable counter under a variable supply down to 200 mV are presented. Additionally, an analytical comparison and simulation of the timing and energy consumption of more complex subthreshold asynchronous circuits are shown. The outcome is that a combination of delay-line based circuits with circuits using completion detection is promising for applications where the supply voltages are at extremely low levels.
Pages: 533-540
Citations: 23
Fine grain 3D integration for microarchitecture design through cube packing exploration
Pub Date : 2007-10-01 DOI: 10.1109/ICCD.2007.4601911
Yongxiang Liu, Yuchun Ma, E. Kursun, Glenn D. Reinman, J. Cong
Most previous 3D IC research focused on "stacking" traditional 2D silicon layers, so the interconnect reduction is limited to interblock delays. In this paper, we propose techniques that enable efficient exploration of the 3D design space where each logical block can span more than one silicon layer. Although further power and performance improvement is achievable through fine-grain 3D integration, the necessary modeling and tool infrastructure has been mostly missing. We develop a cube packing engine which can simultaneously optimize physical and architectural design for effective utilization of 3D in terms of performance, area and temperature. Our experimental results using a design driver show a 36% performance improvement (in BIPS) over 2D and 14% over 3D with single-layer blocks. Additionally, multi-layer blocks can provide up to a 30% reduction in power dissipation compared to the single-layer alternatives. The peak temperature of the design is kept within limits as a result of thermal-aware floorplanning and thermal via insertion techniques.
Pages: 259-266
Citations: 28
LEMap: Controlling leakage in large chip-multiprocessor caches via profile-guided virtual address translation
Pub Date : 2007-10-01 DOI: 10.1109/ICCD.2007.4601934
Jugash Chandarlapati, Mainak Chaudhuri
The emerging trend toward a larger number of cores or processors on a single chip in server, desktop, and mobile notebook platforms necessarily demands a larger amount of on-chip last-level cache. However, larger caches threaten to dramatically increase leakage power as the industry moves into deeper sub-micron technology. In this paper, with the aim of reducing leakage energy, we introduce LEMap (low energy map), a novel virtual address translation scheme to control the set of physical pages mapped to each bank of a large multi-banked non-uniform access L2 cache shared across all the cores. A combination of profiling, a simple off-line clustering algorithm, and a new flavor of Irix-style application-directed page placement system call maps the virtual pages that are accessed in the L2 cache at roughly the same time onto the same region of the cache. Thus LEMap makes the access windows of the pages mapped to a region roughly identical and increases the average idle time of a region. As a result, powering down a region after the last access to the clusters of the corresponding virtual pages saves much more L2 cache energy than a usual virtual address translation scheme that is oblivious to access patterns. Our execution-driven simulation of an eight-core chip-multiprocessor with a 16 MB shared L2 cache using a 65 nm process on eight shared memory parallel applications drawn from the SPLASH-2, SPEC OMP, and DIS suites shows that LEMap, on average, saves 7% of total energy, 50% of L2 cache energy, and 52% of L2 cache power while suffering a 3% loss in performance compared to a baseline system that employs drowsy cells as well as region power-down without access clustering.
Pages: 423-430
Citations: 3
Exploiting eDRAM bandwidth with data prefetching: simulation and measurements
Pub Date : 2007-10-01 DOI: 10.1109/ICCD.2007.4601945
V. Salapura, J. Brunheroto, F. Redígolo, A. Gara
Compared to conventional SRAM, embedded DRAM (eDRAM) offers power, bandwidth and density advantages for large on-chip cache memories. However, eDRAM suffers from slower access times than conventional SRAM arrays. To hide eDRAM access latencies, the Blue Gene/L supercomputer implements small private prefetch caches. We present an exploration of design trade-offs for the prefetch D-cache for eDRAM. We use full system simulation to consider operating system impact. We validate our modeling environment by comparing our simulation results to measurements on actual Blue Gene systems. Actual execution times also include any system effects not modeled in our performance simulator, and confirm the selection of simulation parameters included in the model. Our experiments show that even small prefetch caches with wide lines efficiently capture spatial locality in many applications. Our 2 kB private prefetch caches reduce execution time by 10% on average, effectively hiding the latency of the eDRAM-based memory system.
Pages: 504-511
Citations: 3
Implementing a 2-Gbs 1024-bit ½-rate low-density parity-check code decoder in three-dimensional integrated circuits
Pub Date : 2007-10-01 DOI: 10.1109/ICCD.2007.4601900
Lili Zhou, C. Wakayama, Robin Panda, N. Jangkrajarng, B. Hu, C. Shi
A 1024-bit, 1/2-rate fully parallel low-density parity-check (LDPC) code decoder has been designed and implemented using a three-dimensional (3D) 0.18 µm fully depleted silicon-on-insulator (FDSOI) CMOS technology based on wafer bonding. The 3D-IC decoder was implemented with about 8M transistors, placed on three tiers, each with one active layer and three metal layers, using 6.9 mm by 7.0 mm of die area. It was simulated to have a 2 Gbps throughput and consume only 260 mW. This first large-scale 3D application-specific integrated circuit (ASIC) with fine-grain (5 µm) vertical interconnects is made possible by jointly developing a complete automated 3D design flow from a commercial 2D design flow combined with the needed 3D design tools. The 3D implementation is estimated to offer more than a 10× improvement in power-delay-area product over its corresponding 2D implementation. The work demonstrates the benefits of fine-grain 3D integration for interconnect-heavy very-large-scale digital ASIC implementation.
Pages: 194-201
Citations: 2
Multi-core data streaming architecture for ray tracing
Pub Date : 2007-10-01 DOI: 10.1109/ICCD.2007.4601897
Yoshiyuki Kaeriyama, Daichi Zaitsu, Ken-ichi Suzuki, Hiroaki Kobayashi, N. Ohba
Ray tracing is a computer graphics technique to generate photo-realistic images. Although it can generate precise and realistic images, it requires a large amount of computation. The intersection test between rays and objects is one of the dominant factors of the ray tracing speed. We propose a new parallel processing architecture, named R PL S, for accelerating the ray tracing computation. R PL S boosts the speed of the intersection test by using a new algorithm based on ray-casting through planes, and its data streaming architecture offers highly efficient data provision in a multi-core environment. We estimate the performance of a future SoC implementation of R PL S by software simulation, and show a 600 times speedup over a conventional CPU implementation.
Pages: 171-178
Citations: 0
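For readers unfamiliar with the intersection test mentioned in the ray tracing abstract, the following is a minimal sketch of the textbook ray-plane intersection. It is not the plane-based algorithm of the paper, which the abstract does not describe. The function name, argument names, and the `eps` tolerance are hypothetical choices made for illustration.

```python
from typing import Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two 3-component vectors."""
    return sum(x * y for x, y in zip(a, b))


def ray_plane_intersection(
    origin: Sequence[float],
    direction: Sequence[float],
    plane_point: Sequence[float],
    plane_normal: Sequence[float],
    eps: float = 1e-9,  # hypothetical tolerance for the parallel case
) -> Optional[Tuple[float, Vec3]]:
    """Return (t, hit_point) for the ray origin + t * direction, or None.

    Textbook test only; an illustrative assumption, not the paper's method.
    """
    denom = dot(direction, plane_normal)
    if abs(denom) < eps:
        # Ray is (nearly) parallel to the plane: treat as no intersection.
        return None
    diff = tuple(p - o for p, o in zip(plane_point, origin))
    t = dot(diff, plane_normal) / denom
    if t < 0.0:
        # The plane lies behind the ray origin.
        return None
    hit = tuple(o + t * d for o, d in zip(origin, direction))
    return t, hit
```

A ray tracer evaluates a test of this kind for very many ray-object pairs per frame, which is why the abstract singles it out as a dominant cost.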
Power variations of multi-port routers in an application-specific NoC design: A case study
Pub Date : 2007-10-01 DOI: 10.1109/ICCD.2007.4601958
B. Sethuraman, R. Vemuri
In this research, we analyze the power variations present in a router with a varying number of ports in a Network-on-Chip. The work is divided into two sections, showing the merits and shortcomings of a multi-port router in terms of power consumption. First, we evaluate the power variations present during transfers between various port pairs in a multi-port router. The power gains achieved through careful port selection during the mapping phase of the NoC design are shown. Second, through exhaustive experimentation, we discuss the IR-drop related issues that arise when using large multi-port routers.
Pages: 595-600
Citations: 0
Floating-point division algorithms for an x86 microprocessor with a rectangular multiplier
Pub Date : 2007-10-01 DOI: 10.1109/ICCD.2007.4601917
M. Schulte, Dimitri Tan, C. Lemonds
Floating-point division is an important operation in scientific computing and multimedia applications. This paper presents and compares two division algorithms for an x86 microprocessor, which utilizes a rectangular multiplier that is optimized for multimedia applications. The proposed division algorithms are based on Goldschmidt's division algorithm and provide correctly rounded results for IEEE 754 single, double, and extended precision floating-point numbers. Compared to a previous Goldschmidt division algorithm, the fastest proposed algorithm requires 25% to 37% fewer cycles, while utilizing a multiplier that is roughly 2.5 times smaller.
Pages: 304-310
Citations: 15
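The floating-point division abstract builds on Goldschmidt's division algorithm. The basic iteration can be sketched in software as below, as a minimal illustration under stated assumptions. It uses ordinary double-precision arithmetic, scales the divisor with `math.frexp` in place of the table-based initial approximation a hardware design would normally use, and does not guarantee correctly rounded results. The function name and the `iterations` parameter are hypothetical, and nothing here reproduces the paper's rectangular-multiplier variants.

```python
import math


def goldschmidt_divide(n: float, d: float, iterations: int = 6) -> float:
    """Approximate n / d with Goldschmidt's multiplicative iteration.

    Illustrative sketch only: not correctly rounded, and not the paper's
    hardware algorithm.
    """
    if d == 0.0:
        raise ZeroDivisionError("division by zero")

    # Work on magnitudes and restore the sign at the end.
    sign = -1.0 if (n < 0.0) != (d < 0.0) else 1.0
    n, d = abs(n), abs(d)

    # Scale numerator and denominator by the same power of two so that
    # 0.5 <= d < 1; the quotient n / d is unchanged.
    mantissa, exponent = math.frexp(d)  # d == mantissa * 2**exponent
    n = math.ldexp(n, -exponent)
    d = mantissa

    # Each step multiplies both terms by f = 2 - d. If d = 1 - e, the new
    # denominator is 1 - e**2, so d converges quadratically to 1 and n
    # converges to the quotient.
    for _ in range(iterations):
        f = 2.0 - d
        n *= f
        d *= f

    return sign * n
```

Because the two multiplications in each step are independent of each other, they can be pipelined through a single multiplier, which is one commonly cited reason multiplier-based designs favor this iteration.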
Hardware libraries: An architecture for economic acceleration in soft multi-core environments
Pub Date : 2007-10-01 DOI: 10.1109/ICCD.2007.4601898
David Meisner, S. Reda
In single processor architectures, computationally intensive functions are typically accelerated using hardware accelerators, which exploit the concurrency in the function code to achieve a significant speedup over software. The increased design constraints from power density and signal delay have shifted processor architectures in general towards multi-core designs. The migration to multi-core designs introduces the possibility of sharing hardware accelerators between cores. In this paper, we propose the concept of a hardware library, which is a pool of accelerated functions that are accessible by multiple cores. We find that sharing provides significant reductions in the area, logic usage and leakage power required for hardware acceleration. Contention for these units may exist in certain cases; however, the savings in terms of chip area are more appealing to many applications, particularly in the embedded domain. We study the performance implications of our proposal using various multi-core arrangements, with actual implementations in FPGA fabrics. FPGAs are particularly appealing due to their cost effectiveness, and the attained area savings enable designers to easily add functionality without significant chip revision. Our results show that it is possible to save up to 37% of a chip's available logic and interconnect resources at a negligible impact (< 3%) on performance.
Pages: 179-186