
2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP): Latest Papers

GPU-based multifrontal optimizing method in sparse Cholesky factorization
Ran Zheng, Wei Wang, Hai Jin, Song Wu, Yong Chen, Han Jiang
In many scientific computing applications, sparse Cholesky factorization is used to solve large sparse linear systems in distributed environments. GPU computing offers a new way to solve the problem. However, sparse Cholesky factorization on a GPU rarely achieves excellent performance because of the structural irregularity of the matrix and low GPU resource utilization. A hybrid CPU-GPU implementation of sparse Cholesky factorization based on the multifrontal method is proposed. In this method, a large sparse coefficient matrix is decomposed into a series of small dense matrices (frontal matrices), and then multiple GEMM (General Matrix-matrix Multiplication) operations are computed. GEMMs are the main operations in sparse Cholesky factorization, but they are hard to run efficiently in parallel on a GPU. To improve performance, a scheme of multiple task queues is adopted when performing the multiple GEMMs parallelized with the multifrontal method; all GEMM tasks are scheduled dynamically on the GPU and CPU according to their computation scale, for load balance and reduced computing time. Experimental results show that the approach can outperform BLAS and cuBLAS implementations, achieving up to 3.15× and 1.98× speedup, respectively.
DOI: 10.1109/ASAP.2015.7245714 (https://doi.org/10.1109/ASAP.2015.7245714). Pages: 90-97. Published: 2015-07-27.
Citations: 6
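The size-based placement of GEMM tasks described in the abstract above can be illustrated with a minimal Python sketch. This is not the authors' scheduler: the names `GemmTask`, `gemm_flops`, and `dispatch`, the flop-count cost model, and the fixed threshold are all assumptions made for illustration, and the placement here is decided once up front rather than dynamically at run time.

```python
from collections import deque
from typing import NamedTuple


class GemmTask(NamedTuple):
    """Hypothetical descriptor for one dense update C += A @ B."""
    m: int  # rows of A and C
    n: int  # columns of B and C
    k: int  # shared inner dimension


def gemm_flops(task: GemmTask) -> int:
    """Approximate cost of a GEMM: one multiply and one add per inner step."""
    return 2 * task.m * task.n * task.k


def dispatch(tasks, gpu_threshold_flops):
    """Split tasks into a CPU queue and a GPU queue by computation scale.

    Large tasks go to the GPU queue, small ones to the CPU queue.
    Each queue is ordered largest-first.
    """
    cpu_queue, gpu_queue = deque(), deque()
    for task in sorted(tasks, key=gemm_flops, reverse=True):
        if gemm_flops(task) >= gpu_threshold_flops:
            gpu_queue.append(task)
        else:
            cpu_queue.append(task)
    return cpu_queue, gpu_queue


tasks = [GemmTask(8, 8, 8), GemmTask(512, 512, 512), GemmTask(64, 64, 64)]
cpu_queue, gpu_queue = dispatch(tasks, gpu_threshold_flops=1_000_000)
print("GPU:", list(gpu_queue))
print("CPU:", list(cpu_queue))
```

A real implementation would presumably choose the threshold from measured CPU and GPU throughput and let an idle device take work from the other queue; those details are not given in the abstract.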
GPU kernels for high-speed 4-bit astrophysical data processing
P. Klages, K. Bandura, N. Denman, A. Recnik, J. Sievers, K. Vanderlinde
Interferometric radio telescopes often rely on computationally expensive O(N^2) correlation calculations; fortunately, these computations map well to massively parallel accelerators such as low-cost GPUs. This paper describes the OpenCL kernels developed for the GPU-based X-engine of a new hybrid FX correlator. Channelized data from the F-engine is supplied to the GPUs as 4-bit, offset-encoded real and imaginary integers. Because of the low bit depth of the data, two values may be packed into a 32-bit register, allowing multiplication and addition of more than one value with a single fused multiply-add instruction. With these kernels, as many as 5.6 effective tera-operations per second (TOPS) can be executed on a 4.3 TOPS GPU. By design, these kernels allow correlations to scale to large numbers of input elements, and are limited only by maximum buffer sizes on the GPU. This code is currently working on-sky with the CHIME Pathfinder Correlator in BC, Canada.
DOI: 10.1109/ASAP.2015.7245729 (https://doi.org/10.1109/ASAP.2015.7245729). Pages: 164-165. Published: 2015-03-20.
Citations: 12
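The packing trick mentioned in the abstract above (two small values in one 32-bit register, updated by a single multiply-add) can be sketched in Python. This is an illustration of the general idea only, not the paper's OpenCL kernels: the 16-bit field width and the helper names `pack`, `unpack`, and `packed_mad` are assumptions, and the values are treated as unsigned 4-bit codes (0 to 15) with the offset correction omitted.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF


def pack(hi: int, lo: int) -> int:
    """Place two small unsigned values in the high and low 16-bit fields."""
    return ((hi & MASK16) << 16) | (lo & MASK16)


def unpack(word: int):
    """Recover the (high, low) 16-bit fields of a 32-bit word."""
    return (word >> 16) & MASK16, word & MASK16


def packed_mad(acc: int, packed_a: int, b: int) -> int:
    """One multiply-add that updates both fields at once.

    (hi * 2**16 + lo) * b == (hi * b) * 2**16 + (lo * b), so the two
    products stay in separate fields as long as the low field never
    exceeds 16 bits. With 4-bit inputs each product is at most
    15 * 15 = 225, so a few hundred accumulations fit before overflow.
    """
    return (acc + packed_a * b) & MASK32


acc = 0
acc = packed_mad(acc, pack(15, 3), 7)  # fields become (105, 21)
acc = packed_mad(acc, pack(2, 9), 4)   # fields become (113, 57)
print(unpack(acc))  # (113, 57)
```

Because one instruction yields two useful products, the effective operation count can exceed the nominal instruction throughput, which is consistent with the abstract's figure of 5.6 effective TOPS on a 4.3 TOPS GPU.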
An efficient real-time data pipeline for the CHIME Pathfinder radio telescope X-engine
A. Recnik, K. Bandura, N. Denman, A. Hincks, G. Hinshaw, P. Klages, U. Pen, K. Vanderlinde
The CHIME Pathfinder is a new interferometric radio telescope that uses a hybrid FPGA/GPU FX correlator. The GPU-based X-engine of this correlator processes over 819 Gb/s of 4+4-bit complex astronomical data from N=256 inputs across a 400 MHz radio band. A software framework is presented to manage this real-time data flow, which allows each of 16 processing servers to handle 51.2 Gb/s of astronomical data, plus 8 Gb/s of ancillary data. Each server receives data in the form of UDP packets from an FPGA F-engine over eight 10 GbE links, combines data from these packets into large (32 MB to 256 MB) buffered frames, and transfers them to multiple GPU co-processors for correlation. The results from the GPUs are combined and normalized, then transmitted to a collection server, where they are merged into a single file. Aggressive optimizations enable each server to handle this high data rate, allowing the efficient correlation of 25 MHz of radio bandwidth per server. The solution scales well to larger values of N by adding servers.
DOI: 10.1109/ASAP.2015.7245705 (https://doi.org/10.1109/ASAP.2015.7245705). Pages: 57-61. Published: 2015-03-20.
Citations: 21
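The data rates quoted in the abstract above are mutually consistent, as a short Python check shows. The one assumption not stated in the abstract is that each input delivers one 4+4-bit complex sample per hertz of bandwidth per second (critically sampled channelized data); the variable names are illustrative, and the link capacity uses the nominal 10 Gb/s line rate without protocol overhead.

```python
N_INPUTS = 256
BITS_PER_SAMPLE = 4 + 4          # 4-bit real + 4-bit imaginary
BANDWIDTH_HZ = 400_000_000       # 400 MHz radio band
N_SERVERS = 16

# Assumed: one complex sample per input per Hz of bandwidth per second.
total_bps = N_INPUTS * BITS_PER_SAMPLE * BANDWIDTH_HZ
per_server_bps = total_bps // N_SERVERS
per_server_band_hz = BANDWIDTH_HZ // N_SERVERS

# Nominal capacity of eight 10 GbE links versus the per-server load.
link_capacity_bps = 8 * 10_000_000_000
ancillary_bps = 8_000_000_000
headroom_bps = link_capacity_bps - (per_server_bps + ancillary_bps)

print(total_bps / 1e9, "Gb/s total")             # 819.2
print(per_server_bps / 1e9, "Gb/s per server")   # 51.2
print(per_server_band_hz / 1e6, "MHz per server")  # 25.0
print(headroom_bps / 1e9, "Gb/s nominal headroom")  # 20.8
```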
A GPU-based correlator X-engine implemented on the CHIME Pathfinder
N. Denman, M. Amiri, K. Bandura, L. Connor, M. Dobbs, M. Fandino, M. Halpern, A. Hincks, G. Hinshaw, C. Höfer, P. Klages, K. Masui, J. Parra, L. Newburgh, A. Recnik, J. Shaw, K. Sigurdson, Kendrick M. Smith, K. Vanderlinde
We present the design and implementation of a custom GPU-based compute cluster that provides the correlation X-engine of the CHIME Pathfinder radio telescope. It is among the largest such systems in operation, correlating 32,896 baselines (256 inputs) over 400 MHz of radio bandwidth. Making heavy use of consumer-grade parts and a custom software stack, the system was developed at a small fraction of the cost of comparable installations. Unlike existing GPU backends, this system is built around OpenCL kernels running on consumer-level AMD GPUs, taking advantage of low-cost hardware and leveraging packed integer operations to double algorithmic efficiency. The system achieves the required 105 TOPS in a 10 kW power envelope, making it one of the most power-efficient X-engines in use today.
DOI: 10.1109/ASAP.2015.7245702 (https://doi.org/10.1109/ASAP.2015.7245702). Pages: 35-40. Published: 2015-03-20.
Citations: 18
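The baseline count in the abstract above follows from counting every unordered pair of inputs plus each input paired with itself, N(N+1)/2. The Python sketch below checks that figure and derives the power efficiency implied by the quoted 105 TOPS and 10 kW; the function name `n_baselines` is illustrative, and the efficiency value is a derived quantity, not one stated in the abstract.

```python
def n_baselines(n_inputs: int) -> int:
    """Number of input pairs including autocorrelations: N(N+1)/2."""
    return n_inputs * (n_inputs + 1) // 2


baselines = n_baselines(256)

# Implied efficiency from the quoted totals (operations per second per watt).
ops_per_second = 105 * 10**12   # 105 TOPS
power_watts = 10 * 10**3        # 10 kW
ops_per_watt = ops_per_second // power_watts

print(baselines)            # 32896
print(ops_per_watt / 1e9)   # 10.5 (giga-operations per second per watt)
```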
An application-aware approach to systems support for big data
Hong Jiang
Summary form only given. Every day, 2.5 quintillion (2.5×10^18, or 2.5 million trillion) bytes of data are created by people. This data comes from everywhere: from traditional scientific computing and online transactions to popular social network and mobile applications. Data produced in the last two years alone amounts to 90% of the data in the world today! This phenomenal growth and ubiquity of data has ushered in an era of “Big Data”, which brings with it new challenges as well as opportunities. In this talk, I will first discuss big data challenges facing computer and storage systems research, brought on by the huge volume, high velocity, great variety and veracity with which digital data are being produced in the world. I will then introduce some new and ongoing programs at NSF that are relevant to Big Data and to ASAP. Finally, I will present research being conducted in my research group that seeks a scalable systems and application-aware approach to addressing some of the challenges, from the many-core and storage architectures to the systems and up to the applications.
DOI: 10.1109/ASAP.2013.6567537 (https://doi.org/10.1109/ASAP.2013.6567537). Page: 1. Published: 2013-06-05.
Citations: 1
The tunnel vision syndrome: Massively delaying progress
R. Hartenstein
Summary form only given. It is not only the multicore dilemma that massively reduces programmer productivity and slows progress in energy-efficient performance, a critical issue for the long-term overall affordability of computing. Because of the Tunnel Vision Syndrome, the solutions coming from a few isolated areas are far too slow and massively imperfect. Systolic arrays (SA) were introduced by a mathematician. His synthesis method was “of course” algebraic, supporting only a few applications, and sequencing concepts were “not his job”. A decade later we transformed this SA draft into a general-purpose machine paradigm, which was presented at the 3rd and the 8th through 11th ASAP. The acceptance of our other fundamental idea, the top-down use of Term Rewriting Systems (TRS) for microchip design EDA, was delayed by the TRS expert scene by 30 years! The R&D landscape requires radically new solutions. We must avoid the reductionist philosophies of most specialized research areas and introduce connected thinking to bridge the gaps between different paradigms and between several abstraction levels. We must urgently rethink all basic assumptions and far-reaching cooperation patterns.
DOI: 10.1109/ASAP.2013.6567541 (https://doi.org/10.1109/ASAP.2013.6567541). Page: 1. Published: 2013-06-05.
Citations: 2
More than 50 years of parallel processing and still no easy path to speedup
M. Flynn
The following topics are dealt with: reconfigurable systems; computer arithmetic; computer algorithm; system profiling; multicore processor; communication systems; GPU; accelerator; image processing and FPGA application.
DOI: 10.1109/ASAP.2011.6043229 (https://doi.org/10.1109/ASAP.2011.6043229). Page: 4. Published: 2011-09-11.
Citations: 0
Architectures for Green routers
V. Prasanna
As the Information and Communication Technology (ICT) infrastructure continues to evolve, significant energy dissipation is incurred in the core routers. Core router performance will soon be limited by power density. About two-thirds of the power dissipation in a router is in layer 3. Packet forwarding, classification, etc. contribute significantly to this. This talk explores architectures and algorithms for network functions, including deep packet inspection and packet classification, in core routers. We propose energy-efficient designs to realize the “Green Internet” vision. We illustrate the performance improvements for such systems and demonstrate the suitability of FPGAs for these computations. We show that SRAM-based solutions combined with FPGA-based architectures lead to high throughput as well as reduced power dissipation compared with state-of-the-art TCAM-based solutions.
DOI: 10.1109/ASAP.2011.6043230 (https://doi.org/10.1109/ASAP.2011.6043230). Page: 5. Published: 2011-09-11.
Citations: 0
Era of customization and specialization
J. Cong
To drastically improve energy efficiency, we believe that future computer processors need to go beyond parallelization and provide architectural support for customization and specialization, so that the processor architecture can be adapted and optimized for different application domains. Customization can be applied to computing cores, the memory hierarchy, and networks-on-chip for efficient adaptation to different workloads. We also believe that future processor architectures will make extensive use of accelerators to further increase energy efficiency. Such architectures present many new challenges and opportunities, such as accelerator scheduling, sharing, memory hierarchy optimization, and efficient compilation and runtime support. In this talk, I shall present our ongoing research in these areas in the Center for Domain-Specific Computing.
DOI: 10.1109/ASAP.2011.6043228 (https://doi.org/10.1109/ASAP.2011.6043228). Page: 3. Published: 2011-09-11.
Citations: 2
The light at the end of the CMOS tunnel
S. Nassif
In spite of numerous predictions to the contrary, silicon technology is marching along past the 22 nm node and on to ever finer dimensions. Innovations at the technology, device, circuit and system levels continue to enable us to scale in spite of what sometimes appear to be insurmountable problems in power, lack of performance, manufacturability and so on. To a large degree, these innovations are necessary because no substitute technology has been found as yet and, in fact, it does not appear likely that any such technology will become practical this decade. This leaves us with the need to anticipate and predict the near- and medium-term futures of CMOS for the next handful of technology nodes. This talk will focus on doing just that, and will show how an important new constraint on future system scaling is circuit resilience. Resilience is the ability of circuits to operate in spite of challenges like noise, difficult environmental conditions, ageing and manufacturing imperfections. These factors conspire to cause transient or permanent errors that are indistinguishable from traditional "hard" faults typically caused by defects during fabrication. Without significant innovation at the circuit and system levels, the probability of these events can rise quite dramatically. In the area of SRAM, such phenomena have existed for the last three or four technology nodes, but significant investments in this area have indeed allowed continued system-level scaling with ever larger on-chip memories. As these same phenomena start attacking integrated circuits more pervasively, there is an urgent need for research and development in this area to avert the problems certain to arise with increased defect rates. This keynote paper explores the link between the old subject of manufacturing variability and its well-known impact on circuit performance, and the new subject of the way that same variability, in the extreme, can cause complete circuit failure.
With care, we will find that the light at the end of the CMOS tunnel is the opening of new opportunities to enrich CMOS with new technologies like MEMS, optics, sensors and even biological devices. Otherwise, that light is likely to be another train…
DOI: 10.1109/ASAP.2010.5540756 (https://doi.org/10.1109/ASAP.2010.5540756). Pages: 4-9. Published: 2010-07-07.
Citations: 9
Journal
2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP)