
Latest publications: 2010 IEEE 8th Symposium on Application Specific Processors (SASP)

A Coarse Grain Reconfigurable Architecture for sequence alignment problems in bio-informatics
Pub Date : 2010-06-13 DOI: 10.1109/SASP.2010.5521146
Pei Liu, A. Hemani
A Coarse Grain Reconfigurable Architecture (CGRA) tailored for accelerating bio-informatics algorithms is proposed. The key innovation is a lightweight bio-informatics processor that can be reconfigured to perform the different Add-Compare-Select operations of the popular sequencing algorithms. A programmable and scalable architectural platform instantiates an array of such processing elements, allows arbitrary partitioning and scheduling schemes, and is capable of executing complete sequencing algorithms, including their sequential phases, while handling arbitrarily large sequences. The key difference of the proposed CGRA-based solution compared to FPGA- and GPU-based solutions is a much better match between architecture and algorithm, both for the core computational need and for the system-level architectural need. This claim is quantified for three popular sequencing algorithms: Needleman-Wunsch, Smith-Waterman, and HMMER. For the same degree of parallelism, we provide 5X and 15X speed-ups compared to FPGA and GPU respectively. For the same size of silicon, the advantage grows by another factor of 10X.
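The Add-Compare-Select step that the paper's processing elements are reconfigured to perform is the core of these alignment recurrences. As an illustration only (plain Python with hypothetical scoring parameters, not the paper's hardware mapping), a Needleman-Wunsch table fill looks like this:

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score via the Needleman-Wunsch DP recurrence."""
    n, m = len(a), len(b)
    # score[i][j] = best score aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Add-Compare-Select: three additions, then a max
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            score[i][j] = max(diag, up, left)
    return score[n][m]
```

Each inner-loop iteration is exactly one ACS operation; Smith-Waterman differs mainly in clamping the cell value at zero.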
Citations: 4
Customized architectures for faster route finding in GPS-based navigation systems
Pub Date : 2010-06-13 DOI: 10.1109/SASP.2010.5521148
Jason Loew, D. Ponomarev, P. Madden
GPS-based navigation systems first became popular in dedicated handheld devices and are now also found in modern cell phones and other small personal devices. A key element of any navigation system is fast and effective route finding, which depends heavily on Dijkstra's shortest-path algorithm. Dijkstra's algorithm is serial in nature; prior efforts to accelerate it through parallel processing have had almost no success. In this paper, we present a practical approach to extracting small-scale parallelism by shifting priority-queue operations to a secondary, tightly coupled processor. We obtain a substantial speedup on real-world graphs (in particular, road maps), allowing the development of navigation systems that are more responsive and also lower in total power consumption.
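The serial bottleneck the paper attacks is the priority-queue portion of Dijkstra's algorithm. A minimal software reference (a standard binary-heap Dijkstra in Python; the paper instead offloads these queue operations to a tightly coupled coprocessor):

```python
import heapq

def dijkstra(graph, source):
    """graph: {node: [(neighbor, weight), ...]}; returns shortest distances."""
    dist = {source: 0}
    pq = [(0, source)]  # the heap push/pop operations dominate the serial cost
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry, already relaxed via a shorter path
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return dist
```

On road-map graphs the edge relaxations are cheap and plentiful, which is why moving the `heappush`/`heappop` work off the main core exposes usable parallelism.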
Citations: 2
A hardware pipeline for accelerating ray traversal algorithms on streaming processors
Pub Date : 2010-06-13 DOI: 10.1109/SASP.2010.5521150
Michael Steffen, Joseph Zambreno
Ray tracing is a graphics rendering method that uses rays to trace the path of light in a computer model. To accelerate the processing of rays, scenes are typically compiled into smaller spatial boxes using a tree structure, and rays then traverse the tree structure to determine the relevant spatial boxes. This limits computations involving rays and scene objects to only the objects close to each ray, rather than requiring all elements of the computer model to be processed. We present a ray traversal pipeline designed to accelerate ray tracing traversal algorithms using a combination of currently used programmable graphics processors and a new fixed hardware pipeline. Our fixed hardware pipeline performs an initial traversal operation that quickly identifies a smaller, fixed-granularity spatial bounding box within the original scene. This spatial box can then be traversed further to identify successively smaller spatial bounding boxes using any user-defined acceleration algorithm. We show that our pipeline allows an expected level of user programmability, including the development of custom data structures, and can support a wide range of processor architectures. The performance of our pipeline is evaluated for the ray traversal and intersection stages using a kd-tree ray tracing algorithm and a custom simulator modeling a generic streaming-processor architecture. Experimental results show that our pipeline reduces the number of instructions executed on a graphics processor for the traversal operation by 2.15X for visible rays. The memory bandwidth required for traversal is also reduced by a factor of 1.3X for visible rays.
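The per-node work in such a traversal is dominated by ray/bounding-box tests. As a software sketch of the standard slab test for an axis-aligned bounding box (illustrative only; the paper implements traversal in fixed hardware):

```python
def ray_box_intersect(origin, direction, box_min, box_max):
    """Slab test: does the ray origin + t*direction (t >= 0) hit the AABB?"""
    t_near, t_far = float("-inf"), float("inf")
    for o, d, lo, hi in zip(origin, direction, box_min, box_max):
        if d == 0.0:
            if o < lo or o > hi:
                return False  # ray is parallel to this slab and outside it
        else:
            t1, t2 = (lo - o) / d, (hi - o) / d
            if t1 > t2:
                t1, t2 = t2, t1
            # Intersect this slab's [t1, t2] interval with the running interval
            t_near, t_far = max(t_near, t1), min(t_far, t2)
            if t_near > t_far or t_far < 0.0:
                return False  # empty interval, or box entirely behind the ray
    return True
```

A tree traversal repeats this test down the hierarchy, descending only into boxes the ray actually crosses.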
Citations: 4
CMA: Chip multi-accelerator
Pub Date : 2010-06-13 DOI: 10.1109/SASP.2010.5521152
Dominik Auras, Sylvain Girbal, H. Berry, O. Temam, S. Yehia
Custom acceleration has been a standard choice in embedded systems thanks to the power density and performance efficiency it provides. Parallelism is another orthogonal scalability path that efficiently overcomes the increasing limitation of frequency scaling in current general-purpose architectures. In this paper we propose a multi-accelerator architecture that combines the best of both worlds, parallelism and custom acceleration, while addressing the programmability inconvenience of heterogeneous multiprocessing systems. A Chip Multi-Accelerator (CMA) is a regular parallel architecture where each core is complemented with a custom accelerator to speed up specific functions. Furthermore, by using techniques to efficiently merge more than one custom accelerator together, we are able to cram as many accelerators as needed by the application or a domain of applications. We demonstrate our approach on a Software Defined Radio (SDR) case study. We show that starting from a baseline description of several SDR waveforms and candidate tasks for acceleration, we are able to map the different waveforms on the heterogeneous multi-accelerator architecture while keeping a logical view of a regular multi-core architecture, thus simplifying the mapping of the waveforms onto the multi-accelerator.
Citations: 10
Accelerating DNA analysis applications on GPU clusters
Pub Date : 2010-06-13 DOI: 10.1109/SASP.2010.5521145
Antonino Tumeo, Oreste Villa
DNA analysis is an emerging application of high-performance bioinformatics. Modern sequencing machines are able to provide, in a few hours, large input streams of data that need to be matched against exponentially growing databases of known fragments. The ability to recognize these patterns effectively and quickly may extend the scale and reach of the investigations performed by biologists. Aho-Corasick is an exact, multiple-pattern matching algorithm often at the base of this application. In this paper we present an efficient implementation of the Aho-Corasick algorithm for high-performance clusters accelerated with Graphics Processing Units (GPUs). We discuss how we partitioned and adapted the algorithm to fit the Tesla C1060 GPU and then present an MPI-based implementation for a heterogeneous high-performance cluster. We compare this implementation to MPI and MPI-with-pthreads implementations on a homogeneous cluster of x86 processors, discussing stability versus performance and the scaling of the solutions, taking into consideration aspects such as the bandwidth among the different nodes.
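For reference, the Aho-Corasick automaton at the heart of this application can be sketched in a few dozen lines of Python (a serial, dict-based illustration; the paper's contribution is the GPU/MPI parallelization, not this construction):

```python
from collections import deque

def build_aho_corasick(patterns):
    """Build goto/fail/output tables for exact multi-pattern matching."""
    goto, fail, out = [{}], [0], [set()]
    for pat in patterns:                      # 1) trie of all patterns
        s = 0
        for ch in pat:
            if ch not in goto[s]:
                goto.append({}); fail.append(0); out.append(set())
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].add(pat)
    q = deque(goto[0].values())               # 2) BFS to fill failure links
    while q:
        s = q.popleft()
        for ch, t in goto[s].items():
            q.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            nxt = goto[f].get(ch, 0)
            fail[t] = nxt if nxt != t else 0  # guard for depth-1 states
            out[t] |= out[fail[t]]            # inherit matches from suffix state
    return goto, fail, out

def search(text, goto, fail, out):
    """Return (start_index, pattern) for every match in one pass over text."""
    s, hits = 0, []
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        for pat in out[s]:
            hits.append((i - len(pat) + 1, pat))
    return hits
```

The single linear scan over the text, with no backtracking, is what makes the algorithm a good fit for streaming large inputs through accelerators.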
Citations: 41
Design of a custom VEE core in a chip multiprocessor
Pub Date : 2010-06-13 DOI: 10.1109/SASP.2010.5521138
Dan Upton, K. Hazelwood
Chip multiprocessors provide an opportunity for continuing performance growth in the face of limited single-thread parallelism. Although the best design path for such chips remains open, application-specific core designs have shown promise. This work considers the design of an application-specific core for a virtual execution environment. We use Pin, a widely-used dynamic binary instrumentation system, as a representative process-level VEE. Through a combination of microarchitectural simulation and hardware performance counters, we profile the VEE in terms of cache behavior, functional unit usage, and branch predictor behavior, and compare its performance to the performance of benchmark applications. We then show that running the VEE on our specialized core uses up to 15% less power per cycle and up to 5% less energy overall than running the same VEE on a general-purpose core.
Citations: 0
Efficient design and generation of a multi-facet arbiter
Pub Date : 2010-06-13 DOI: 10.1109/SASP.2010.5521137
J. Jou, Yun-Lung Lee, Sih-Sian Wu
Based on the arbiter template developed in [1], we present an efficient, modular, and scalable decentralized parallel design of a new multi-facet arbiter. Moreover, with this modular and reusable hardware design, we have implemented a parametric arbiter generator that automatically generates various multi-facet arbiters. With the decentralized parallel design and the generator, not only the fastest and smallest round-robin arbiter but also other arbiter types can be designed and generated on the fly. Experimental results demonstrate the designs' excellent performance.
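The round-robin policy mentioned above can be modeled behaviorally: priority rotates so the requester just after the previous winner is checked first. A minimal software model (illustrative only; the paper's arbiters are hardware designs generated from a template):

```python
def round_robin_arbiter(requests, last_grant):
    """Grant one asserted request line, rotating priority after each grant.

    requests: list of 0/1 request flags; last_grant: index granted last cycle.
    Returns the granted index, or None if no line is requesting.
    """
    n = len(requests)
    for offset in range(1, n + 1):
        idx = (last_grant + offset) % n  # start just past the previous winner
        if requests[idx]:
            return idx
    return None
```

Feeding the grant back as `last_grant` on the next cycle gives every requester a bounded wait, which is the fairness property round-robin arbiters exist to provide.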
Citations: 3
I-cache configurability for temperature reduction through replicated cache partitioning
Pub Date : 2010-06-13 DOI: 10.1109/SASP.2010.5521143
M. Paul, Peter Petrov
On-chip caches are known to be a major contributor to leakage power, as they occupy a sizable fraction of the chip's real estate, and as such they have been a target of power optimization techniques. However, many of these techniques do not consider the effect of temperature on leakage power and can hence be suboptimal, since leakage power rises rapidly with temperature. When large fractions of the cache are disabled and only a small partition is used, the power density increases significantly, which leads to increased temperature and leakage. We propose a temperature reduction methodology that leverages recently introduced configurable caches, in order not only to assign the task a cache partition commensurate with its current demand but also to minimize the associated power density and temperature. To counteract the effect of elevated power density and achieve temperature reductions, the proposed technique replicates each such cache partition, with only one replica active at any time. The inactive partition replicas are placed into a low-power drowsy mode while the primary partition services the task's instruction requests. By periodically switching the task's association between replica cache partitions, the power density, and hence the temperature, are reduced.
Citations: 1
A dynamically reconfigurable asynchronous processor
Pub Date : 2010-06-13 DOI: 10.1109/SASP.2010.5521141
Khodor Ahmad Fawaz, T. Arslan, S. Khawam, M. Muir, I. Nousias, Iain A. B. Lindsay, A. Erdogan
The main design requirements for high-throughput mobile applications are energy efficiency and programmability. This paper presents a novel dynamically reconfigurable processor that targets these requirements. Our processor consists of a heterogeneous array of coarse-grain asynchronous cells. The architecture maintains most of the benefits of custom asynchronous design, while also providing programmability via conventional high-level languages. Results show that our processor delivers considerably lower power consumption than a market-leading VLIW and a low-power ARM processor, while maintaining their throughput performance. For example, our processor reduced power consumption by more than 9 times relative to the ARM7 processor when running the bilinear demosaicing algorithm at the same throughput. Our processor was also compared to an equivalent synchronous design, resulting in a power reduction of up to 15%.
Citations: 3
FPGA and GPU implementation of large scale SpMV
Pub Date : 2010-06-01 DOI: 10.1109/SASP.2010.5521144
Yi Shan, Tianji Wu, Yu Wang, Bo Wang, Zilong Wang, Ningyi Xu, Huazhong Yang
Sparse matrix-vector multiplication (SpMV) is a fundamental operation in many applications. Many studies have implemented SpMV on different platforms, but few have focused on very large datasets with millions of dimensions. This paper addresses the challenges of implementing large-scale SpMV with an FPGA and a GPU in the application of web link-graph analysis. In the FPGA implementation, we designed the task partitioning and memory hierarchy according to an analysis of the datasets' scale and access patterns. In the GPU implementation, we designed a fast and scalable three-pass SpMV routine using a modified Compressed Sparse Row format. Results show that the FPGA and GPU implementations achieve about 29x and 30x speedup on a Stratix II EP2S180 FPGA and a Radeon 5870 graphics card respectively, compared with a Phenom 9550 CPU.
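The Compressed Sparse Row (CSR) layout the GPU routine builds on stores one values array, one column-index array, and a row-pointer array per matrix. A minimal reference SpMV over plain CSR (the paper uses a modified CSR; this sketch shows only the baseline format):

```python
def csr_spmv(values, col_idx, row_ptr, x):
    """Compute y = A @ x for a sparse matrix A in Compressed Sparse Row form.

    values:  non-zero entries, row by row
    col_idx: column index of each non-zero entry
    row_ptr: row i's non-zeros occupy values[row_ptr[i]:row_ptr[i+1]]
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[k] * x[col_idx[k]]
    return y
```

Web link graphs make `row_ptr` spans extremely uneven, which is why load-balancing this loop across parallel hardware is the hard part.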
{"title":"FPGA and GPU implementation of large scale SpMV","authors":"Yi Shan, Tianji Wu, Yu Wang, Bo Wang, Zilong Wang, Ningyi Xu, Huazhong Yang","doi":"10.1109/SASP.2010.5521144","DOIUrl":"https://doi.org/10.1109/SASP.2010.5521144","url":null,"abstract":"Sparse matrix-vector multiplication (SpMV) is a fundamental operation for many applications. Many studies have been done to implement the SpMV on different platforms, while few work focused on the very large scale datasets with millions of dimensions. This paper addresses the challenges of implementing large scale SpMV with FPGA and GPU in the application of web link graph analysis. In the FPGA implementation, we designed the task partition and memory hierarchy according to the analysis of datasets scale and their access pattern. In the GPU implementation, we designed a fast and scalable SpMV routine with three passes, using a modified Compressed Sparse Row format. Results show that FPGA and GPU implementation achieves about 29x and 30x speedup on a StratixII EP2S180 FPGA and Radeon 5870 Graphic Card respectively compared with a Phenom 9550 CPU.","PeriodicalId":119893,"journal":{"name":"2010 IEEE 8th Symposium on Application Specific Processors (SASP)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2010-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130114735","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 31