
2011 Second Workshop on Architecture and Multi-Core Applications (WAMCA 2011): Latest Publications

Economical Two-fold Working Precision Matrix Multiplication on Consumer-Level CUDA GPUs
N. Fujimoto
A dot product faithfully rounded after being computed "as if" in $K$-fold working precision (K >= 2) is known to be computable using only floating-point numbers as defined in the IEEE 754 floating-point standard. This paper presents a CUDA GPU implementation of two-fold working precision matrix multiplication based on this dot product computation method. Experimental results on a GeForce GTX 580 and a GTX 560 Ti show that the proposed implementation achieves 1.84 to 1.95 times higher GFLOPS performance in two-fold working precision than CUBLAS dgemm in double precision on a Tesla C2070 high-end GPU. The proposed implementation can be used to obtain higher performance in pseudo double precision on low-cost consumer-level GPUs whose native double-precision performance is limited.
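For illustration, a minimal CPU-side sketch of the kind of error-free-transformation dot product the abstract refers to (a Dot2-style accumulation in the spirit of Ogita, Rump, and Oishi). The function names and the choice of single precision as the working precision are assumptions; this is not the paper's CUDA kernel.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Error-free transformation of a sum: s + e == a + b exactly (Knuth's TwoSum).
static void two_sum(float a, float b, float &s, float &e) {
    s = a + b;
    float t = s - a;
    e = (a - (s - t)) + (b - t);
}

// Error-free transformation of a product via FMA: p + e == a * b exactly.
static void two_prod(float a, float b, float &p, float &e) {
    p = a * b;
    e = std::fma(a, b, -p);
}

// Dot product "as if" computed in two-fold working precision: the rounding
// errors of every product and partial sum are accumulated separately and
// folded back in at the end.
float dot2(const std::vector<float> &x, const std::vector<float> &y) {
    float s = 0.0f, c = 0.0f;
    for (std::size_t i = 0; i < x.size(); ++i) {
        float p, ep, es;
        two_prod(x[i], y[i], p, ep);
        two_sum(s, p, s, es);
        c += ep + es;  // low-order part, kept in working precision
    }
    return s + c;
}
```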
Citations: 2
Large Scale Kronecker Product on Supercomputers
C. Tadonki
The Kronecker product, also called the tensor product, is a fundamental matrix algebra operation widely used as a natural formalism to express a convolution of many interactions or representations. Given a set of matrices, we need to multiply their Kronecker product by a vector. This operation is a critical kernel for iterative algorithms and thus needs to be computed efficiently. In previous work, we proposed a cost-optimal parallel algorithm for this problem, in terms of both floating-point computation time and interprocessor communication steps. However, the lower bound on data transfers can only be achieved if we really perform (local) logarithmic broadcasts; in practice, we use a communication loop instead. It therefore becomes important to account for the real cost of each broadcast. As this local broadcast is performed simultaneously by every processor, the situation worsens on large numbers of processors (supercomputers). We address the problem in this paper in two ways. On the one hand, we propose a way to build a virtual topology with the smallest gap to the theoretical lower bound. On the other hand, we consider a hybrid implementation, which has the advantage of reducing the number of communicating nodes. We illustrate our work with benchmarks on a large 8-core SMP supercomputer.
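As a point of reference, a minimal sequential sketch of the standard identity such Kronecker-times-vector kernels build on, (A ⊗ B) vec(X) = vec(B X Aᵀ), which avoids ever forming the Kronecker product. The paper's parallel algorithm, virtual topology, and hybrid implementation are not shown, and all names here are illustrative.

```cpp
#include <cstddef>
#include <vector>

// y = (A kron B) * x without forming the Kronecker product.
// A is m x m, B is n x n, both row-major; x and y have length m*n and are
// interpreted as column-major vec() of an n x m matrix X, i.e. X(i,j) = x[j*n + i].
void kron_mat_vec(const std::vector<double> &A, std::size_t m,
                  const std::vector<double> &B, std::size_t n,
                  const std::vector<double> &x, std::vector<double> &y) {
    // T = B * X  (n x m), stored column-major as T[j*n + i] = T(i,j)
    std::vector<double> T(n * m, 0.0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < n; ++k)
            for (std::size_t j = 0; j < m; ++j)
                T[j * n + i] += B[i * n + k] * x[j * n + k];
    // Y = T * A^T  (n x m), written back as y = vec(Y) column-major
    y.assign(m * n, 0.0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < m; ++j)
            for (std::size_t k = 0; k < m; ++k)
                y[j * n + i] += T[k * n + i] * A[j * m + k];
}
```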
Citations: 3
Adaptive Power Optimization of On-chip SNUCA Cache on Tiled Chip Multicore Architecture Using Remap Policy
A. Mandke, B. Amrutur, Y. Srikant
Advances in technology have increased the number of cores and the size of the caches present on chip multicore platforms (CMPs). As a result, the leakage power consumption of on-chip caches has already become a major power-consuming component of the memory subsystem. We propose to reduce leakage power consumption in a static non-uniform cache architecture (SNUCA) on a tiled CMP by dynamically varying the number of cache slices used and switching off unused cache slices. A cache slice in a tile includes all cache banks present in that tile. Switched-off cache slices are remapped considering the communication costs, reducing cache usage with minimal impact on execution time. This saves the leakage power consumed by switched-off L2 cache slices. On average, the remap policy achieves 41% and 49% higher EDP savings than the static and dynamic NUCA (DNUCA) cache policies, respectively, on a scalable tiled CMP.
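A minimal sketch of what a remap lookup of this general flavor could look like: a block's home slice is chosen by address interleaving, and when that slice has been switched off to save leakage power, the access is redirected to a powered-on slice with the smallest Manhattan distance as a proxy for communication cost. This is only an illustration under assumed names and parameters, not the paper's actual remap policy.

```cpp
#include <cstdint>
#include <cstdlib>
#include <limits>
#include <vector>

struct TiledLLC {
    int mesh_x, mesh_y;              // tile grid dimensions
    std::vector<bool> slice_on;      // one L2 cache slice per tile, on or off

    // Home slice chosen by interleaving 64 B block addresses across tiles.
    int home_slice(std::uint64_t addr) const {
        return static_cast<int>((addr >> 6) % (mesh_x * mesh_y));
    }

    // If the home slice is switched off, remap to the nearest active slice.
    int remap(std::uint64_t addr, int requester_tile) const {
        int home = home_slice(addr);
        if (slice_on[home]) return home;
        int best = -1, best_dist = std::numeric_limits<int>::max();
        for (int t = 0; t < mesh_x * mesh_y; ++t) {
            if (!slice_on[t]) continue;
            int dist = std::abs(t % mesh_x - requester_tile % mesh_x) +
                       std::abs(t / mesh_x - requester_tile / mesh_x);
            if (dist < best_dist) { best_dist = dist; best = t; }
        }
        return best;                 // nearest powered-on slice
    }
};
```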
Citations: 16
Evaluating the Problem of Process Mapping on Network-on-Chip for Parallel Applications
Cíntia P. Avelar, Poliana A. C. Oliveira, H. Freitas, P. Navaux
Process mapping on Networks-on-Chip (NoCs) is an important issue for future many-core processors. Mapping strategies can increase performance and scalability by optimizing the communication cost. However, parallel applications perform a large amount of collective communication, generating heavy traffic on the Network-on-Chip. Our goal in this paper is therefore to evaluate the process-mapping problem for parallel applications. The results show that performance is similar across different mappings. The reason lies in collective communication, due to the high number of packets exchanged by all routers. Our evaluation shows that the topology and routing protocol can influence process mapping. Consequently, different mapping strategies must be evaluated for different NoC architectures.
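As an illustration of the kind of metric such mapping strategies optimize, a hop-weighted communication cost for a process-to-tile placement on a 2D mesh is sketched below; under all-to-all collective traffic this cost is largely insensitive to the placement, which is consistent with the abstract's observation. The function and its parameters are assumptions, not the paper's evaluation method.

```cpp
#include <cstddef>
#include <cstdlib>
#include <vector>

// Total communication cost of a mapping on a 2D mesh NoC with XY routing:
// traffic[i][j] is the volume exchanged between processes i and j, and
// mapping[i] is the tile index that process i is placed on.
long long mapping_cost(const std::vector<std::vector<long long>> &traffic,
                       const std::vector<int> &mapping, int mesh_x) {
    long long cost = 0;
    for (std::size_t i = 0; i < traffic.size(); ++i) {
        for (std::size_t j = 0; j < traffic.size(); ++j) {
            int ti = mapping[i], tj = mapping[j];
            int hops = std::abs(ti % mesh_x - tj % mesh_x) +
                       std::abs(ti / mesh_x - tj / mesh_x);
            cost += traffic[i][j] * hops;   // volume weighted by hop distance
        }
    }
    return cost;
}
```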
Citations: 2
Trace-Based Visualization as a Tool to Understand Applications' I/O Performance in Multi-core Machines
Rodrigo Kassick, F. Boito, M. Diener, P. Navaux, Y. Denneulin, C. Schepke, N. Maillard, Carla Osthoff, P. Grunmann, P. Dias, J. Panetta
This paper presents the use of trace-based performance visualization of a large-scale atmospheric model, the Ocean-Land-Atmosphere Model (OLAM). The trace was obtained with the libRastro library, and the visualization was done with Pajé. The visualization was used to analyze OLAM's performance and to identify its bottlenecks. In particular, we are interested in the model's I/O operations, since they proved to be the main issue for the model's performance. We show that most of the time spent in the output routine is spent in the close operation. With this information, we delayed this operation until the next output phase, obtaining improved I/O performance.
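A minimal sketch of the delayed-close idea described in the abstract: instead of paying the expensive close at the end of every output phase, the file handle is kept open and only closed right before the next output phase (or at shutdown). The class and its interface are hypothetical; the paper's actual OLAM output routine is not shown.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

class DelayedCloseWriter {
    std::FILE *fp_ = nullptr;
public:
    void write_phase(const std::string &path, const char *buf, std::size_t n) {
        if (fp_) {                 // close the previous phase's file lazily,
            std::fclose(fp_);      // just before the next output begins
            fp_ = nullptr;
        }
        fp_ = std::fopen(path.c_str(), "wb");
        if (fp_) std::fwrite(buf, 1, n, fp_);
        // no fclose() here: it is deferred to the next call or the destructor
    }
    ~DelayedCloseWriter() { if (fp_) std::fclose(fp_); }
};
```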
Citations: 2