It is known that a dot product, faithfully rounded as if computed in $K$-fold working precision (K >= 2), can be computed using only floating-point numbers as defined in the IEEE 754 floating-point standard. This paper presents a CUDA GPU implementation of two-fold working precision matrix multiplication based on this dot product computation method. Experimental results on a GeForce GTX 580 and a GTX 560 Ti show that the proposed implementation achieves 1.84 to 1.95 times higher GFLOPS performance in two-fold working precision than CUBLAS dgemm achieves in double precision on a high-end Tesla C2070 GPU. The proposed implementation can thus deliver higher pseudo-double-precision performance on low-cost consumer-level GPUs whose native double-precision performance is limited.
"Economical Two-fold Working Precision Matrix Multiplication on Consumer-Level CUDA GPUs", N. Fujimoto, doi:10.1109/WAMCA.2011.18, 2011 Second Workshop on Architecture and Multi-Core Applications (WAMCA 2011), 2011-10-26.
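The kind of faithfully rounded dot product the abstract refers to rests on error-free transformations, in the style of the Dot2 algorithm of Ogita, Rump, and Oishi. A minimal CPU-side Python sketch (illustrative only; the paper's contribution is the CUDA kernel, not this scalar loop):

```python
def two_sum(a, b):
    # Knuth's TwoSum: s + err == a + b exactly, for IEEE doubles
    s = a + b
    bb = s - a
    return s, (a - (s - bb)) + (b - bb)

def split(a):
    # Dekker's splitting: a == hi + lo, each half fits in 26 bits
    c = 134217729.0 * a  # 2**27 + 1
    hi = c - (c - a)
    return hi, a - hi

def two_prod(a, b):
    # p + err == a * b exactly (FMA-free Dekker/Veltkamp form)
    p = a * b
    ah, al = split(a)
    bh, bl = split(b)
    err = ((ah * bh - p) + ah * bl + al * bh) + al * bl
    return p, err

def dot2(x, y):
    # Dot product as if evaluated in two-fold working precision,
    # then rounded once at the end
    p, s = two_prod(x[0], y[0])
    for xi, yi in zip(x[1:], y[1:]):
        h, r = two_prod(xi, yi)
        p, q = two_sum(p, h)
        s += q + r
    return p + s
```

On an ill-conditioned input such as x = [1e16, 1.0, -1e16], y = [1.0, 1.0, 1.0], a naive double-precision dot product loses the middle term entirely and returns 0.0, while dot2 returns the exact result 1.0.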
The Kronecker product, also called the tensor product, is a fundamental matrix-algebra operation that is widely used as a natural formalism to express a convolution of many interactions or representations. Given a set of matrices, we need to multiply their Kronecker product by a vector. This operation is a critical kernel for iterative algorithms and therefore needs to be computed efficiently. In previous work, we proposed a cost-optimal parallel algorithm for the problem, both in terms of floating-point computation time and interprocessor communication steps. However, the lower bound on data transfers can only be achieved if we really perform (local) logarithmic broadcasts; in practice, we use a communication loop instead. It therefore becomes important to account for the real cost of each broadcast. Since this local broadcast is performed simultaneously by each processor, the situation worsens on a large number of processors (supercomputers). We address the problem in this paper on two fronts. On the one hand, we propose a way to build a virtual topology that has the smallest gap to the theoretical lower bound. On the other hand, we consider a hybrid implementation, which has the advantage of reducing the number of communicating nodes. We illustrate our work with benchmarks on a large 8-core SMP supercomputer.
"Large Scale Kronecker Product on Supercomputers", C. Tadonki, doi:10.1109/WAMCA.2011.10, 2011 Second Workshop on Architecture and Multi-Core Applications (WAMCA 2011), 2011-10-26.
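The key structural fact such algorithms exploit is that (A ⊗ B)x can be evaluated without ever forming A ⊗ B: apply B to each length-q block of x, then combine the blocks with A, reducing the cost from O(mpnq) to O(npq + mnp) for A of size m×n and B of size p×q. A minimal serial Python sketch (illustrative only; the paper's contribution is the parallel communication scheme, not this loop):

```python
def kron_matvec(A, B, x):
    # Compute y = (A kron B) x without materializing the Kronecker product.
    m, n = len(A), len(A[0])
    p, q = len(B), len(B[0])
    assert len(x) == n * q
    # Step 1: apply B to each of the n length-q blocks of x
    Z = [[sum(B[k][l] * x[j * q + l] for l in range(q))
          for k in range(p)]
         for j in range(n)]
    # Step 2: combine the transformed blocks with A
    y = [0.0] * (m * p)
    for i in range(m):
        for k in range(p):
            y[i * p + k] = sum(A[i][j] * Z[j][k] for j in range(n))
    return y
```

For example, with A = [[1,2],[3,4]], B = [[0,1],[1,0]], and x = [1,2,3,4], the result matches multiplying by the explicit 4×4 Kronecker product.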
Advances in technology have increased the number of cores and the size of caches on chip multicore platforms (CMPs). As a result, the leakage power consumption of on-chip caches has become a major power-consuming component of the memory subsystem. We propose to reduce leakage power consumption in a static non-uniform cache architecture (SNUCA) on a tiled CMP by dynamically varying the number of cache slices used and switching off unused cache slices. A cache slice in a tile comprises all cache banks present in that tile. Switched-off cache slices are remapped with communication costs taken into account, reducing cache usage with minimal impact on execution time. This saves leakage power in the switched-off L2 cache slices. On average, the remap policy achieves 41% and 49% higher EDP savings on a scalable tiled CMP than static and dynamic NUCA (DNUCA) cache policies, respectively.
"Adaptive Power Optimization of On-chip SNUCA Cache on Tiled Chip Multicore Architecture Using Remap Policy", A. Mandke, B. Amrutur, Y. Srikant, doi:10.1109/WAMCA.2011.14, 2011 Second Workshop on Architecture and Multi-Core Applications (WAMCA 2011), 2011-10-26.
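One way to picture communication-cost-aware remapping is as a nearest-active-slice redirection on the 2D mesh of tiles: requests destined for a switched-off slice go to the powered-on slice with the fewest hops. This is a hypothetical toy sketch of the idea, not the paper's actual policy:

```python
def hops(a, b, width):
    # Manhattan (hop) distance between tile ids a and b on a
    # `width`-column 2D mesh, with tiles numbered row-major
    ar, ac = divmod(a, width)
    br, bc = divmod(b, width)
    return abs(ar - br) + abs(ac - bc)

def remap(off_tile, active_tiles, width):
    # Redirect accesses for a switched-off slice to the active slice
    # with the lowest communication cost (fewest mesh hops)
    return min(active_tiles, key=lambda t: hops(off_tile, t, width))
```

On a 4×4 mesh, for instance, a switched-off slice at tile 5 (row 1, col 1) would be remapped to tile 0 (2 hops) rather than tile 15 (4 hops).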
Cíntia P. Avelar, Poliana A. C. Oliveira, H. Freitas, P. Navaux
Process mapping on networks-on-chip (NoCs) is an important issue for future many-core processors. Mapping strategies can increase performance and scalability by optimizing the communication cost. However, parallel applications perform many collective communications, which generate high traffic on the network-on-chip. Our goal in this paper is therefore to evaluate the process-mapping problem for parallel applications. The results show that performance is similar across the different mappings. This can be explained by collective communication, due to the high number of packets exchanged by all routers. Our evaluation shows that the topology and the routing protocol can influence process mapping. Consequently, different mapping strategies must be evaluated for different NoC architectures.
"Evaluating the Problem of Process Mapping on Network-on-Chip for Parallel Applications", Cíntia P. Avelar, Poliana A. C. Oliveira, H. Freitas, P. Navaux, doi:10.1109/WAMCA.2011.13, 2011 Second Workshop on Architecture and Multi-Core Applications (WAMCA 2011), 2011-10-26.
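The communication cost that mapping strategies optimize can be quantified, in a simple hop-count model (an assumption for illustration; the paper may use a different metric), as total traffic volume weighted by mesh distance between the tiles the communicating processes are placed on:

```python
def mesh_hops(a, b, width):
    # Manhattan distance between tiles a and b on a `width`-column 2D mesh
    ar, ac = divmod(a, width)
    br, bc = divmod(b, width)
    return abs(ar - br) + abs(ac - bc)

def mapping_cost(traffic, placement, width):
    # traffic:   dict (proc_i, proc_j) -> message volume between the pair
    # placement: dict proc -> tile id on the mesh
    return sum(vol * mesh_hops(placement[i], placement[j], width)
               for (i, j), vol in traffic.items())
```

With heavy collective communication the traffic dictionary becomes dense and near-uniform across all pairs, which is exactly why different placements end up with similar total cost.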
Rodrigo Kassick, F. Boito, M. Diener, P. Navaux, Y. Denneulin, C. Schepke, N. Maillard, Carla Osthoff, P. Grunmann, P. Dias, J. Panetta
This paper presents the use of trace-based performance visualization of a large-scale atmospheric model, the Ocean-Land-Atmosphere Model (OLAM). The trace was obtained with the libRastro library, and the visualization was done with Pajé. Visualization was used to analyze OLAM's performance and to identify its bottlenecks. We are especially interested in the model's I/O operations, since they proved to be the main issue for the model's performance. We show that most of the time spent in the output routine is spent in the close operation. With this information, we delayed this operation until the next output phase, obtaining improved I/O performance.
"Trace-Based Visualization as a Tool to Understand Applications' I/O Performance in Multi-core Machines", Rodrigo Kassick, F. Boito, M. Diener, P. Navaux, Y. Denneulin, C. Schepke, N. Maillard, Carla Osthoff, P. Grunmann, P. Dias, J. Panetta, doi:10.1109/WAMCA.2011.12, 2011 Second Workshop on Architecture and Multi-Core Applications (WAMCA 2011), 2011-10-01.