
2010 First International Conference on Networking and Computing: Latest Papers

Matrix Multiply-Add in Min-plus Algebra on a Short-Vector SIMD Processor of Cell/B.E.
Pub Date : 2010-11-17 DOI: 10.1109/IC-NC.2010.29
Kazuya Matsumoto, S. Sedukhin
It is well known that the all-pairs shortest paths problem is algorithmically similar to the classical matrix-matrix multiply-add (MMA) problem. One difference between the two problems is the underlying algebra: matrix multiply-add uses the linear (+, ×)-algebra, whereas the all-pairs shortest paths problem uses the (min, +)-algebra. This paper presents an implementation of a 64×64 matrix multiply-add kernel in (min, +)-algebra on a short-vector SIMD processor, the Synergistic Processing Element (SPE) of the Cell Broadband Engine (Cell/B.E.). Our implementation for the shortest paths problem adopts an existing fast matrix multiply-add algorithm and reduces the number of required registers. The MMA implementation in (min, +)-algebra achieves 8.502 Gflop/s, which is about one third of the speed of the (+, ×)-algebra MMA and very close to the theoretical estimate based on the required number of instructions.
Citations: 3
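To make the algebra in the abstract above concrete, here is a minimal reference sketch of a matrix multiply-add in (min, +)-algebra, where multiplication becomes addition and addition becomes `min`. It is a plain scalar loop for illustration only, not the paper's SIMD kernel for the SPE; the function name `minplus_mma` and the 3×3 example graph are assumptions made for this sketch.

```python
import math

INF = math.inf  # "no edge" in the (min, +) setting


def minplus_mma(A, B, C):
    """Return D with D[i][j] = min(C[i][j], min over k of A[i][k] + B[k][j]).

    This is the (min, +) counterpart of the usual D = C + A * B.
    Illustrative scalar version; not the vectorized kernel from the paper.
    """
    rows, inner, cols = len(A), len(B), len(B[0])
    D = [row[:] for row in C]  # copy C so the inputs stay untouched
    for i in range(rows):
        for k in range(inner):
            a = A[i][k]
            for j in range(cols):
                candidate = a + B[k][j]   # "multiply" is +
                if candidate < D[i][j]:   # "add" is min
                    D[i][j] = candidate
    return D


# Hypothetical 3-vertex weighted digraph: W[i][j] is the edge weight i -> j.
W = [
    [0, 3, INF],
    [INF, 0, 1],
    [2, INF, 0],
]

# One multiply-add step yields shortest paths that use at most two edges.
D2 = minplus_mma(W, W, W)
print(D2)
```

Repeating this step (squaring the distance matrix) extends the maximum path length, which is why the all-pairs shortest paths problem maps naturally onto an MMA kernel.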
Evaluation Framework for GPU Performance Based on OpenCL Standard
Pub Date : 2010-11-17 DOI: 10.1109/IC-NC.2010.32
Martin Jurecko, J. Kocisová, J. Buša, T. Kasanický, M. Domiter, M. Zvada
Many projects focus on measuring GPU performance, but there is no unifying test framework that can be used to evaluate generic floating-point-intensive applications. This work describes a testing suite for evaluating GPUs that measures the raw performance and numerical precision of a subset of OpenCL operations, and analyzes results obtained from several commonly available high-end GPUs.
Citations: 3
Proposition of Criteria for Aborting Transaction Based on Log Data Size in LogTM
Pub Date : 2010-11-17 DOI: 10.1109/IC-NC.2010.51
Hiroki Asai, Tomoaki Tsumura, H. Matsuo
Lock-based synchronization techniques are commonly used in parallel programming on multi-core processors. However, locks can cause deadlocks and poor scalability. Hence, LogTM has been proposed and studied for lock-free synchronization. LogTM is a kind of hardware transactional memory. In LogTM, transactions are executed speculatively while ensuring serializability and atomicity. LogTM stores original values in a log before they are modified by a transaction. If a transaction accesses a shared datum that has been accessed by another transaction running in parallel, LogTM detects this as a conflict, restores all data from the associated log, and restarts the transaction. This is called aborting. On abort, the cost of restoring data from a log increases in proportion to the size of the data in the log. However, LogTM selects which transaction to abort by the time at which each transaction was initiated. Hence, if conflicts occur frequently, performance may degrade. This paper proposes a criterion for selecting which transaction to abort that takes into account the data size in each log. In addition, another criterion that takes into account the degree of conflict is proposed. Experiments with SPLASH-2 benchmark suite programs show that the proposed methods improve performance by up to 2.7%.
Citations: 0
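The contrast between the two victim-selection policies described in the abstract above can be sketched in a few lines. This is only an illustration under stated assumptions: the abstract does not spell out the exact rules, so the sketch assumes that the time-based baseline aborts the transaction that started later and that the log-size criterion aborts the transaction with the smaller log (the cheaper one to roll back). The names `Txn`, `victim_by_start_time`, and `victim_by_log_size` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Txn:
    name: str
    start_time: int  # logical time at which the transaction began
    log_bytes: int   # amount of old data currently saved in its undo log


def victim_by_start_time(a: Txn, b: Txn) -> Txn:
    """Assumed baseline: abort the transaction that started later."""
    return a if a.start_time > b.start_time else b


def victim_by_log_size(a: Txn, b: Txn) -> Txn:
    """Assumed log-size criterion: abort the transaction with the smaller log,
    since restoring it costs less; fall back to start time on a tie."""
    if a.log_bytes != b.log_bytes:
        return a if a.log_bytes < b.log_bytes else b
    return victim_by_start_time(a, b)


# Hypothetical conflict: an old transaction with a tiny log meets a
# young transaction that has already logged a lot of data.
old = Txn("old", start_time=1, log_bytes=64)
young = Txn("young", start_time=5, log_bytes=4096)

print(victim_by_start_time(old, young).name)  # time-based choice
print(victim_by_log_size(old, young).name)    # log-size-based choice
```

In this example the two policies disagree: the time-based rule rolls back 4096 bytes of log, while the size-based rule rolls back only 64 bytes.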
CODIE: Continuation-Based Overlapping Data-Transfers with Instruction Execution
Pub Date : 2010-11-17 DOI: 10.1109/IC-NC.2010.26
T. Miyoshi, Kenji Kise, H. Irie, T. Yoshinaga
In this paper, a runtime system termed CODIE is proposed to execute the sequential part of programs efficiently in a many-core architecture. All independent processing elements in a many-core architecture use a shared network and off-chip memory. Therefore, contention for such resources substantially degrades system performance. On the CODIE system, when a cache miss occurs, the system first initiates a data transfer operation. Next, the system creates a continuation for executing the instructions related to the missing data. The continuation is stored in a buffer, and the instructions not related to the missing data are executed subsequently. In other words, data transfer and instruction execution can be performed simultaneously. In this way, the overhead of updating a cache entry (which is increased by memory access contention) is tolerated. The evaluation results show that the proposed CODIE system achieves a 1.86x speedup for a sequential write/read program on the M-Core architecture with 36 cores and a 1.97x speedup for blackscholes (from the PARSEC benchmark suite) on the Cell/BE processor with 6 SPEs.
Citations: 3
A Traffic Analysis Using Cardinalities and Header Information
Pub Date : 2010-11-17 DOI: 10.1109/IC-NC.2010.36
Y. Shomura, K. Yoshida, Akira Sato, Satoshi Matsumoto, K. Itano
Recently, the variety and scale of computer networks have increased rapidly. To keep networks stable and reliable, network administrators have to understand the nature of network traffic flows. We have developed a cardinality-analysis method that analyzes cardinalities in TCP/IP headers. The cardinalities can be used to detect abnormal traffic such as DDoS attacks and Internet worms. However, much traffic remains unclassified. In this paper, we propose a further analysis that consists of two parts: 1) selecting service port numbers and 2) analyzing the volume of inflow and outflow for each service along with packet sizes. The proposed method can analyze the behavior of hosts and services in detail. We applied the proposed analysis to traffic captured on the University of Tsukuba's campus network and demonstrated that it can classify services into four groups: download type, upload type, two-way type, and control or real-time communication type, which normally cannot be classified by cardinality analysis.
Citations: 3
IEEE802.11b/g Standard: Theoretical Maximum Throughput
Pub Date : 2010-11-17 DOI: 10.1109/IC-NC.2010.40
J. Bordim, A. V. Barbosa, Marcos F. Caetano, P. S. Barreto
Estimating the throughput of a WiFi connection can be quite complex, even when considering simplified scenarios. Indeed, the large number of parameters specified in the standards makes it hard to understand their impact on delay and throughput. The main contribution of this work is a simple scheme to compute the exact maximum throughput of an IEEE 802.11g network. The proposed scheme incorporates all the timings and settings, which allows one to calculate the throughput for the different channel spacings and modulation techniques specified in the standard. Numerical and experimental results showing the accuracy of the proposed scheme are also presented.
Citations: 13
A New Wireless TCP Issue in Cognitive Radio Networks
Pub Date : 2010-11-17 DOI: 10.1109/IC-NC.2010.37
Yunlei Cheng, E. Wu, Gen-Huey Chen
In recent years, research has focused on the degradation of TCP throughput over wireless links, and many wireless TCP solutions have been proposed to deal with this issue. Meanwhile, as hardware technology has improved, new network structures and mechanisms have been proposed to enhance wireless communications, for example, Cognitive Radio (CR) networks. However, this new network architecture causes a new problem that existing wireless TCPs do not solve. In this paper, we identify a new issue that impacts TCP performance over CR networks, which we call Bandwidth Variation. A cross-layer solution is proposed to deal with this new issue. Both numerical and simulation results are presented to demonstrate the effectiveness of the proposed solution.
Citations: 5
Acceleration of Hessenberg Reduction for Nonsymmetric Eigenvalue Problems Using GPU
Pub Date : 2010-11-17 DOI: 10.1109/IC-NC.2010.52
Jun-ichi Muramatsu, Shaoliang Zhang, Yusaku Yamamoto
The solution of large-scale dense nonsymmetric eigenvalue problems is required in many areas of scientific and engineering computing, such as vibration analysis of automobiles and analysis of electronic diffraction patterns. In this study, we focus on the Hessenberg reduction step and consider accelerating it using a GPU. Our main strategy is to use CUBLAS, an optimized BLAS library for GPUs. However, since Hessenberg reduction requires operations not supported by CUBLAS, we combine the CPU and GPU to perform the computation. We propose two approaches for combining the CPU and GPU: one that performs as much work as possible on the GPU and one that aggressively assigns computation on small matrices to the CPU. Experimental results show that the latter approach is considerably faster than the former. Compared with computation on a Core i7 processor with 4 cores, the latter approach with a Tesla C1060 GPU and the Core i7 processor achieves a 2.8-times speedup when computing the Hessenberg form of a $4{,}800 \times 4{,}800$ real matrix.
Citations: 1
Implementations of Parallel Computation of Euclidean Distance Map in Multicore Processors and GPUs
Pub Date : 2010-11-17 DOI: 10.1109/IC-NC.2010.55
Duhu Man, K. Uda, Hironobu Ueyama, Yasuaki Ito, K. Nakano
Given a 2-D binary image of size $n \times n$, the Euclidean Distance Map (EDM) is a 2-D array of the same size in which each element stores the Euclidean distance to the nearest black pixel. It is known that a sequential algorithm can compute the EDM in $O(n^2)$ time, and thus this algorithm is optimal. Work-time optimal parallel algorithms for the shared memory model have also been presented. However, these algorithms are too complicated to implement on existing shared memory parallel machines. The main contribution of this paper is to develop a simple parallel algorithm for the EDM and implement it on two parallel platforms: multicore processors and a Graphics Processing Unit (GPU). More specifically, we have implemented our parallel algorithm on a Linux server with four Intel six-core processors (Intel Xeon X7460, 2.66 GHz). We have also implemented it on a modern GPU system, the Tesla C1060. The experimental results show that, for an input binary image of size $10000 \times 10000$, our implementation on the multicore system achieves a speedup factor of 18 over a sequential algorithm using a single processor in the same system. Meanwhile, for the same input binary image, our implementation on the GPU achieves a speedup factor of 5 over the sequential implementation.
Citations: 24
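The definition of the Euclidean Distance Map given in the abstract above can be stated directly as a brute-force reference computation. The sketch below is only meant to pin down what the EDM contains; it compares every pixel against every black pixel, so it is far slower than the $O(n^2)$ sequential algorithm and the parallel algorithm the paper refers to. The function name `euclidean_distance_map` and the convention that `1` marks a black pixel are assumptions made for this sketch.

```python
import math


def euclidean_distance_map(image):
    """Brute-force EDM: for every pixel, the distance to the nearest black pixel.

    `image` is a list of rows; a value of 1 is treated as black (an assumed
    convention). This is a definitional reference, not an efficient algorithm.
    """
    blacks = [
        (r, c)
        for r, row in enumerate(image)
        for c, value in enumerate(row)
        if value == 1
    ]
    if not blacks:
        raise ValueError("image contains no black pixel")
    return [
        [min(math.hypot(r - br, c - bc) for br, bc in blacks)
         for c in range(len(row))]
        for r, row in enumerate(image)
    ]


# Hypothetical 3x3 image with black pixels in two opposite corners.
image = [
    [1, 0, 0],
    [0, 0, 0],
    [0, 0, 1],
]
edm = euclidean_distance_map(image)
for row in edm:
    print(["%.3f" % d for d in row])
```

A reference like this is mainly useful for checking the output of a faster implementation on small inputs.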
Efficient Canny Edge Detection Using a GPU
Pub Date : 2010-11-17 DOI: 10.1109/IC-NC.2010.13
Kohei Ogawa, Yasuaki Ito, K. Nakano
Recent GPUs, which have many processing units connected to a global memory, can be used for general-purpose parallel computation. Users can develop parallel programs that run on GPUs using a programming architecture called CUDA (Compute Unified Device Architecture). The main contribution of this paper is an implementation of the Canny edge detection algorithm on CUDA. The experimental results show that our CUDA implementation of the Canny edge detection algorithm achieves a speedup factor of 61 over a conventional software implementation.
Citations: 120