
2010 First International Conference on Networking and Computing: Latest Papers

Screen-Space Ambient Occlusion through Summed-Area Tables
Pub Date : 2010-11-17 DOI: 10.1109/IC-NC.2010.18
M. Slomp, Toru Tamaki, K. Kaneda
There is an increasing demand for high-quality real-time graphics. Shadows play an important role in the realism of computer-generated images, enhancing the perception of depth, curvature and location. Due to their global nature, shadows introduce overwhelming complexity to rendering algorithms. Recently, screen-space ambient occlusion techniques have started to flourish, and they are now the de facto standard for real-time dynamic shadow synthesis. A few issues remain, though, such as sampling quality and noise artifacts. The contributions of this work are twofold: a variation of screen-space ambient occlusion that uses Summed-Area Tables, yielding satisfactory results while performing better than previous attempts, and a new application for the arsenal of Summed-Area Tables.
Citations: 5
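As a rough illustration of the summed-area tables named in the abstract above, the following minimal sketch builds a table over a small 2D grid and uses it to sum any axis-aligned box with four lookups. The function names (`build_sat`, `box_sum`, `box_mean`) are hypothetical, and the sketch shows only the generic data structure; it is not the occlusion estimator described in the paper.

```python
def build_sat(img):
    """Build a summed-area table padded with one leading row and column of zeros."""
    h, w = len(img), len(img[0])
    sat = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0  # running sum of the current row
        for x in range(w):
            row_sum += img[y][x]
            # Entry = sum of everything above (previous row) plus this row so far.
            sat[y + 1][x + 1] = sat[y][x + 1] + row_sum
    return sat


def box_sum(sat, x0, y0, x1, y1):
    """Sum of img[y][x] for x0 <= x < x1 and y0 <= y < y1, in constant time."""
    return sat[y1][x1] - sat[y0][x1] - sat[y1][x0] + sat[y0][x0]


def box_mean(sat, x0, y0, x1, y1):
    """Average value inside the same half-open box."""
    return box_sum(sat, x0, y0, x1, y1) / ((x1 - x0) * (y1 - y0))


if __name__ == "__main__":
    depth = [[1, 2, 3],
             [4, 5, 6],
             [7, 8, 9]]
    sat = build_sat(depth)
    print(box_sum(sat, 1, 1, 3, 3))   # sum of the lower-right 2x2 block
    print(box_mean(sat, 0, 0, 3, 3))  # mean of the whole grid
```

The appeal for screen-space filtering is that the cost of a box query does not depend on the box size, so large neighbourhoods can be averaged as cheaply as small ones.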
Pattern-Based Systematic Task Mapping for Many-Core Processors
Pub Date : 2010-11-17 DOI: 10.1109/IC-NC.2010.33
Shintarou Sano, M. Sano, Shimpei Sato, T. Miyoshi, Kenji Kise
The Network-on-Chip (NoC) is a promising interconnect for many-core processors. On NoC-based many-core processors, the network performance of multithreaded programs depends on the task mapping method. In this paper, we propose a pattern-based task mapping method to improve the performance of many-core processors. Evaluation of the proposed method with a detailed software simulator, using the NAS parallel benchmarks, reveals an average performance improvement of at least 4.4% compared with standard task mapping.
Citations: 3
Node-to-Set Disjoint-Paths Routing in Recursive Dual-Net
Pub Date : 2010-11-17 DOI: 10.1109/IC-NC.2010.11
Yamin Li, S. Peng, Wanming Chu
Recursive dual-net (RDN) is a newly proposed interconnection network for massively parallel computers. The RDN is based on recursive dual-construction of a symmetric base network. A $k$-level dual-construction for $k>0$ creates a network containing $(2n_0)^{2^k}/2$ nodes with node-degree $d_0+k$, where $n_0$ and $d_0$ are the number of nodes and the node-degree of the base network, respectively. The RDN is node and edge symmetric and can contain a huge number of nodes with small node-degree and short diameter. Node-to-set disjoint-paths routing is fundamental and has many applications for fault-tolerant and secure communication in a network. In this paper, we propose an efficient algorithm for node-to-set disjoint-paths routing on RDN.
Citations: 4
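To make the growth rate in the abstract above concrete, the short sketch below evaluates the stated node count, (2·n0)^(2^k)/2, and node-degree, d0 + k, for a few levels. The function name `rdn_size` and the example base network (4 nodes of degree 2, such as a 4-node ring) are assumptions chosen purely for illustration.

```python
def rdn_size(n0, d0, k):
    """Return (node count, node degree) of a k-level recursive dual-net.

    n0 and d0 are the node count and node-degree of the base network,
    following the formulas quoted in the abstract.
    """
    nodes = (2 * n0) ** (2 ** k) // 2  # (2*n0)^(2^k) is even, so // is exact
    degree = d0 + k
    return nodes, degree


if __name__ == "__main__":
    # Hypothetical base network: 4 nodes, each of degree 2.
    for level in range(1, 4):
        nodes, degree = rdn_size(4, 2, level)
        print(f"k={level}: {nodes} nodes, degree {degree}")
```

With this base network the node count squares (up to a factor of two) at every level while the degree grows by only one, which is the property the abstract highlights.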
Parallel Matrix-Matrix Multiplication Based on HPL with a GPU-Accelerated PC Cluster
Pub Date : 2010-11-17 DOI: 10.1109/IC-NC.2010.39
Qin Wang, Junichi Ohmura, Shan Axida, T. Miyoshi, H. Irie, T. Yoshinaga
In this paper, we propose an approach for significantly improving the performance of parallel matrix-matrix multiplication using a GPU-accelerated cluster. For one node, we implement a CPUs-GPU parallel double-precision general matrix-matrix multiplication (dgemm) operation and achieve a performance improvement of 32% as compared to the GPU-only case and 56% as compared to the CPUs-only case. For the entire cluster, we use the overlap GPU acceleration solution to high-performance Linpack (HPL), which eliminates the close dependency between the LU decomposition and the dgemm operation, and achieve a performance improvement of 5.72% as compared to the flat GPU acceleration case.
Citations: 5
Power Saving in Mobile Devices Using Context-Aware Resource Control
Pub Date : 2010-11-17 DOI: 10.1109/IC-NC.2010.50
Kosuke Nishihara, K. Ishizaka, J. Sakai
We present an effective power reduction scheme for recent mobile devices, e.g., Android devices, which tend to have problems with battery life because some of their applications may run continuous sensor operations. We propose a context-aware method to determine the minimum set of resources (processor cores and peripherals) needed to meet a given level of performance. With it, unnecessary processor cores and peripherals can be switched off without degrading overall performance. Our experimental results indicate that its use can result in a 45% reduction in total power consumption. Since our method does not require applications to be modified, it can even be used easily with downloaded applications.
Citations: 22
Compressing Floating-Point Number Stream for Numerical Applications
Pub Date : 2010-11-17 DOI: 10.1109/IC-NC.2010.24
Hisanobu Tomari, M. Inaba, K. Hiraki
Clusters of commodity computers and general-purpose computers with accelerators such as GPGPUs are now common platforms for solving computationally intensive tasks like scientific simulations. Both technologies provide users with high performance at relatively low cost. However, the low bandwidth of the interconnect compared to the computing performance hinders efficient operation of both clusters and accelerators for the many algorithms that require heavy data transmission. For clusters, the network is one of the major performance bottlenecks; for accelerators, it is the peripheral bus that transfers data from the host to the memory on the accelerator card. In this paper, we propose a method of accelerating floating-point intensive algorithms by compressing the floating-point number stream. With an efficient software encoder and hardware decoder, the method eliminates redundancy in the exponent part of the numbers in the stream and compacts the entire array to 82.8% of its original size at the theoretical limit. The compression ratio is better than that of Gzip or Bzip2 for floating-point numbers. The reduction in communication time directly reduces the total running time of programs whose processing time is largely dominated by communication performance. We implemented a high-speed decoder on an FPGA that operates at over 6 GB/s. We estimated the application performance using FFT on a cluster and matrix multiplication on the GRAPE-DR accelerator, and our approach is useful in both configurations.
Citations: 8
Open Web: Seamless Proxy Interconnection at the Switching Layer
Pub Date : 2010-11-17 DOI: 10.1109/IC-NC.2010.19
Yoshio Sakurauchi, R. McGeer, H. Takada
The Internet was designed around the end-to-end principle, mimicking in many ways the architecture of the old telephone network: services were accessed by naming the specific end-host offering the service. The demands of robustness, performance, and ubiquitous low latency for a worldwide population have led to an architecture where the names of services are largely symbolic, and do not name specific hosts or locations. Traffic is redirected onto a service network through the use of proxies. A typical example is a web proxy. Currently, proxies are generally accessed through layer 4-7 scripts and commands, such as the route command on POSIX systems and, usually, manual configuration or JavaScript code for a web proxy. This process is tedious, error-prone, and far from robust. New open protocols at the switching layer (layer 2) now enable far more robust and seamless packet redirection, without the need for user configuration or unreliable scripts. In this paper, we describe Open Web, a layer-2 redirection engine implemented as an application of the OpenFlow switch architecture.
Citations: 12
Improving Hybrid OpenCL Performance by High Speed Networks
Pub Date : 2010-11-17 DOI: 10.1109/IC-NC.2010.42
Ryo Aoki, S. Oikawa, Ryoji Tsuchiyama, Takashi Nakamura
We developed Hybrid OpenCL, which enables connections between different OpenCL implementations over the network. Hybrid OpenCL consists of two elements: a runtime system that provides an abstraction of different OpenCL implementations, and a bridge program that connects multiple OpenCL runtime systems over the network. Two problems in OpenCL are that different OpenCL devices cannot be used from a single OpenCL runtime, and that the number of usable OpenCL devices is limited to the number of internal bus slots. Hybrid OpenCL enables the construction of scalable OpenCL environments. It allows applications written in OpenCL to be easily ported to high-performance cluster computers; thus, Hybrid OpenCL can provide a wider variety of parallel computing platforms and increase the utility value of OpenCL applications. This paper describes the improvement of Hybrid OpenCL through the use of high-speed networks and presents experimental results. The results show that high-speed networks reduce the overhead introduced by Hybrid OpenCL, and that InfiniBand SDP delivers the best performance among the configurations tested.
Citations: 5
Maximizing Image Utilization in Photomosaics
Pub Date : 2010-11-17 DOI: 10.1109/IC-NC.2010.17
M. Mikamo, M. Slomp, Shun Yanase, B. Raytchev, Toru Tamaki, K. Kaneda
Non-photorealistic rendering (NPR) is an appealing subject in computer graphics with a wide array of applications. As opposed to photorealistic rendering, NPR focuses on highlighting features and artistic traits instead of physical accuracy. Photomosaic generation is one of the most popular NPR techniques, where a single image is assembled from several smaller ones. Visual responses change depending on the viewer's proximity to the photomosaic, leading to many creative prospects for publicity. Synthesizing photomosaics typically requires very large image databases in order to produce pleasing results. Moreover, repetitions are allowed to occur, which may locally bias the mosaic. This paper provides alternatives that prevent repetitions while still being robust enough to work with coarse image subsets. Three approaches were devised for the matching stage of photomosaics: a greedy-based procedural algorithm, simulated annealing and SoftAssign. We found that the latter two approaches deliver adequate arrangements in cases where only a restricted number of images is available.
Citations: 3
An RSA Encryption Hardware Algorithm Using a Single DSP Block and a Single Block RAM on the FPGA
Pub Date : 2010-11-17 DOI: 10.1109/IC-NC.2010.56
Bo Song, K. Kawakami, K. Nakano, Yasuaki Ito
The main contribution of this paper is to present an efficient hardware algorithm for RSA encryption/decryption based on Montgomery multiplication. Modern FPGAs have a number of embedded DSP blocks (DSP48E1) and embedded memory blocks (BRAM). Our hardware algorithm supporting 2048-bit RSA encryption/decryption is designed to be implemented using one DSP48E1, one BRAM and a few logic blocks (slices) in the Xilinx Virtex-6 family FPGA. The implementation results show that our RSA module for 2048-bit RSA encryption/decryption runs in 277.26 ms. Quite surprisingly, the multiplier in the DSP48E1 used to compute Montgomery multiplication is active in more than 97% of all clock cycles. Hence, our implementation is close to optimal in the sense that it has less than 3% overhead in multiplication, and no further improvement is possible as long as a Montgomery multiplication based algorithm is used. Also, since our circuit uses only one DSP48E1 block and one Block RAM, we can implement a number of RSA modules in an FPGA that work in parallel to attain high-throughput RSA encryption/decryption.
Citations: 32
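For readers unfamiliar with the Montgomery multiplication that the abstract above builds on, here is a minimal software sketch of the textbook technique and of modular exponentiation on top of it. It is not the paper's hardware algorithm: the function names (`montgomery_setup`, `mont_mul`, `mont_pow`) are hypothetical, the modulus is assumed to be odd and greater than 1, and the toy key (n = 3233, e = 17, d = 2753) is the usual small classroom example rather than a 2048-bit key.

```python
def montgomery_setup(n, bits):
    """Return (r, n_prime) with r = 2**bits and n * n_prime == -1 (mod r)."""
    r = 1 << bits
    n_prime = (-pow(n, -1, r)) % r  # modular inverse via pow needs Python 3.8+
    return r, n_prime


def mont_mul(a, b, n, r, n_prime, bits):
    """Montgomery product a * b * r^-1 mod n, for 0 <= a, b < n and odd n."""
    t = a * b
    m = ((t & (r - 1)) * n_prime) & (r - 1)  # reductions mod r are bit masks
    u = (t + m * n) >> bits                  # exact division by r
    return u - n if u >= n else u


def mont_pow(base, exp, n):
    """Compute base**exp mod n with left-to-right square-and-multiply."""
    bits = n.bit_length()
    r, n_prime = montgomery_setup(n, bits)
    r2 = (r * r) % n
    x = mont_mul(base % n, r2, n, r, n_prime, bits)  # base in Montgomery form
    acc = r % n                                      # 1 in Montgomery form
    for bit in bin(exp)[2:]:
        acc = mont_mul(acc, acc, n, r, n_prime, bits)
        if bit == "1":
            acc = mont_mul(acc, x, n, r, n_prime, bits)
    return mont_mul(acc, 1, n, r, n_prime, bits)     # leave Montgomery form


if __name__ == "__main__":
    n, e, d = 3233, 17, 2753  # toy RSA key, for illustration only
    message = 65
    cipher = mont_pow(message, e, n)
    print(cipher, mont_pow(cipher, d, n))
```

The point of the technique is that the reduction step uses only multiplications, masks and shifts, with no division by the modulus, which is why it maps well onto a hardware multiplier such as a DSP block.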