2014 20th IEEE International Conference on Parallel and Distributed Systems (ICPADS): Latest Publications

Using surrogate-based modeling to predict optimal I/O parameters of applications at the extreme scale
Pub Date : 2014-12-01 DOI: 10.1109/PADSW.2014.7097855
Michael Matheny, Stephen Herbein, N. Podhorszki, S. Klasky, M. Taufer
On petascale systems, the selection of optimal values for I/O parameters without taking into account the I/O size and pattern can cause the I/O time to dominate the simulation time, compromising the application's scalability. In this paper, we adopt and adapt an engineering method called surrogate-based modeling to efficiently search for the optimal I/O parameter values and accurately predict the associated I/O times at the extreme scale. Our approach allows us to address both the search and prediction in a short time, even when the application's I/O is large and exhibits irregular patterns.
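The search-and-predict idea in this abstract can be illustrated with a toy example. The following is a minimal sketch, not the authors' method: it assumes a single hypothetical tuning parameter (`stripe_count`), a hypothetical stand-in for the expensive benchmark (`measure_io_time`), and a quadratic surrogate fitted through three samples. The stand-in cost is itself quadratic, so the fit is exact here; a real I/O time curve would not be, and a practical surrogate would be refitted as new measurements arrive.

```python
def measure_io_time(stripe_count):
    """Hypothetical stand-in for one expensive I/O benchmark run (seconds)."""
    return 0.02 * (stripe_count - 40) ** 2 + 12.0


def fit_quadratic(points):
    """Build a quadratic surrogate through three (x, y) samples (Lagrange form)."""
    (x0, y0), (x1, y1), (x2, y2) = points

    def surrogate(x):
        return (y0 * (x - x1) * (x - x2) / ((x0 - x1) * (x0 - x2))
                + y1 * (x - x0) * (x - x2) / ((x1 - x0) * (x1 - x2))
                + y2 * (x - x0) * (x - x1) / ((x2 - x0) * (x2 - x1)))

    return surrogate


# Only three expensive measurements are taken (assumed sample points).
samples = [(x, measure_io_time(x)) for x in (8, 64, 128)]
surrogate = fit_quadratic(samples)

# The cheap surrogate is then scanned over every candidate value.
candidates = range(1, 161)
best_stripe_count = min(candidates, key=surrogate)
predicted_io_time = surrogate(best_stripe_count)
```

The point of the sketch is the cost asymmetry: the benchmark is called three times, while the surrogate is evaluated at all 160 candidates.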
Citations: 8
Achieving cost effective cloud video services via fine grained multicore scheduling
Pub Date : 2014-12-01 DOI: 10.1109/PADSW.2014.7097843
Hao-Che Kao, Hao-Ping Kang, Che-Rung Lee, Kun-Hsien Lu, Shu-Hsin Chang
Cloud computing, which offers highly accessible and elastic computing resources, perfectly matches the demands of video services, which employ massive storage and intensive computational power to store, transmit, compress, enhance, and analyze videos uploaded from commodity devices and surveillance cameras. However, most existing video processing programs are neither designed to run in parallel environments nor able to efficiently utilize the computational power of cloud platforms, which not only wastes computing resources but also increases the cost of using cloud platforms. In this paper, we present three strategies to enhance multicore utilization for video processing, namely the producer-consumer model, intra-process overlapping, and inter-process overlapping. We tested our strategies on a video enhancement program that performs decoding, dehazing, and encoding, and the results showed that CPU utilization can be improved by up to 31% on an 8-core instance, which can significantly reduce cost in the long run.
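The producer-consumer strategy named in this abstract can be sketched as a three-stage pipeline in which each stage consumes from one queue and produces into the next. This is only an illustration under stated assumptions: the stage functions (`decode`, `dehaze`, `encode`) are hypothetical placeholders that tag strings, and Python threads are used for brevity even though CPU-bound stages in CPython would need processes or native code to occupy several cores.

```python
import queue
import threading

SENTINEL = None  # marks the end of the frame stream


def stage(work, inbox, outbox):
    """Consume items from inbox, apply work, and produce results into outbox."""
    while True:
        item = inbox.get()
        if item is SENTINEL:
            outbox.put(SENTINEL)  # pass the end marker downstream
            break
        outbox.put(work(item))


# Hypothetical placeholder stages; real ones would process frame buffers.
def decode(frame_id):
    return f"frame{frame_id}"


def dehaze(frame):
    return frame + ":dehazed"


def encode(frame):
    return frame + ":encoded"


def run_pipeline(n_frames):
    # Bounded queues between stages limit memory; the final queue is
    # unbounded so the pipeline can always drain while frames are fed in.
    q_raw, q_decoded, q_dehazed = (queue.Queue(maxsize=4) for _ in range(3))
    q_done = queue.Queue()
    threads = [
        threading.Thread(target=stage, args=(decode, q_raw, q_decoded)),
        threading.Thread(target=stage, args=(dehaze, q_decoded, q_dehazed)),
        threading.Thread(target=stage, args=(encode, q_dehazed, q_done)),
    ]
    for t in threads:
        t.start()
    for frame_id in range(n_frames):
        q_raw.put(frame_id)
    q_raw.put(SENTINEL)
    for t in threads:
        t.join()
    results = []
    while True:
        item = q_done.get()
        if item is SENTINEL:
            break
        results.append(item)
    return results
```

Because each stage is a single thread reading a FIFO queue, frame order is preserved; while one frame is being encoded, the next can already be dehazed and a third decoded.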
Citations: 0
Combine thread with memory scheduling for maximizing performance in multi-core systems
Pub Date : 2014-12-01 DOI: 10.1109/PADSW.2014.7097821
Gangyong Jia, Guangjie Han, Liang Shi, Jian Wan, Dong Dai
The growing gap between microprocessor speed and DRAM speed is a major problem that computer designers are facing. In order to narrow the gap, it is necessary to improve DRAM's speed and throughput. Moreover, on multi-core platforms, DRAM memory shared by all cores usually suffers from memory contention and interference, which can cause serious performance degradation and unfairness among parallel running threads. To address these problems, this paper proposes techniques that combine two ideas: partitioning cores, threads, and memory banks into groups to reduce interference among different groups, and grouping memory accesses to the same row together to reduce the cache miss rate. We propose a memory optimization framework that combines thread scheduling with memory scheduling (CTMS), which simultaneously minimizes memory access schedule length and memory access time and reduces interference, to maximize performance for multi-core systems. Experimental results show that CTMS shortens memory access time by 12.6% while improving throughput by 11.8% on average. Moreover, CTMS also saves 5.8% of the energy consumption.
Citations: 4
Building a large-scale direct network with low-radix routers
Pub Date : 2014-12-01 DOI: 10.1109/PADSW.2014.7097830
Yong Su, Zheng Cao, Zhiguo Fan, Zhan Wang, Xiaoli Liu, Xiaobing Liu, Li Qiang, Xuejun An, Ninghui Sun
Communication locality is an important characteristic of parallel applications. A great deal of research shows that exploiting this characteristic benefits most applications. Targeting communication locality, we present a hierarchical direct network topology to accelerate neighbor communication. Combining mesh topology and complete graph topology, it can be used to optimize local communication and build large-scale networks with low-radix routers. Analyzing the characteristics of the hierarchical topology, we find that the presented topology is highly cost-effective and has excellent expandability. We also design two minimum path routing algorithms and compare them with Mesh, Dragonfly and PERCS topologies. The results show that the saturated throughput of the hierarchical topology is nearly 40% with a uniform random trace and 70% with a local communication model of 4K nodes. This indicates high scalability for applications with local communication and cost efficiency for uniform random traces.
Citations: 1
Heterogeneous CPU-GPU computing for the finite volume method on 3D unstructured meshes
Pub Date : 2014-12-01 DOI: 10.1109/PADSW.2014.7097808
J. Langguth, Xing Cai
A recent trend in modern high-performance computing environments is the introduction of accelerators such as GPUs and Xeon Phi, i.e., specialized computing devices that are optimized for highly parallel applications and coexist with CPUs. In regular compute-intensive applications with predictable data access patterns, these devices often outperform traditional CPUs by far and thus relegate them to pure control functions instead of computations. For irregular applications, however, the gap in relative performance can be much smaller, and sometimes even reversed. Thus, maximizing overall performance in such systems requires making full use of all available computational resources. In this paper we study the attainable performance of the cell-centered finite volume method on 3D unstructured tetrahedral meshes using heterogeneous systems consisting of CPUs and multiple GPUs. Finite volume methods are widely used numerical strategies for solving partial differential equations. The advantages of using finite volumes include built-in support for conservation laws and suitability for unstructured meshes. Our focus lies in demonstrating how a workload distribution that maximizes overall performance can be derived from the actual performance attained by the different computing devices in the heterogeneous environment. We also highlight the dual role of partitioning software in reordering and partitioning the input mesh, thus giving rise to a new combined approach to partitioning.
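One simple way to derive a workload distribution from measured device performance is to give each device a share of mesh cells proportional to its measured throughput, so that all devices finish at roughly the same time. The sketch below assumes this proportional rule; the function name `split_workload`, the device names, and the rates are hypothetical, and the paper's actual partitioning scheme may differ.

```python
def split_workload(total_cells, rates):
    """Split total_cells across devices in proportion to measured rates.

    rates maps a device name to its measured throughput (cells per second).
    Cells left over after rounding down go to the largest fractional shares.
    """
    total_rate = sum(rates.values())
    exact = {dev: total_cells * r / total_rate for dev, r in rates.items()}
    counts = {dev: int(share) for dev, share in exact.items()}
    leftover = total_cells - sum(counts.values())
    by_remainder = sorted(exact, key=lambda d: exact[d] - counts[d], reverse=True)
    for dev in by_remainder[:leftover]:
        counts[dev] += 1
    return counts


# Hypothetical measured throughputs for one CPU and two GPUs.
shares = split_workload(1000, {"cpu": 1.0, "gpu0": 3.0, "gpu1": 4.0})
```

With these assumed rates the CPU receives one eighth of the cells instead of none, which is the situation the abstract describes for irregular applications where the CPU-GPU performance gap is small.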
Citations: 8
Performance analysis of HPC applications with irregular tree data structures
Pub Date : 2014-12-01 DOI: 10.1109/PADSW.2014.7097837
A. Khawaja, Jiajun Wang, A. Gerstlauer, L. John, D. Malhotra, G. Biros
Adaptive mesh refinement (AMR) numerical methods utilizing octree data structures are an important class of HPC applications, in particular for the solution of partial differential equations. Much effort goes into the implementation of efficient versions of these types of programs, where the emphasis is often on increasing multi-node performance when utilizing GPUs and coprocessors. By contrast, our analysis aims to characterize these workloads on traditional CPUs, as we believe that single-threaded intra-node performance of critical kernels is still a key factor for achieving performance at scale. However, irregular workloads such as AMR methods in particular exhibit severe underutilization on general-purpose processors. In this paper, we analyze the single-core performance of two state-of-the-art, highly scalable adaptive mesh refinement codes, one based on the Fast Multipole Method (FMM) and one based on the Finite Element Method (FEM), when running on an x86 CPU. We examined both scalar and vectorized implementations to identify performance bottlenecks. We demonstrate that vectorization can provide a significant benefit in achieving high performance. The greatest bottleneck to peak performance is the high fraction of non-floating-point instructions in the kernels.
Citations: 1
ArPat: Accurate RFID reader positioning with mere boundary tags
Pub Date : 2014-12-01 DOI: 10.1109/PADSW.2014.7097901
Guanglian Liu, Shigeng Zhang, Jianxin Wang, Xuan Liu
The Radio Frequency IDentification (RFID) technology provides a promising solution to location discovery in indoor environments. Existing RFID reader positioning algorithms usually use all the collected reference tags to determine the position of the target reader, and thus are time-consuming as well as susceptible to communication irregularity between the reader and reference tags. In particular, they usually perform poorly when the target reader is near a wall or in a corner. In this paper, we propose ArPat, an Accurate RFID reader Positioning algorithm that uses mere boundary reference Tags to calculate the position of the reader. ArPat uses only boundary tags to determine the position of the target reader, which effectively mitigates the negative impact of communication irregularity on the localization accuracy. The localization error of ArPat is below 0.2 ft when the spacing between reference tags is 1 ft. Compared with state-of-the-art solutions for RFID reader positioning, ArPat improves localization accuracy by up to 42 percent, and by 36 percent on average. Furthermore, it uses a geometric approach rather than the iterative optimization approaches employed by previous solutions, making it superior in time efficiency. Compared with previous solutions, the computational time of ArPat is nearly two orders of magnitude less. This is critical for a localization system to provide real-time location discovery and tracking services.
Citations: 1
Atomic reduction based sparse matrix-transpose vector multiplication on GPUs
Pub Date : 2014-12-01 DOI: 10.1109/PADSW.2014.7097920
Yuan Tao, Yangdong Deng, Shuai Mu, Mingfa Zhu, Limin Xiao, Li Ruan, Zhibin Huang
Sparse Matrix-Transpose Vector Product (SMTVP) is a frequently used computation pattern in High Performance Computing applications. It is typically solved by transposition followed by a Sparse Matrix-Vector Product (SMVP) in current linear algebra packages. However, the transposition process can be a serious bottleneck on modern parallel computing platforms. A previous work proposed a relatively complex data structure for efficiently computing SMTVP with multi-core CPUs, but it proved to be inefficient on GPUs. In this work, we show that the Compressed Sparse Row (CSR) based SMVP algorithm can also be efficient for SMTVP computation on modern GPUs. The proposed method exploits atomic operations to perform the reduce operation in the computation of each inner product of a row in the transposed matrix and the vector. Experimental results show that the simple technique can outperform the SMTVP flow of transposition plus SMVP released in the CUSPARSE package by up to 405-fold.
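The core idea, computing the transposed product directly from the CSR arrays without building the transpose, can be sketched sequentially. This is an illustrative Python emulation rather than the authors' GPU kernel, and the function name `csr_transpose_matvec` is an assumption. Each stored entry A[row][col] contributes `value * x[row]` to `y[col]`; on a GPU, threads working on different rows may update the same `y[col]` concurrently, which is where an atomic add would be needed.

```python
def csr_transpose_matvec(n_cols, row_ptr, col_idx, values, x):
    """Compute y = A^T x for a CSR matrix A without forming A^T."""
    y = [0.0] * n_cols
    for row in range(len(row_ptr) - 1):
        for k in range(row_ptr[row], row_ptr[row + 1]):
            # Scatter update: in a parallel GPU kernel this would be an
            # atomic add, because several rows can share a column index.
            y[col_idx[k]] += values[k] * x[row]
    return y


# A = [[1, 0, 2],
#      [0, 3, 0],
#      [4, 0, 5]] stored in CSR form.
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
y = csr_transpose_matvec(3, row_ptr, col_idx, values, [1.0, 2.0, 3.0])
```

In the example, column 0 receives contributions from rows 0 and 2 (1·1 + 4·3), which is exactly the kind of shared accumulation that makes the update a reduction.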
Citations: 6
Design and analysis of software defined Vehicular Cyber Physical Systems
Pub Date : 2014-12-01 DOI: 10.1109/PADSW.2014.7097836
P. Duan, Chao Peng, Qin Zhu, Jingmin Shi, Haibin Cai
VCPS (Vehicular Cyber Physical Systems) is a special kind of networked cyber physical system in which each vehicle is regarded as a communication unit. Vehicle movement in VCPS is restricted by roads and the environment, so the traditional random mobility model and waypoint mobility model cannot reflect realistic vehicle traces. In VCPS, because vehicles move at high speed, the network topology undergoes tremendous changes all the time, which greatly undermines the stability of communication between vehicles. The diversity and complexity of traffic scenarios in VCPS have also increased the difficulty of designing an efficient and stable routing protocol. In this paper, we combine SDN (Software Defined Networking) and VCPS and propose a new VCPS communication architecture, SD-VCPS, which enables VCPS to be managed by a remote controller. SD-VCPS can flexibly change routing policies depending on different traffic scenes or traffic periods, adjusting the topology of VCPS to adapt to different network requirements. We further present a new location-based routing protocol for SD-VCPS, and corroborate the efficiency of our proposed framework by experiments using the network simulator NS3.
Citations: 7
Providing hybrid block storage for virtual machines using object-based storage
Pub Date : 2014-12-01 DOI: 10.1109/PADSW.2014.7097803
Sixiang Ma, Hao-peng Chen, Yuxi Shen, Heng Lu, Bin Wei, P. He
This paper presents the design, implementation, and evaluation of MOBBS, a multi-tiered storage system that provides hybrid block storage for virtual machines (VMs) on top of object-based storage infrastructure. MOBBS is motivated by a gap: hybrid storage systems are increasingly prevalent, yet hybrid block storage for VMs has received little study. By striping disk images into partitions and storing them on different storage tiers according to real-time workload patterns, MOBBS uses multiple storage devices efficiently and relieves the burden of data placement. Leveraging the benefits of object-based storage, MOBBS can dynamically perform non-disruptive, fine-grained data migration between storage tiers and distribute the complexity of data migration across all storage nodes. This design enables the system to deliver highly scalable and available storage for VMs while using SSDs efficiently. We evaluated a Ceph implementation of MOBBS using both block and file system workloads. The results demonstrate that MOBBS improves performance and uses the different storage devices efficiently.
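The abstract does not specify how MOBBS decides which partitions belong on which tier. As a rough illustration only, the sketch below assumes a simple frequency-based heuristic: disk offsets are mapped to fixed-size partitions, and the most frequently accessed partitions are assigned to the SSD tier. The partition size, function names, and access trace are hypothetical, not taken from the paper.

```python
from collections import Counter

MIB = 1024 * 1024
PARTITION_SIZE = 4 * MIB  # assumed partition size, for illustration only


def partition_index(offset, partition_size=PARTITION_SIZE):
    """Map a byte offset in the disk image to its partition number."""
    return offset // partition_size


def plan_placement(access_offsets, ssd_slots):
    """Assign each accessed partition to 'ssd' or 'hdd'.

    The `ssd_slots` most frequently accessed partitions go to the SSD
    tier; every other accessed partition goes to the HDD tier.
    """
    heat = Counter(partition_index(off) for off in access_offsets)
    hottest = {p for p, _ in heat.most_common(ssd_slots)}
    return {p: ("ssd" if p in hottest else "hdd") for p in heat}


# Hypothetical access trace (byte offsets into one disk image).
trace = [0, 1 * MIB, 5 * MIB, 5 * MIB, 6 * MIB, 5 * MIB, 9 * MIB, 9 * MIB, 10 * MIB]
placement = plan_placement(trace, ssd_slots=2)
```

A real system would also have to migrate partitions as the workload shifts; the abstract states that MOBBS does this between tiers without disruption, but gives no details that could be sketched here.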
Citations: 8