
2010 39th International Conference on Parallel Processing Workshops: Latest Publications

GEM: Graphical Explorer of MPI Programs
Pub Date : 2010-10-25 DOI: 10.1145/1879211.1879248
A. Humphrey, C. Derrick, G. Gopalakrishnan, Beth Tibbitts
Formal dynamic verification can complement MPI program testing by detecting hard-to-find concurrency bugs. In previous work, we described our dynamic verifier, In-situ Partial Order (ISP), which can parsimoniously search the execution space of an MPI program while detecting important classes of bugs. A major limitation of ISP used on its own is the lack of a powerful and widely usable graphical front-end. We now present a new tool, Graphical Explorer of MPI Programs (GEM), that overcomes this limitation. GEM is a plug-in architecture that greatly enhances the usability of ISP and brings it within reach of a wide range of programmers. It was first released as part of version 3.0 of the Eclipse Foundation’s Parallel Tools Platform (PTP) in December 2009 and is now part of the PTP End-User Runtime. This paper describes GEM’s features and architecture and summarizes our experience using the ISP/GEM combination. We recently applied this combination to a widely used parallel hypergraph partitioner. Even with modest computational resources, ISP/GEM finished quickly and intuitively displayed a previously unknown resource leak in this code base. We also describe the process and benefits of using GEM throughout the development cycle of our own test case, an MPI implementation of A* search. We conclude with a summary of our future plans.
Citations: 13
Exploring the Limits of Tag Reduction for Energy Saving on a Multi-core Processor
Pub Date : 2010-09-13 DOI: 10.1109/ICPPW.2010.26
Long Zheng, M. Dong, K. Ota, Huakang Li, Song Guo, M. Guo
Saving energy usually leads to performance degradation. We explore the limits of tag reduction on a multi-core processor under a performance guarantee. In our previous work, tag reduction was applied to multi-core processors and yielded significant energy savings, but it also incurred performance overhead. In this paper, we find that when tag reduction is used on multi-core processors, the number of cores is the key factor affecting both energy and performance. More specifically, as the number of cores integrated on the chip increases, tag reduction saves more energy but also causes more performance degradation, so the limits of tag reduction can be expressed in terms of the number of cores. To derive these limits, we study the relationship between energy consumption and performance overhead and propose a decision model. We built an experimental platform composed of a Linux Physical Memory Monitor (LPMM), a Trace Recorder (TR), a Scalable Multi-core Simulator (SMS), and a Data Analysis Module (DAM). We run benchmarks from SPEC CPU2006 on a real operating system with the help of LPMM and TR, obtain raw energy and performance results with SMS, and finally use DAM to analyze the raw results and determine the limits. Experimental results show that tag reduction should be applied only to multi-core processors with no more than 6 cores; beyond that, its energy and performance efficiency degrades.
Citations: 0
Orbital Algorithms and Unified Array Processor for Computing 2D Separable Transforms
Pub Date : 2010-09-13 DOI: 10.1109/ICPPW.2010.29
S. Sedukhin, A. Zekri, T. Miyazaki
The two-dimensional (2D) forward/inverse discrete Fourier transform (DFT), discrete cosine transform (DCT), discrete sine transform (DST), discrete Hartley transform (DHT), and discrete Walsh-Hadamard transform (DWHT) play a fundamental role in many practical applications. Owing to separability, each of these transforms can be uniquely defined as a triple matrix product with one matrix transposition. Based on a systematic approach to representing and scheduling different forms of the $n \times n$ matrix-matrix multiply-add (MMA) operation in 3D index space, we design new orbital, highly parallel, scalable algorithms. We also present an efficient $n \times n$ unified array processor that computes any $n \times n$ forward/inverse discrete separable transform in the minimal $2n$ time-steps. Unlike traditional 2D systolic array processing, at each time-step all $n^2$ processing elements of the unified array processor simultaneously process all $n^2$ register-stored elements of the initial/intermediate matrices. The proposed array processor is therefore well suited to applications with naturally arranged multidimensional data, such as still images, video frames, and 2D data from a matrix sensor. Finally, we introduce a novel formulation and a highly parallel implementation of frequently required matrix data alignment and manipulation. This uses MMA operations on the same array processor, so no additional circuitry is needed.
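The separability argument can be made concrete with a small sequential sketch. This is not the orbital array-processor algorithm itself. For a kernel matrix $C$, the 2D transform of an $n \times n$ block $X$ is $Y = C X C^{T}$. The Python below uses a Walsh-Hadamard kernel purely because its $\pm 1$ entries keep the arithmetic exact. It checks the triple product against the direct double-sum definition and inverts it using $H H^{T} = nI$. All function names are illustrative.

```python
def matmul(a, b):
    """Dense matrix product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def hadamard(n):
    """Sylvester-construction Hadamard matrix of order n (n a power of two)."""
    h = [[1]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-x for x in row] for row in h]
    return h


def separable_2d(c, x):
    """Apply kernel c along columns, then rows: Y = C X C^T."""
    return matmul(matmul(c, x), transpose(c))


def direct_2d(c, x):
    """Direct definition: Y[u][v] = sum_{i,j} C[u][i] * X[i][j] * C[v][j]."""
    n = len(x)
    return [[sum(c[u][i] * x[i][j] * c[v][j]
                 for i in range(n) for j in range(n))
             for v in range(n)] for u in range(n)]


h = hadamard(4)
x = [[4 * i + j for j in range(4)] for i in range(4)]
y = separable_2d(h, x)
# Inverse: since H H^T = n I, H^T Y H = n^2 X.
back = [[v // 16 for v in row] for row in matmul(matmul(transpose(h), y), h)]
```

Because the first Hadamard row is all ones, Y[0][0] is simply the sum of all entries of X. For this 4 x 4 example that sum is 120.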
Citations: 18
Localization with Rotatable Directional Antennas for Wireless Sensor Networks
Pub Date : 2010-09-13 DOI: 10.1109/ICPPW.2010.79
Jehn-Ruey Jiang, Chih-Ming Lin, Yi-Jia Hsu
In this paper we present the design and implementation of a novel localization scheme, Rotatable Antenna Localization (RAL). It targets wireless sensor networks (WSNs) whose beacon nodes are equipped with regularly rotating directional antennas. Each beacon node periodically sends beacon signals containing its position and current antenna orientation. By observing how the received signal strength indication (RSSI) values of these beacon signals vary, a sensor node can estimate its orientation relative to the beacon node. From the estimated orientations and the exact positions of two distinct beacon nodes, the sensor can calculate its own location. We propose and implement four methods for the sensor node to estimate its orientations, and find that the strongest-signal (SS) method gives the most accurate orientation estimates. Using the SS method, we implement the RAL scheme and apply it to a WSN in a 10 m by 10 m indoor environment, with two beacon nodes at the two ends of one side. Our experiment shows that the average position estimation error of RAL is 76 centimeters. We further propose two methods, grid-based and vector-based approximation, that improve RAL by installing more than two beacon nodes. Simulation shows that these improvements reduce the position error by about 10%.
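The geometry behind RAL can be sketched in a few lines, under assumptions that the abstract does not spell out. First, the SS estimate is taken to be the antenna orientation advertised in the beacon packet received with the highest RSSI. Second, that orientation is treated as the bearing from the beacon to the sensor in a shared global frame. The position then follows from intersecting the two bearing rays. The function names and data layout are hypothetical.

```python
import math


def ss_orientation(samples):
    """Strongest-signal (SS) estimate (assumed form): return the antenna
    orientation, in radians, advertised by the beacon packet that arrived
    with the highest RSSI. `samples` holds (orientation, rssi_dbm) pairs."""
    orientation, _rssi = max(samples, key=lambda s: s[1])
    return orientation


def locate(p1, theta1, p2, theta2):
    """Intersect two bearing rays p1 + t1*d1 and p2 + t2*d2, where each
    theta is the direction from a beacon toward the sensor (global frame).
    Solves t1*d1 - t2*d2 = p2 - p1 with Cramer's rule."""
    d1 = (math.cos(theta1), math.sin(theta1))
    d2 = (math.cos(theta2), math.sin(theta2))
    det = -d1[0] * d2[1] + d1[1] * d2[0]     # determinant of [d1, -d2]
    if abs(det) < 1e-12:
        raise ValueError("bearings are parallel; position is undetermined")
    rx, ry = p2[0] - p1[0], p2[1] - p1[1]
    t1 = (-rx * d2[1] + ry * d2[0]) / det
    return (p1[0] + t1 * d1[0], p1[1] + t1 * d1[1])
```

The example mirrors the two-beacons-on-one-side layout, with beacons at (0, 0) and (10, 0). Bearings of atan2(4, 3) and atan2(4, -7) recover a sensor at (3, 4). The determinant check matters in practice: a sensor on the line through both beacons yields collinear bearings and an undetermined position.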
Citations: 22
Scaling Linear Algebra Kernels Using Remote Memory Access
Pub Date : 2010-09-13 DOI: 10.1109/ICPPW.2010.57
M. Krishnan, R. Lewis, Abhinav Vishnu
This paper describes the scalability of linear algebra kernels based on a remote memory access approach. The approach differs from other linear algebra algorithms in its explicit use of shared memory and remote memory access (RMA) communication rather than message passing, and it is suitable for clusters and scalable shared-memory systems. Experimental results on large-scale systems (a Linux InfiniBand cluster and a Cray XT) demonstrate consistent performance advantages over the ScaLAPACK suite, the leading implementation of parallel linear algebra algorithms in use today. For example, on a Cray XT4 with a matrix size of 102,400, our RMA-based matrix multiplication achieved over 55 teraflops, while ScaLAPACK’s pdgemm measured close to 42 teraflops on 10,000 processes.
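The contrast with message passing is that a process fetches the remote blocks it needs with one-sided gets, and the owning process takes no part in the transfer. The pure-Python emulation below illustrates that owner-computes pattern for a blocked matrix multiply. It is not the authors' implementation. GlobalArray and its get method are hypothetical stand-ins for a real RMA layer, and ranks run one after another rather than concurrently.

```python
class GlobalArray:
    """Hypothetical distributed matrix: square blocks of size bs, each owned
    by a rank. get() copies a block without any action by its owner,
    mimicking a one-sided RMA read (here just a copy from a shared dict)."""

    def __init__(self, matrix, bs, nranks):
        self.nb = len(matrix) // bs
        self.blocks, self.owner = {}, {}
        for bi in range(self.nb):
            for bj in range(self.nb):
                rows = matrix[bi * bs:(bi + 1) * bs]
                self.blocks[(bi, bj)] = [r[bj * bs:(bj + 1) * bs] for r in rows]
                self.owner[(bi, bj)] = (bi * self.nb + bj) % nranks

    def get(self, bi, bj):
        return [r[:] for r in self.blocks[(bi, bj)]]


def rma_matmul(a, b, bs, nranks):
    """C = A B where each rank computes the C blocks it owns, fetching the
    A and B blocks it needs with one-sided gets."""
    n = len(a)
    if n % bs:
        raise ValueError("matrix order must be a multiple of the block size")
    ga, gb = GlobalArray(a, bs, nranks), GlobalArray(b, bs, nranks)
    c = [[0] * n for _ in range(n)]
    for rank in range(nranks):                      # sequential emulation
        mine = [blk for blk, own in ga.owner.items() if own == rank]
        for bi, bj in mine:                         # owner-computes
            acc = [[0] * bs for _ in range(bs)]
            for k in range(ga.nb):
                ablk, bblk = ga.get(bi, k), gb.get(k, bj)  # one-sided fetches
                for i in range(bs):
                    for j in range(bs):
                        acc[i][j] += sum(ablk[i][t] * bblk[t][j]
                                         for t in range(bs))
            for i in range(bs):
                c[bi * bs + i][bj * bs:(bj + 1) * bs] = acc[i]
    return c
```

For scale, assume 102,400 is the order n of square matrices. A dense multiply then costs about 2n^3, or roughly 2.15 x 10^15, floating-point operations. At 55 and 42 teraflops that work takes roughly 39 s and 51 s respectively, about 1.3 times faster for the RMA version.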
Citations: 7
Quality of Surveillance Measures in K-Covered Heterogeneous Wireless Sensor Networks
Pub Date : 2010-09-13 DOI: 10.1109/ICPPW.2010.82
M. Wueng, I. Hwang
Heterogeneous wireless sensor networks (HWSNs), in which the deployed sensors have different capabilities, are increasingly used to perform critical surveillance in the real world. To conserve energy, powerful sensors are usually activated only when an event is detected. As a result, low-cost, error-prone sensors dominate the quality of surveillance (QoSu) when an event of interest first appears. To guarantee a desired QoSu, deploying a K-coverage configuration in HWSNs has attracted much attention. However, little work addresses the important problem of measuring the fault-tolerance level of a K-coverage configuration in HWSNs. In this paper, we first propose an energy-efficient eligibility approach that achieves K-coverage in HWSNs at very low cost. We then formalize QoSu in terms of explicit metrics, such as the probabilities of system false positives and system false negatives. An appropriate K-coverage deployment can thus be determined from a desired QoSu while prolonging the system lifetime.
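The abstract does not give its formulas. The following is therefore only one common way such metrics are modeled, and it rests on stated assumptions. Each of the K sensors covering a point independently raises a false alarm with probability p_fa and misses a real event with probability p_miss. The system reports an event when at least t sensors do. System false positives and false negatives are then binomial tail probabilities, and the smallest adequate K can be searched for directly. The majority-vote threshold and all names are hypothetical.

```python
from math import comb


def system_false_positive(k, t, p_fa):
    """P(at least t of k independent sensors raise a false alarm)."""
    return sum(comb(k, i) * p_fa ** i * (1 - p_fa) ** (k - i)
               for i in range(t, k + 1))


def system_false_negative(k, t, p_miss):
    """P(fewer than t of k independent sensors detect a real event)."""
    p_hit = 1 - p_miss
    return sum(comb(k, i) * p_hit ** i * p_miss ** (k - i)
               for i in range(t))


def smallest_k(p_fa, p_miss, max_fp, max_fn, k_max=15):
    """Smallest coverage degree k, with a majority-vote threshold (an
    assumption), whose system error probabilities meet both targets."""
    for k in range(1, k_max + 1):
        t = k // 2 + 1
        if (system_false_positive(k, t, p_fa) <= max_fp
                and system_false_negative(k, t, p_miss) <= max_fn):
            return k, t
    return None
```

Take p_fa = 0.1 and p_miss = 0.2. With K = 3 and a 2-of-3 vote, the system false-positive probability is 0.028 and the false-negative probability is 0.104.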
Citations: 2
Mixed-Tool Performance Analysis on Hybrid Multicore Architectures
Pub Date : 2010-09-13 DOI: 10.1109/ICPPW.2010.41
Peng Du, P. Luszczek, S. Tomov, J. Dongarra
This paper proposes a triangular solve algorithm with variable block size for graphics processing units (GPUs). The algorithm recursively inverts the diagonal blocks, and its block size can be tuned for best performance. We show several ways to use existing profiling tools to measure and analyze the performance of this algorithm. We exploit the strengths of some of the most popular CPU and GPU profiling tools and overcome their weaknesses with several new techniques. These techniques analyze the performance of, and relationships among, different application components. The presented methodologies yield insights that help to understand and tune the proposed algorithm and considerably improve the performance of both the solver itself and the applications that use it.
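The diagonal-block inversion idea can be illustrated on the CPU with a small sketch, assuming a lower-triangular system $Lx = b$. Each diagonal block is inverted recursively using $\begin{pmatrix} A & 0 \\ C & D \end{pmatrix}^{-1} = \begin{pmatrix} A^{-1} & 0 \\ -D^{-1} C A^{-1} & D^{-1} \end{pmatrix}$. After that, each block step is a plain matrix-vector product. This is a Python illustration with a tunable block size nb, not the authors' GPU kernel. A plausible motivation, not stated in the abstract, is that matrix-vector products parallelize better on a GPU than sequential substitution.

```python
def matmul(a, b):
    """Dense matrix product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def inv_lower(l):
    """Invert a lower-triangular matrix by recursive 2x2 block splitting."""
    n = len(l)
    if n == 1:
        return [[1.0 / l[0][0]]]
    h = n // 2
    a = [row[:h] for row in l[:h]]          # top-left block A
    c = [row[:h] for row in l[h:]]          # bottom-left block C
    d = [row[h:] for row in l[h:]]          # bottom-right block D
    inv_a, inv_d = inv_lower(a), inv_lower(d)
    lower_left = [[-v for v in row] for row in matmul(matmul(inv_d, c), inv_a)]
    top = [inv_a[i] + [0.0] * (n - h) for i in range(h)]
    bottom = [lower_left[i] + inv_d[i] for i in range(n - h)]
    return top + bottom


def blocked_trsv(l, b, nb):
    """Solve L x = b with block size nb using inverted diagonal blocks:
    x_I = inv(L_II) * (b_I - sum_{J < I} L_IJ x_J)."""
    n = len(l)
    x = [0.0] * n
    for s in range(0, n, nb):
        e = min(s + nb, n)                  # last block may be smaller
        inv_diag = inv_lower([row[s:e] for row in l[s:e]])
        # Subtract contributions of the already-solved unknowns x[0:s].
        r = [b[i] - sum(l[i][j] * x[j] for j in range(s)) for i in range(s, e)]
        # The diagonal step is a matrix-vector product, not substitution.
        for i in range(e - s):
            x[s + i] = sum(inv_diag[i][j] * r[j] for j in range(e - s))
    return x
```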
Citations: 2
On Better Performance from Scheduling Threads According to Resource Demands in MMMP
Pub Date : 2010-09-13 DOI: 10.1109/ICPPW.2010.53
L. Weng, Chen Liu
A multi-core multi-threading microprocessor (MMMP) not only lets threads on the same core share resources, such as computation resources and private caches, but also isolates those resources across different cores. Moreover, when a simultaneous multithreading (SMT) architecture is employed, execution resources are fully shared among the threads executing concurrently on the same core, while isolation becomes more severe as the number of cores increases. Fetch policies, which decide how to assign priorities in the fetch stage, are well designed for managing the shared resources within a core. It is actually the scheduling policy, however, that makes the distributed resources available to workloads, by deciding which cores their threads are sent to. Meanwhile, threads consume different resources in different phases, and Cycles Per Instruction spent on Memory (CPImem) is used to express their resource demands. To achieve better performance by scheduling threads according to their resource demands, we propose Mix-Scheduling. It mixes threads evenly across cores so that every core has thread diversity, i.e., CPImem diversity. Our experiments compare Mix-Scheduling with the reference policy Mono-Scheduling, which keeps CPImem uniform among the threads on each core. Mix-Scheduling improves overall system throughput by 63% and average thread performance by 27%. Furthermore, Mix-Scheduling is an essential step toward shortening load latency, as it reduces the L2 cache miss rate by 6% relative to Mono-Scheduling.
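The abstract does not say how threads are mapped to cores, so the sketch below is a hypothetical reading of the two policies. Mono-Scheduling places threads with similar CPImem on the same core. Mix-Scheduling deals CPImem-sorted threads to cores in serpentine order, so every core receives a mix of memory-bound and compute-bound threads. CPImem values are taken as given inputs.

```python
def mono_schedule(cpimem, ncores):
    """Hypothetical Mono-Scheduling: sort threads by CPImem and give each
    core a contiguous run, so threads on a core have similar CPImem."""
    order = sorted(range(len(cpimem)), key=lambda t: cpimem[t])
    per_core = -(-len(order) // ncores)          # ceiling division
    return [order[c * per_core:(c + 1) * per_core] for c in range(ncores)]


def mix_schedule(cpimem, ncores):
    """Hypothetical Mix-Scheduling: deal CPImem-sorted threads to cores in
    serpentine order (0..n-1, then n-1..0, ...) so each core mixes low- and
    high-CPImem threads."""
    order = sorted(range(len(cpimem)), key=lambda t: cpimem[t])
    cores = [[] for _ in range(ncores)]
    for idx, thread in enumerate(order):
        rnd, pos = divmod(idx, ncores)
        core = pos if rnd % 2 == 0 else ncores - 1 - pos
        cores[core].append(thread)
    return cores
```

Consider eight threads on four cores. The mixed assignment pairs the lowest-CPImem thread with the highest, the second-lowest with the second-highest, and so on. This is the per-core diversity the policy aims for.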
Citations: 11
Smartphone Evolution and Reuse: Establishing a More Sustainable Model
Pub Date : 2010-09-13 DOI: 10.1109/ICPPW.2010.70
Xun Li, Pablo J. Ortiz, Jeffrey Browne, Diana Franklin, J. Oliver, R. Geyer, Yuanyuan Zhou, F. Chong
The dark side of Moore's Law is our society's insatiable need to constantly upgrade our computing devices. The high costs of manufacturing energy, materials, and disposal become more worrisome as the number of smartphones grows. Repurposing smartphones for education is a promising idea that has shown success in recent years. Our previous work showed that although smartphone components degrade with use, their functionality, available resources, and power supplies can still satisfy the requirements of educational applications. In this study, we demonstrate the potential benefits of reusing smartphones by analyzing their manufacturing and lifetime energy. The key challenge is designing software that can adapt to the extreme heterogeneity of devices. We also characterize the different types of heterogeneity among generations of smartphones from HTC and Apple, including processing capability, storage resources, and various features. We offer insights to help establish a sustainable model for designing mobile applications for phone reuse.
Citations: 53
Collaborative Spatial Object Recommendation in Location Based Services
Pub Date : 2010-09-13 DOI: 10.1109/ICPPW.2010.16
G. Gupta, Wang-Chien Lee
Recommendation systems have found their way into many online web applications, e.g., product recommendation on Amazon and movie recommendation on Netflix. In particular, collaborative filtering techniques have been widely used in these systems to personalize recommendations according to the needs and tastes of users. In this paper, we apply collaborative filtering to spatial object recommendation, which is essential in many location-based services. Due to the large number of spatial objects and participating users, using collaborative filtering to obtain recommendations for a particular user can be very expensive. However, we observe that users tend to have an affinity for certain regions, and we argue that drawing on users with similar regional bias in recommendation may help reduce the search space of similar users. Thus, we propose two techniques, namely Access Minimum Bounding Rectangle Overlapped Area (AMBROA) and Grid Division Cosine Similarity (GDCS), to form regions of interest that represent users' location interests and activities and to find users with local access similarity, facilitating effective spatial object recommendation. We conduct an extensive performance evaluation to validate our ideas. The evaluation results demonstrate the superiority of our proposal over the conventional approach.
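The abstract names AMBROA and GDCS but does not define them. The Python sketch below is one plausible reading inferred only from the two names, not the authors' implementation: AMBROA is taken to score two users by the overlap area of the minimum bounding rectangles around their accessed locations, and GDCS by the cosine similarity of their access counts over a uniform grid. The (x, y) point representation, the `cell_size` parameter, and the top-k neighbor filter in `most_similar_users` are hypothetical choices added for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from math import floor, sqrt

# Hypothetical representation: a user's access history is a list of (x, y) locations.
Point = tuple[float, float]


@dataclass(frozen=True)
class Rect:
    min_x: float
    min_y: float
    max_x: float
    max_y: float

    def area(self) -> float:
        # Clamp negative extents to zero so disjoint rectangles yield area 0.
        return max(0.0, self.max_x - self.min_x) * max(0.0, self.max_y - self.min_y)


def access_mbr(points: list[Point]) -> Rect:
    """Minimum bounding rectangle of every location a user accessed (needs >= 1 point)."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return Rect(min(xs), min(ys), max(xs), max(ys))


def mbr_overlap_area(a: list[Point], b: list[Point]) -> float:
    """AMBROA-style score (assumed reading): area where two users' access MBRs intersect."""
    ra, rb = access_mbr(a), access_mbr(b)
    overlap = Rect(max(ra.min_x, rb.min_x), max(ra.min_y, rb.min_y),
                   min(ra.max_x, rb.max_x), min(ra.max_y, rb.max_y))
    return overlap.area()


def grid_histogram(points: list[Point], cell_size: float) -> Counter:
    """Count accesses per uniform grid cell; cell_size is a hypothetical parameter."""
    return Counter((floor(x / cell_size), floor(y / cell_size)) for x, y in points)


def grid_cosine_similarity(a: list[Point], b: list[Point], cell_size: float) -> float:
    """GDCS-style score (assumed reading): cosine similarity of per-cell access counts."""
    ha, hb = grid_histogram(a, cell_size), grid_histogram(b, cell_size)
    dot = sum(ha[c] * hb[c] for c in ha.keys() & hb.keys())
    norm = sqrt(sum(v * v for v in ha.values())) * sqrt(sum(v * v for v in hb.values()))
    return dot / norm if norm else 0.0


def most_similar_users(target: list[Point], others: dict[str, list[Point]],
                       k: int, cell_size: float) -> list[str]:
    """Keep only the k most regionally similar users as collaborative-filtering neighbors."""
    ranked = sorted(others,
                    key=lambda name: grid_cosine_similarity(target, others[name], cell_size),
                    reverse=True)
    return ranked[:k]
```

In this reading, either score acts as a pre-filter: only the top-ranked users are handed to the collaborative-filtering stage, which is how restricting attention to users with similar regional bias could shrink the neighbor search space. The overlap area is left unnormalized here for simplicity.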
Citations: 10
Journal: 2010 39th International Conference on Parallel Processing Workshops