2015 International Conference on High Performance Computing & Simulation (HPCS): Latest Articles


Self-optimization of power parameters in WCDMA networks

2015 International Conference on High Performance Computing & Simulation (HPCS)

Pub Date : 2015-07-20 DOI: 10.1109/HPCSim.2015.7237024

Harrison Mfula, T. Isotalo, J. Nurminen

Network optimization is used by operators to maximize return on investment and to ensure customer satisfaction with the quality of the delivered service. Coverage and capacity are the most important characteristics of any cellular network. In WCDMA networks, the pilot signal of a cell is used to determine the cell size, and hence the coverage area of the cell. Increasing or reducing the cell pilot power increases or reduces the cell size, respectively, so pilot power can be used to balance load among neighboring cells. As networks continue to evolve, the frequency of optimization and the number of tunable parameters continue to increase, making manual optimization challenging. This paper presents a practical solution to the pilot power optimization problem in WCDMA networks and addresses the issue of rising optimization complexity by presenting a self-optimization-based algorithm for tuning pilot power. When running in closed loop, the algorithm can be used to autonomously optimize pilot power and load-balance traffic in the network. When scheduled or triggered manually, the algorithm can also be used to improve network capacity in areas expecting high traffic load during a certain time, for example during social gatherings.



Citations: 4

GPGPU performance evaluation of some basic molecular dynamics algorithms

2015 International Conference on High Performance Computing & Simulation (HPCS)

Pub Date : 2015-07-20 DOI: 10.1109/HPCSim.2015.7237104

A. Minkin, A. Teslyuk, A. Knizhnik, B. Potapkin

Molecular dynamics is a computationally intensive problem, but it is extremely amenable to parallel computation. Many-body potentials used for modeling carbon and metallic nanostructures usually require much more computing resources than pair potentials. One way to improve their performance is to adapt them to run on computing systems that combine CPUs and GPUs. In this work, the OpenCL performance of basic molecular dynamics algorithms, such as neighbor list generation, is evaluated along with different implementations of energy-force computation for the Lennard-Jones, Tersoff and EAM potentials. It is shown that concurrent memory writes are effective for the Tersoff bond-order potential but not for the embedded-atom potential. Performance measurements show a significant GPU acceleration of basic molecular dynamics algorithms over the corresponding serial implementations.


Citations: 5

Transient performance evaluation of cloud computing applications and dynamic resource control in large-scale distributed systems

2015 International Conference on High Performance Computing & Simulation (HPCS)

Pub Date : 2015-07-20 DOI: 10.1109/HPCSim.2015.7237046

Edwin L. C. Mamani, L. A. P. Júnior, M. J. Santana, R. Santana, Pedro Northon Nobile, F. J. Monaco

This paper discusses non-stationary performance evaluation and dynamic modeling of cloud computing environments. In computer systems, dynamic effects result from the filling of buffers, event-handling delays, non-deterministic I/O response times and network latency, among other factors. While performance evaluation of computer systems under stationary workloads has met the needs of many engineering problems, new challenges arise as the deployment of increasingly complex and large-scale distributed systems becomes commonplace. One key aspect of this discussion is that transient analysis models how the system reacts to changes in the workload. It may reveal that the resources necessary to support a high steady-state workload are not sufficient to handle a small but sudden workload change, even one of intensity far smaller than that supported by the system's stationary capacity. This article elaborates on these issues under a control-theoretical approach.


Citations: 6

Efficient storage scheme for n-dimensional sparse array: GCRS/GCCS

2015 International Conference on High Performance Computing & Simulation (HPCS)

Pub Date : 2015-07-20 DOI: 10.1109/HPCSim.2015.7237032

Md Abu Hanif Shaikh, K. Hasan

The degree of data sparsity increases with the number of dimensions in high performance scientific computing. Storing and operating on such highly sparse multidimensional data is still a challenge for data scientists. Experts suggest special storage schemes for sparse arrays. In the traditional sparse array storage scheme, (n+1) one-dimensional arrays are necessary to store an n-dimensional array. In this paper, we propose the "Generalized Row/Column Storage (GCRS/GCCS)" scheme, which requires only three one-dimensional arrays to store an n-dimensional array. The superiority of GCRS/GCCS over traditional Compressed Row/Column Storage (CRS/CCS) is shown by both theoretical analysis and experimental results. In the theoretical analysis, we derive equations for space and time complexity as well as the range of usability of GCRS/GCCS. It is shown that the GCRS/GCCS scheme supports a minimum of 50% data density, whereas the range of usability of the CRS/CCS scheme is inversely proportional to the number of dimensions. The experimental results show that the proposed GCRS/GCCS scheme outperforms the CRS/CCS scheme with respect to space complexity, time complexity and range of usability.
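
For readers unfamiliar with the baseline, the following is a minimal sketch of conventional two-dimensional CRS, which stores a matrix in three one-dimensional arrays. The abstract does not describe the GCRS/GCCS layout itself, so this illustrates only the well-known 2-D scheme it is compared against. The function names `to_crs` and `crs_get` are hypothetical and chosen for illustration.

```python
def to_crs(dense):
    """Convert a 2-D list-of-lists matrix to CRS.

    Returns (values, col_idx, row_ptr):
      values  - non-zero entries in row-major order
      col_idx - column index of each entry in `values`
      row_ptr - row_ptr[i]..row_ptr[i+1] delimits row i inside `values`
    """
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, x in enumerate(row):
            if x != 0:
                values.append(x)
                col_idx.append(j)
        # After finishing a row, record where the next row starts.
        row_ptr.append(len(values))
    return values, col_idx, row_ptr


def crs_get(crs, i, j):
    """Look up element (i, j); entries that are not stored are zero."""
    values, col_idx, row_ptr = crs
    for k in range(row_ptr[i], row_ptr[i + 1]):
        if col_idx[k] == j:
            return values[k]
    return 0


matrix = [[5, 0, 0],
          [0, 0, 3],
          [0, 2, 0]]
crs = to_crs(matrix)
print(crs)
```

In this 2-D case the storage cost is two entries per non-zero plus one pointer per row, which is why the scheme only pays off when the array is sufficiently sparse.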



Citations: 12

A survey on Information Flow Control mechanisms in web applications

2015 International Conference on High Performance Computing & Simulation (HPCS)

Pub Date : 2015-07-20 DOI: 10.1109/HPCSim.2015.7237042

Oscar Zibordi de Paiva, W. Ruggiero

Web applications are nowadays ubiquitous channels that provide access to valuable information. However, web application security remains problematic, with Information Leakage, Cross-Site Scripting and SQL-Injection vulnerabilities, all of which threaten information, standing among the most common ones. On the other hand, Information Flow Control is a mature and well-studied area, providing techniques to ensure the confidentiality and integrity of information. Thus, numerous works have proposed the use of these techniques to improve web application security. This paper provides a survey of some of these works that propose server-side-only mechanisms, which operate in association with standard browsers. It also provides a brief overview of the information flow control techniques themselves. Finally, we compare the surveyed works, highlighting the environments for which they were designed and the security guarantees they provide, and suggesting directions in which they may evolve.


Citations: 2

Performance evaluation of Data Mining algorithms on three generations of Intel® microarchitecture

2015 International Conference on High Performance Computing & Simulation (HPCS)

Pub Date : 2015-07-20 DOI: 10.1109/HPCSim.2015.7237059

S. Sadasivam, S. Selvi

Data Mining algorithms and machine learning techniques form a key part of the majority of computing applications today. They are becoming an inherent part of business decision processes, e-commerce, social networking and social media applications as well as commercial and scientific computing applications. It is becoming increasingly important to provide a high performance computing platform for these emerging data mining applications. In this paper we explore the performance characteristics of the data mining benchmark suite MineBench across three “tock” generations of Intel microarchitecture. Our objective is to study the impact of microarchitecture improvements on the performance of data mining algorithms. We present comparative microarchitecture characteristics between data mining algorithms and SPEC INT 2006 benchmarks. We have proposed a generic cycle accounting methodology to attribute performance improvements to various units of the microprocessor. The proposed methodology helps differentiate the impact on performance due to front-end and back-end microarchitecture improvements.



Citations: 1

Utilization of room-to-room transition time in Wi-Fi fingerprint-based indoor localization

2015 International Conference on High Performance Computing & Simulation (HPCS)

Pub Date : 2015-07-20 DOI: 10.1109/HPCSim.2015.7237056

Isil Karabey, Levent Bayindir

In indoor localization applications, many different methods have been proposed to increase positioning accuracy. Among these methods, fingerprint-based techniques are generally preferred because they use existing resources such as Wi-Fi, Bluetooth, FM signals, etc., and can be implemented on commonly used devices such as mobile phones. In this paper, we evaluate different Wi-Fi fingerprint-based methods on two datasets (with and without room-to-room transition features) created from the same environment, and we investigate the impact of room-to-room transition features on classification performance. To the best of our knowledge, transition time between rooms has not been used in past studies on fingerprint-based indoor localization. This information is of significant importance, due to the physical distance between rooms. Therefore, in this study source room and transition time to a target room have been included as features in addition to signal sources and signal strength values in the target room. From preliminary experimental results we observed that the transition time between rooms increases the performance of all tested positioning algorithms, with the Back-propagation classifier showing the best performance increase (13%).



Citations: 3

Applying domain decomposition Schwarz method to accelerate wind field calculation

2015 International Conference on High Performance Computing & Simulation (HPCS)

Pub Date : 2015-07-20 DOI: 10.1109/HPCSim.2015.7237080

Gemma Sanjuan, T. Margalef, A. Cortés

Wind field is a critical issue in forest fire propagation prediction. However, wind field calculation is a complex problem that, for large terrains, involves solving huge linear systems. Solving such systems takes too much time and makes the approach infeasible for real-time operation. To overcome this problem, the Schwarz alternating domain decomposition method can be applied. Using this method, the linear system is decomposed into a set of overlapping subdomains that can be solved in parallel using a Master/Worker paradigm, and the wind field calculation time can be significantly reduced.
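
To make the idea of overlapping subdomains concrete, here is a minimal sketch of the Schwarz alternating method on a toy problem: the 1-D Poisson equation -u'' = f on [0, 1] with u(0) = u(1) = 0, split into two overlapping subdomains. This is an illustrative assumption, not the wind field model or the solver used in the paper. The grid size, the overlap indices `a` and `b`, and the function names are all hypothetical, and the two subdomain solves run one after the other here rather than under a Master/Worker scheme.

```python
def solve_dirichlet(f, left, right, h):
    """Solve -u'' = f on the interior points of one subdomain.

    Uses the Thomas algorithm on the tridiagonal system (-1, 2, -1),
    with `left` and `right` as Dirichlet boundary values.
    """
    m = len(f)
    rhs = [h * h * fi for fi in f]
    rhs[0] += left
    rhs[-1] += right
    c = [0.0] * m  # modified super-diagonal
    d = [0.0] * m  # modified right-hand side
    c[0] = -1.0 / 2.0
    d[0] = rhs[0] / 2.0
    for i in range(1, m):
        denom = 2.0 + c[i - 1]
        c[i] = -1.0 / denom
        d[i] = (rhs[i] + d[i - 1]) / denom
    u = [0.0] * m
    u[-1] = d[-1]
    for i in range(m - 2, -1, -1):
        u[i] = d[i] - c[i] * u[i + 1]
    return u


def schwarz_alternating(f, n, a, b, sweeps=50):
    """Alternating Schwarz on grid points 0..n, requiring 0 < a < b < n.

    Subdomain 1 has unknowns 1..b-1 (boundary at 0 and b).
    Subdomain 2 has unknowns a+1..n-1 (boundary at a and n).
    The overlap is the index range a..b.
    """
    h = 1.0 / n
    u = [0.0] * (n + 1)
    for _ in range(sweeps):
        # Subdomain 1 takes its right boundary value from the current u[b].
        u[1:b] = solve_dirichlet(f[1:b], u[0], u[b], h)
        # Subdomain 2 takes its left boundary value from the updated u[a].
        u[a + 1:n] = solve_dirichlet(f[a + 1:n], u[a], u[n], h)
    return u


n = 20
f = [2.0] * (n + 1)  # -u'' = 2 has the exact solution u(x) = x (1 - x)
u = schwarz_alternating(f, n, a=8, b=12)
max_error = max(abs(u[i] - (i / n) * (1 - i / n)) for i in range(n + 1))
print(max_error)
```

A wider overlap makes each sweep reduce the error faster, at the cost of more duplicated work per subdomain.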


Citations: 6

Fast and scalable NUMA-based thread parallel breadth-first search

2015 International Conference on High Performance Computing & Simulation (HPCS)

Pub Date : 2015-07-20 DOI: 10.1109/HPCSim.2015.7237065

Yuichiro Yasui, K. Fujisawa

The breadth-first search (BFS) is one of the most central kernels in graph processing. Beamer's direction-optimizing BFS algorithm, which selects one of two traversal directions at each level, can reduce unnecessary edge traversals. In a previous paper, we presented an efficient BFS for a non-uniform memory access (NUMA)-based system, in which the NUMA architecture was carefully considered. In this paper, we investigate the locality of memory accesses in terms of the communication with remote memories in a BFS for a NUMA system, and describe a fast and highly scalable implementation. Our new implementation achieves performance rates of 174.704 billion edges per second for a Kronecker graph with 2^33 vertices and 2^37 edges on two racks of an SGI UV 2000 system with 1,280 threads. The implementations described in this paper achieved the fastest entries for a shared-memory system in the June 2014 and November 2014 Graph500 lists, and produced the most energy-efficient entries in the second, third, and fourth Green Graph500 lists (big data category).
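
The per-level choice between two traversal directions can be sketched as follows. This is a simplified, single-threaded illustration, not the NUMA-aware implementation described in the abstract. The switching rule (frontier size times `alpha` compared with the number of unvisited vertices) is an assumption made for brevity. Beamer's published heuristic is based on edge counts instead.

```python
def direction_optimizing_bfs(adj, source, alpha=4):
    """BFS over an undirected graph given as {vertex: [neighbours]}.

    Returns {vertex: level} for every vertex reachable from `source`.
    `alpha` is a hypothetical tuning knob for the direction switch.
    """
    level = {source: 0}
    frontier = {source}
    depth = 0
    n = len(adj)
    while frontier:
        depth += 1
        unvisited = n - len(level)
        next_frontier = set()
        if len(frontier) * alpha > unvisited:
            # Bottom-up: each unvisited vertex searches for a parent in the
            # frontier and stops at the first hit, skipping its other edges.
            for v in adj:
                if v not in level:
                    for u in adj[v]:
                        if u in frontier:
                            next_frontier.add(v)
                            break
        else:
            # Top-down: each frontier vertex scans all of its neighbours.
            for u in frontier:
                for v in adj[u]:
                    if v not in level:
                        next_frontier.add(v)
        for v in next_frontier:
            level[v] = depth
        frontier = next_frontier
    return level


graph = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
levels = direction_optimizing_bfs(graph, 0)
print(levels)
```

Setting `alpha=0` disables the bottom-up step entirely, which gives a plain top-down BFS for comparison.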


Citations: 23

The Batched DOACROSS loop parallelization algorithm

2015 International Conference on High Performance Computing & Simulation (HPCS)

Pub Date : 2015-07-20 DOI: 10.1109/HPCSim.2015.7237079

D. C. S. Lucas, G. Araújo

Parallelizing loops containing loop-carried dependencies has been considered a very difficult task, mainly due to the overhead imposed by communicating dependencies between iterations. Despite the huge effort to devise effective parallelization techniques for such loops, the problem is still far from solved. For many loops, old (DOACROSS) and new (DSWP) techniques have not been able to offer a solution to this problem. This paper presents a qualitative and quantitative analysis of the synchronization costs of these two loop parallelization algorithms on two modern computer architectures (ARM A9 MPCore and Intel Ivy Bridge). Our results show that at least 30% of the execution time of the programs we parallelized is spent on synchronization/data communication. We also show that, besides the problem being hard, these architectures are on opposite endpoints along the axis of commonly accepted requisites for efficient loop parallelization. As a consequence, both techniques struggle to effectively speed up several programs. Moreover, this paper presents a novel algorithm, called Batched DOACROSS (BDX), that capitalizes on the advantages of DSWP and DOACROSS while minimizing their deficiencies. BDX does not require new hardware mechanisms (as DSWP does) and makes use of thread-local buffers to reduce DOACROSS synchronization overheads.



Citations: 8
