
Latest publications: 2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)

Towards Pervasive Containerization of HPC Job Schedulers
C. Cérin, Nicolas Grenèche, Tarek Menouer
In cloud computing, elasticity is defined as "the degree to which a system is able to adapt to workload changes by provisioning and de-provisioning resources in an autonomic manner, such that at each point in time the available resources match the current demand as closely as possible". Adding elasticity to HPC (High Performance Computing) cluster management systems remains challenging even when such HPC systems are deployed in today's cloud environments. The difficulty stems from the fact that HPC job schedulers rely on a fixed set of resources: every change of topology (adding or removing computing resources) leads to a global restart of the scheduler. This is not a major drawback, because it provides a very effective way of sharing a fixed set of resources, but we think it can be complemented by a more elastic approach. Moreover, the elasticity issue should not be reduced to resource scaling alone: clouds also enable access to various technologies that enhance the services offered to users. In this paper, our approach is to use container technology to instantiate a tailored HPC environment based on the user's reservation constraints. We claim that introducing containers into HPC job schedulers allows better, more economical management of resources. From the use case of SLURM, we present a methodology for the 'containerization' of HPC job schedulers that is pervasive, i.e., spreading widely throughout all layers of the job scheduler. We also provide initial experiments demonstrating that our containerized SLURM system is operational and promising.
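The tailoring step the abstract describes can be pictured as mapping a reservation's constraints onto a container invocation. The sketch below is purely illustrative: the constraint fields, image names, and the use of plain `docker run` are my assumptions, not the paper's SLURM integration.

```python
# Hypothetical sketch: translate a user's reservation constraints into a
# container invocation, in the spirit of a tailored per-reservation HPC
# environment. Field names and images are invented for illustration.

def container_command(constraints: dict) -> list:
    """Build a container launch command from reservation constraints."""
    image = constraints.get("image", "hpc-base:latest")  # tailored environment
    cmd = ["docker", "run", "--rm"]
    if "cpus" in constraints:
        cmd += [f"--cpus={constraints['cpus']}"]         # pin CPU share
    if "mem_gb" in constraints:
        cmd += [f"--memory={constraints['mem_gb']}g"]    # cap memory
    cmd += [image] + constraints.get("argv", ["/bin/true"])
    return cmd

print(container_command({"cpus": 4, "mem_gb": 8, "image": "openmpi:4.1"}))
```

A real integration would emit this from inside the scheduler's job prolog rather than from user code.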
DOI: 10.1109/SBAC-PAD49847.2020.00046 (published 2020-09-01)
Citations: 7
A Robotic Communication Middleware Combining High Performance and High Reliability
Wei Liu, Hao Wu, Ziyue Jiang, Yifan Gong, Jiangming Jin
With the significant advances of AI technology, intelligent robotic systems have achieved remarkable development and profound impact. To enable massive data transmission in an efficient and reliable way, both high performance and high reliability should be taken into account in system design. However, the conventional communication middleware used in the majority of autonomous robotic systems relies on socket-based methods, which often lead to high latency. Moreover, some sophisticated communication middleware utilizes shared memory over ring buffers for high performance without consideration of reliability. To obtain both high performance and high reliability, we employ shared memory for performance improvement and propose a novel socket-based communication control algorithm to improve reliability during data transmission. Furthermore, based on the proposed algorithm, we implement a novel robotic communication middleware, named Robust-Z, combining both high performance and high reliability. Experimental results show that (1) Robust-Z gains up to 41% and 5% performance improvement compared to ROS2 and Apollo CyberRT, respectively; and (2) Robust-Z provides crash safety and reduces the data missing rate by 5.2% compared with CyberRT.
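The combination described above, a shared-memory fast path guarded by a socket-style control protocol, can be modeled in a few lines. This is a single-process toy (the names, slot count, and NACK-on-overwrite rule are invented for illustration), not Robust-Z's actual design:

```python
# Fast path: writes land in a ring buffer (standing in for shared memory).
# Control path: sequence numbers (standing in for the socket channel) let the
# reader detect a slot that a fast writer has already overwritten, so the
# writer can retransmit instead of silently losing data.

class RingChannel:
    def __init__(self, slots=8):
        self.buf = [None] * slots      # shared-memory ring (modeled as a list)
        self.slots = slots
        self.seq = 0                   # next sequence number to write

    def send(self, msg):
        self.buf[self.seq % self.slots] = (self.seq, msg)
        self.seq += 1
        return self.seq - 1            # sequence number, sent on control channel

    def recv(self, expected_seq):
        slot = self.buf[expected_seq % self.slots]
        if slot is None or slot[0] != expected_seq:
            return None                # slot overwritten or missing: NACK
        return slot[1]                 # deliver payload (implicit ack)

ch = RingChannel(slots=4)
s = ch.send(b"lidar-frame-0")
assert ch.recv(s) == b"lidar-frame-0"
for i in range(1, 6):                  # a fast writer laps the small ring...
    ch.send(f"frame-{i}".encode())
print(ch.recv(s))                      # ...so seq 0 is gone: None -> retransmit
```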
DOI: 10.1109/SBAC-PAD49847.2020.00038 (published 2020-09-01)
Citations: 7
OmpTracing: Easy Profiling of OpenMP Programs
Vitoria Pinho, H. Yviquel, M. Pereira, G. Araújo
One of the greatest challenges of modern computing is the development of software for parallel execution. To address this challenge, programmers use profiling tools to record relevant operations, such as the communications that the different parts of an application carry out during its execution. Profilers can be used to analyze the execution of an application, as they enable the programmer to identify its performance hot spots and sources of overhead. This paper introduces the OmpTracing library, a lightweight tool that eases the task of profiling OpenMP-based applications without the need to inject expensive profiling code into the program. OmpTracing leverages OMPT, an application programming interface that provides an introspection mechanism for the OpenMP runtime and enables the programmer to capture execution details of the parallelized application while generating notifications about significant program events.
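OMPT's key property is that the runtime pushes events to registered callbacks, so no profiling code is injected into the application itself. As a loose Python analogy (not the OMPT C API), `sys.setprofile` registers a callback that observes call and return events of a function without modifying it:

```python
# Callback-based observation, in the spirit of OMPT: the profiled function
# 'work' is untouched; a registered callback records its events.
import sys
import time

events = []

def tracer(frame, event, arg):
    # record only Python-level call/return events for 'work'
    if event in ("call", "return") and frame.f_code.co_name == "work":
        events.append((event, time.perf_counter()))

def work(n):
    return sum(i * i for i in range(n))

sys.setprofile(tracer)   # register the callback (cf. ompt_set_callback)
work(10_000)
sys.setprofile(None)     # deregister

calls = [e for e, _ in events]
print(calls)             # the timestamps could be paired into durations
```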
DOI: 10.1109/SBAC-PAD49847.2020.00042 (published 2020-09-01)
Citations: 0
Design Space Exploration of Accelerators and End-to-End DNN Evaluation with TFLITE-SOC
Nicolas Bohm Agostini, Shi Dong, Elmira Karimi, Marti Torrents Lapuerta, José Cano, José L. Abellán, D. Kaeli
Recently there has been a rapidly growing demand for faster machine learning (ML) processing in data centers and for the migration of ML inference applications to edge devices. These developments have prompted both industry and academia to explore custom accelerators that optimize ML executions for performance and power. However, identifying which accelerator is best equipped for a particular ML task is challenging, especially given the growing range of ML tasks, the number of target environments, and the limited number of integrated modeling tools. To tackle this issue, it is of paramount importance to provide the computer architecture research community with a common framework capable of performing a comprehensive, uniform, and fair comparison across different accelerator designs targeting a particular ML task. To this end, we propose a new framework named TFLITE-SOC (System On Chip) that integrates a lightweight system modeling library (SystemC) for fast design space exploration of custom ML accelerators into the build/execution environment of TensorFlow Lite (TFLite), a highly popular framework for ML inference. Using this approach, we are able to model and evaluate new accelerators developed in SystemC by leveraging the language's hierarchical design capabilities, resulting in faster design prototyping. Furthermore, any accelerator designed with TFLITE-SOC can be benchmarked for inference with any DNN model compatible with TFLite, which enables end-to-end DNN processing and detailed (i.e., per-DNN-layer) performance analysis. In addition to providing rapid prototyping, integrated benchmarking, and a range of platform configurations, TFLITE-SOC offers comprehensive performance analysis of accelerator occupancy and execution time breakdown, as well as a rich set of modules that new accelerators can use to implement scaling studies and optimized memory transfer protocols. We present our framework and demonstrate its utility by considering the design space of a TPU-like systolic array and describing possible directions for optimization. Using a compression technique, we implement an optimization that reduces the memory traffic between DRAM and on-device buffers. Compared to the baseline accelerator, our optimized design shows up to 1.26x speedup on accelerated operations and up to 1.19x speedup on end-to-end DNN execution.
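A rough analytical sketch of the kind of trade-off such a framework explores: estimated cycles and DRAM traffic for a tiled matmul on a small systolic array, and the effect of compressing off-chip transfers. All parameters and formulas below are back-of-the-envelope assumptions, not TFLITE-SOC's SystemC model:

```python
# Toy cost model for C[M,N] = A[M,K] @ B[K,N] on a pe x pe systolic array,
# with an optional compression ratio applied to off-chip (DRAM) traffic.
import math

def systolic_cost(M, K, N, pe=16, bytes_per_elem=1, compress=1.0):
    """Return (cycles, dram_bytes); every constant here is illustrative."""
    tiles = math.ceil(M / pe) * math.ceil(K / pe) * math.ceil(N / pe)
    # per pe x pe tile: stream pe rows through the array, plus ~2*pe cycles
    # of pipeline fill/drain
    cycles = tiles * (pe + 2 * pe)
    dram = (M * K + K * N + M * N) * bytes_per_elem  # naive: each matrix once
    return cycles, int(dram * compress)

base_cyc, base_bytes = systolic_cost(256, 256, 256)
_, comp_bytes = systolic_cost(256, 256, 256, compress=0.8)
print(base_bytes / comp_bytes)  # ~1.25x less DRAM traffic at 0.8 compression
```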
DOI: 10.1109/SBAC-PAD49847.2020.00013 (published 2020-09-01)
Citations: 9
Reliable and Energy-aware Mapping of Streaming Series-parallel Applications onto Hierarchical Platforms
Changjiang Gou, A. Benoit, Mingsong Chen, L. Marchal, Tongquan Wei
Streaming applications come from various fields such as physics, and many can be represented as a series-parallel dependence graph. We aim at minimizing the energy consumption of such applications when executed on a hierarchical platform, by proposing novel mapping strategies. Dynamic voltage and frequency scaling (DVFS) is used to reduce the energy consumption, and we ensure a reliable execution by either executing a task at maximum speed or by triplicating it. In this paper, we propose a structure rule to partition series-parallel applications, and we prove that the optimization problem is NP-complete. We derive a dynamic programming algorithm for the special case of linear chains, which provides an interesting heuristic and a building block for designing heuristics for the general case. The heuristics' performance is compared to a baseline solution in which each task is executed at maximum speed. Simulations demonstrate that significant energy savings can be obtained.
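A toy rendition of the linear-chain case (my own simplification, not the paper's algorithm): each task either runs at maximum speed, or is triplicated at a lower DVFS frequency, and a small dynamic program picks per-task modes that minimize energy under a deadline:

```python
# Each task of work w has two reliable modes (invented energy model, e ~ w*f^2):
#   fast:   run once at f = 1.0          -> time w,   energy w
#   triple: run 3 copies at f = 0.5      -> time 2w,  energy 0.75w
# DP state: elapsed time -> minimum energy so far along the chain.
import math

def chain_energy(tasks, deadline):
    def modes(w):
        fast = (w / 1.0, w * 1.0 ** 2)
        triple = (w / 0.5, 3 * w * 0.5 ** 2)
        return [fast, triple]

    best = {0: 0.0}
    for w in tasks:
        nxt = {}
        for t, e in best.items():
            for dt, de in modes(w):
                t2 = t + dt
                if t2 <= deadline and e + de < nxt.get(t2, math.inf):
                    nxt[t2] = e + de
        best = nxt
    return min(best.values()) if best else None  # None if deadline infeasible

print(chain_energy([2, 3, 1], deadline=9))  # slack lets one task be triplicated
```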
DOI: 10.1109/SBAC-PAD49847.2020.00026 (published 2020-09-01)
Citations: 0
An Optimal Model for Optimizing the Placement and Parallelism of Data Stream Processing Applications on Cloud-Edge Computing
Felipe Rodrigo de Souza, M. Assunção, E. Caron, A. Veith
The Internet of Things has enabled many application scenarios where a large number of connected devices generate unbounded streams of data, often processed by data stream processing frameworks deployed in the cloud. Edge computing enables offloading processing from the cloud and placing it close to where the data is generated, thereby reducing the time to process data events and deployment costs. However, edge resources are more computationally constrained than their cloud counterparts, raising two interrelated issues, namely deciding on the parallelism of processing tasks (a.k.a. operators) and their mapping onto available resources. In this work, we formulate the scenario of operator placement and parallelism as an optimal mixed-integer linear programming problem. The proposed model is termed Cloud-Edge data Stream Placement (CESP). Experimental results using discrete-event simulation demonstrate that CESP can achieve end-to-end latency at least ≃80% better and monetary costs at least ≃30% better than a traditional cloud deployment.
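The placement decision can be illustrated with a brute-force toy version of the problem the MILP solves: assign each operator of a short pipeline to the edge or the cloud, respect CPU capacities, and charge WAN latency for every edge-cloud crossing (including ingress, since the data originates at the edge). All numbers are invented, and parallelism is ignored:

```python
# Exhaustive search over placements of a 3-operator pipeline; a real MILP
# solver would also choose each operator's parallelism.
from itertools import product

OPS = [("src", 1), ("filter", 2), ("agg", 1)]   # (name, cpu demand)
CAP = {"edge": 4, "cloud": 100}                 # cpu capacity per site
PROC = {"edge": 1.0, "cloud": 0.5}              # ms per cpu unit of work
WAN = 20.0                                      # ms per edge<->cloud crossing

def latency(assign):
    lat = sum(PROC[site] * cpu for (_, cpu), site in zip(OPS, assign))
    # data is produced at the edge, so a cloud-placed first operator
    # also pays one WAN crossing
    hops = (assign[0] == "cloud") + sum(a != b for a, b in zip(assign, assign[1:]))
    return lat + WAN * hops

def best_placement():
    feasible = []
    for assign in product(["edge", "cloud"], repeat=len(OPS)):
        load = {"edge": 0, "cloud": 0}
        for (_, cpu), site in zip(OPS, assign):
            load[site] += cpu
        if all(load[s] <= CAP[s] for s in load):
            feasible.append((latency(assign), assign))
    return min(feasible)

print(best_placement())  # WAN cost keeps this small pipeline at the edge
```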
DOI: 10.1109/SBAC-PAD49847.2020.00019 (published 2020-09-01)
Citations: 10
Performance Analysis and Optimization of the Vector-Kronecker Product Multiplication
Alexandre Azevedo, C. Bentes, Maria Clicia Stelling de Castro, C. Tadonki
The Kronecker product, also called the tensor product, is a fundamental matrix algebra operation used to model complex systems with structured descriptions. This operation needs to be computed efficiently, since it is a critical kernel for iterative algorithms. In this work, we focus on the vector-Kronecker product operation, presenting an in-depth performance analysis of a sequential and a parallel algorithm previously proposed. Based on this analysis, we propose three optimizations: changing the memory access pattern, reducing load imbalance, and manually vectorizing some portions of the code with Intel SSE4.2 intrinsics. The obtained results show better cache usage and load balance, thus improving the performance, especially for larger matrices.
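The reason the vector-Kronecker product can avoid forming the full matrix is the standard identity v^T (A ⊗ B) = vec(B^T V A)^T for v = vec(V) stacked column-major. The sketch below only demonstrates this identity with NumPy; it is not the paper's optimized sequential or SSE4.2 implementation:

```python
# Compare the naive product against the two-small-matmuls form: for A (m x m),
# B (p x p), forming A ⊗ B costs O(m^2 p^2) memory, while the identity needs
# only B^T @ V @ A on a p x m reshape of v.
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((3, 3))
B = rng.random((4, 4))
v = rng.random(12)                     # 12 = 3 * 4

y_naive = v @ np.kron(A, B)            # materializes the 12x12 matrix

V = v.reshape((4, 3), order="F")       # unstack v column-major into p x m
y_fast = (B.T @ V @ A).flatten(order="F")

print(np.allclose(y_naive, y_fast))
```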
DOI: 10.1109/SBAC-PAD49847.2020.00044 (published 2020-09-01)
Citations: 2
Scalable and Efficient Spatial-Aware Parallelization Strategies for Multimedia Retrieval
Guilherme Andrade, George Teodoro, R. Ferreira
Similarity search is a key operation in several multimedia applications, including online Content-Based Multimedia Retrieval (CBMR) services. These applications have to deal with very large databases and must sustain high query rates. In this context, scalability on distributed memory systems is critical to assemble the required computing power and memory space. However, we have identified that the Data Equal Split (DES) parallelization and the associated data partition strategy employed by related works in the domain have limitations in terms of efficiency and scalability. Therefore, in this paper, we develop and implement a framework for similarity search execution on distributed memory machines and propose a novel class of data partition strategies that takes the data's spatial organization into account in its distribution. This approach reduces communication traffic and the costs associated with processing each task in the local searches carried out on the distributed machine. Our approach attained a speedup of 2.4× over DES in the baseline case (5 nodes), achieves higher scalability efficiency, and is 14.5× faster when 160 nodes are used. In fact, our novel data organization led to superlinear scalability in all configurations evaluated.
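The intuition behind spatial-aware partitioning can be shown on a 1-D toy (an invented example, not the paper's partitioner): with spatially coherent partitions, a small-radius query overlaps few partitions, so most nodes can skip the local search entirely:

```python
# Contrast a spatially-blind equal split with a spatially coherent split by
# counting how many partitions a radius query must visit.
def des_split(points, nodes):
    """Round-robin equal split: every partition spans the whole value range."""
    return [points[i::nodes] for i in range(nodes)]

def spatial_split(points, nodes):
    """Sort, then cut into contiguous value ranges."""
    s = sorted(points)
    k = len(s) // nodes
    return [s[i * k:(i + 1) * k] for i in range(nodes)]

def partitions_touched(parts, q, radius):
    return sum(1 for p in parts
               if p and min(p) - radius <= q <= max(p) + radius)

pts = list(range(100))                  # 1-D feature values 0..99
q, r = 10, 3                            # query near the low end
des = partitions_touched(des_split(pts, 4), q, r)
spa = partitions_touched(spatial_split(pts, 4), q, r)
print(des, spa)                         # every DES partition vs. a single one
```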
DOI: 10.1109/SBAC-PAD49847.2020.00027 (published 2020-09-01)
Times cited: 2
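The contrast between Data Equal Split and a spatial-aware partition can be illustrated with a toy sketch. Everything below is illustrative (2-D points standing in for high-dimensional descriptors, a quadrant-based spatial key of our own choosing), not the paper's actual partition strategy:

```python
import math
import random

random.seed(7)

# Toy stand-in for a multimedia descriptor database: 2-D points in the
# unit square, spread across NODES distributed-memory nodes.
points = [(random.random(), random.random()) for _ in range(1000)]
NODES = 4

def des_partition(data, nodes):
    """Data Equal Split (DES): round-robin assignment, ignoring locality,
    so every node must be probed for every query."""
    parts = [[] for _ in range(nodes)]
    for i, p in enumerate(data):
        parts[i % nodes].append(p)
    return parts

def quadrant(p):
    # Hypothetical spatial key: which quadrant of the unit square p lies in.
    return (1 if p[0] >= 0.5 else 0) + 2 * (1 if p[1] >= 0.5 else 0)

def spatial_partition(data, nodes):
    """Spatial-aware split (sketch): co-locate nearby points on one node
    so a query normally needs to probe a single partition."""
    parts = [[] for _ in range(nodes)]
    for p in data:
        parts[quadrant(p)].append(p)
    return parts

def nearest(part, q):
    return min(part, key=lambda p: math.dist(p, q))

query = (0.9, 0.9)

# DES: all 4 partitions are scanned and the per-node results are merged.
des = des_partition(points, NODES)
best_des = min((nearest(part, query) for part in des),
               key=lambda p: math.dist(p, query))

# Spatial: only the partition owning the query's quadrant is scanned.
spa = spatial_partition(points, NODES)
best_spa = nearest(spa[quadrant(query)], query)

print(math.dist(best_des, query), math.dist(best_spa, query))
```

With uniformly distributed points, the single-partition probe almost always returns the true nearest neighbour while scanning a quarter of the data and contacting one node instead of four; a real implementation would route each query to the few partitions whose regions intersect the query ball, which is the source of the reduced communication traffic the abstract describes.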
Energy-Efficient Time Series Analysis Using Transprecision Computing
Ivan Fernandez, Ricardo Quislant, E. Gutiérrez, O. Plata
Time series analysis is a key step in monitoring and predicting events over time in domains such as epidemiology, genomics, medicine, seismology, speech recognition, and economics. Matrix Profile has recently been proposed as a promising technique to perform time series analysis. For each subsequence, the matrix profile provides the most similar neighbour in the time series. This computation requires a huge number of floating-point (FP) operations, which are a major contributor (approximately 50%) to the energy consumption of modern computing platforms. Transprecision computing has recently emerged as a promising approach to improve energy efficiency and performance by tolerating some loss of precision in FP operations. In this work, we study how parallel matrix profile algorithms benefit from transprecision computing using a recently proposed transprecision FPU. This FPU is intended to be integrated into embedded devices as part of RISC-V processors, FPGAs, or ASICs to perform energy-efficient time series analysis. To this end, we propose an accuracy metric to compare the results with the double-precision matrix profile. We use this metric to explore a wide range of exponent and mantissa combinations for a variety of datasets, as well as a mixed-precision and a vectorized approach. Our analysis reveals that energy consumption is reduced by up to 3.3× compared with double-precision approaches, while accuracy is only slightly affected.
DOI: 10.1109/SBAC-PAD49847.2020.00022. Published 2020-09-01.
Times cited: 1
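The precision/accuracy trade-off can be illustrated with a brute-force matrix profile computed once in Python's native double precision and once with every FP operation rounded through IEEE binary32 — a crude stand-in for the paper's configurable transprecision FPU, which explores arbitrary exponent/mantissa widths. The accuracy metric below (mean absolute deviation from the double-precision profile) is our own illustrative choice, not necessarily the authors':

```python
import math
import struct

def to_f32(x):
    # Emulate a reduced-precision FP unit by rounding through IEEE binary32.
    return struct.unpack('f', struct.pack('f', x))[0]

def znorm_dist(a, b, cast=lambda x: x):
    """z-normalized Euclidean distance between two subsequences, with every
    intermediate result passed through `cast` to model the FPU's precision."""
    def znorm(s):
        mu = cast(sum(cast(v) for v in s) / len(s))
        sd = cast(math.sqrt(sum(cast((v - mu) ** 2) for v in s) / len(s)))
        return [cast((v - mu) / sd) for v in s]
    za, zb = znorm(a), znorm(b)
    return cast(math.sqrt(sum(cast((x - y) ** 2) for x, y in zip(za, zb))))

def matrix_profile(ts, m, cast=lambda x: x):
    """Brute-force matrix profile: for each length-m subsequence, the distance
    to its most similar neighbour, excluding trivial (overlapping) matches."""
    n = len(ts) - m + 1
    prof = []
    for i in range(n):
        best = math.inf
        for j in range(n):
            if abs(i - j) < m // 2 + 1:  # exclusion zone around i
                continue
            best = min(best, znorm_dist(ts[i:i + m], ts[j:j + m], cast))
        prof.append(best)
    return prof

# Small synthetic series: a sinusoid with a high-frequency ripple.
ts = [math.sin(0.3 * t) + 0.05 * math.cos(7 * t) for t in range(40)]
p64 = matrix_profile(ts, 8)
p32 = matrix_profile(ts, 8, cast=to_f32)

# Accuracy metric (sketch): mean absolute deviation from double precision.
err = sum(abs(a - b) for a, b in zip(p64, p32)) / len(p64)
print(err)
```

On this toy input the binary32 profile deviates from the double-precision one by a tiny amount, which is the kind of accuracy loss the paper trades for the roughly halved energy cost of narrower FP operations.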
MASA-StarPU: Parallel Sequence Comparison with Multiple Scheduling Policies and Pruning
Rafael A. Lopes, Samuel Thibault, A. Melo
Sequence comparison tools based on the Smith-Waterman (SW) algorithm provide the optimal result but have high execution times when the compared sequences are long, since a huge dynamic programming (DP) matrix must be computed. Block pruning is an optimization that skips computing parts of the DP matrix and can considerably reduce the execution time when the compared sequences are similar. However, block pruning's resulting task graph is dynamic and irregular. Since different pruning scenarios lead to different pruning shapes, we advocate that no single scheduling policy will behave best in all scenarios. This paper proposes MASA-StarPU, a sequence aligner that integrates the domain-specific framework MASA into the generic programming environment StarPU, creating a tool that combines the benefits of StarPU (multiple task scheduling policies) and MASA (fast sequence alignment). MASA-StarPU was executed on two different multicore platforms, and the results show that a bad choice of scheduling policy can have a great impact on performance. For instance, using 24 cores, the 5M × 5M comparison took 1,484s with the dmdas policy, whereas the same comparison took 3,601s with lws. We also show that no single scheduling policy behaves best in all scenarios.
DOI: 10.1109/SBAC-PAD49847.2020.00039. Published 2020-09-01.
Times cited: 2
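The block-pruning idea can be sketched in a few lines of Python. The pruning bound below (best cell on the block's input border plus an all-match extension over the remaining rows) is a simplified stand-in for MASA's actual rule, and the sketch only counts the blocks that could be pruned while still computing every cell, so its score provably matches plain Smith-Waterman:

```python
import random

def sw_score(s1, s2, match=1, mismatch=-1, gap=-1):
    """Plain Smith-Waterman: best local alignment score, one DP row at a time."""
    prev = [0] * (len(s2) + 1)
    best = 0
    for a in s1:
        cur = [0]
        for j, b in enumerate(s2, 1):
            diag = prev[j - 1] + (match if a == b else mismatch)
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
        best = max(best, max(cur))
        prev = cur
    return best

def sw_pruning_stats(s1, s2, match=1, mismatch=-1, gap=-1, block=8):
    """Same recurrence, plus a count of column blocks whose upper bound
    cannot beat the running best score. A real block-pruning implementation
    would skip those cells; this sketch still computes them, so the returned
    score is identical to plain SW by construction."""
    prev = [0] * (len(s2) + 1)
    best, prunable = 0, 0
    for i, a in enumerate(s1):
        remaining = len(s1) - i  # rows left, bounding any future extension
        for jb in range(1, len(s2) + 1, block):
            hi = min(jb + block - 1, len(s2))
            # Even a perfect run of matches from the block's best input cell
            # cannot reach `best`: this block is prunable.
            if max(prev[jb - 1:hi + 1]) + match * remaining <= best:
                prunable += 1
        cur = [0]
        for j, b in enumerate(s2, 1):
            diag = prev[j - 1] + (match if a == b else mismatch)
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
        best = max(best, max(cur))
        prev = cur
    return best, prunable

random.seed(1)
s = ''.join(random.choice('ACGT') for _ in range(120))
t = s[:100] + 'TTTT'  # highly similar pair: the case where pruning pays off
score, prunable = sw_pruning_stats(s, t)
print(score, prunable)
```

Because the sequences share a long prefix, the running best score climbs quickly along the main diagonal and large off-diagonal regions become prunable in later rows; with dissimilar inputs almost nothing would be pruned, which is exactly why the resulting task graph is dynamic and irregular and why no single StarPU scheduling policy wins in every scenario.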