
Latest publications: 2022 30th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP)

A parallel software pipeline to select relevant genes for pathway enrichment
Giuseppe Agapito, M. Cannataro
The continuous development of experimental omics technologies such as microarrays allows large-scale genomics studies to be performed. After the initial enthusiasm, it became clear that even the results provided by microarrays in the form of lists of differentially expressed genes (DEGs) were almost as enigmatic as the first sequence of the genome, because these lists of DEGs are detached from the biological mechanisms they influence. Pathway enrichment analysis (PEA) gives researchers the clues needed to link DEGs to the affected biological pathways and, consequently, to the underlying biological mechanisms and processes. Putting DEG data sets into a format suitable for PEA can be a tedious, error-prone, and laborious process even for bioinformaticians, who must perform it manually before the PEA can start. To fill this gap, we present a parallel software pipeline that takes as input a list of DEGs and automatically returns the enriched pathways. The pipeline is implemented in Python and provides the following automated actions: i) parallel splitting of the DEGs into groups; ii) parallel building of the similarity matrices related to the DEG groups; iii) parallel mapping of the similarity matrices into networks; iv) parallel pathway enrichment analysis for each group of identified DEGs. Preliminary results show that the pipeline helps analyze DEGs and generates, in a few minutes, a list of pathway enrichment results that would otherwise require many hours of manual work and several different scripts. The pipeline provides a two-fold benefit: first, it speeds up the computation of pathway enrichment by automating several steps currently performed manually; second, it provides a more specific list of DEGs for computing pathway enrichment, improving the relevance and significance of the enriched pathways.
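The abstract does not include code; the following is a minimal sketch of how such a four-stage pipeline could be parallelised in Python with multiprocessing, one worker per DEG group. All names (split_degs, similarity_matrix, to_network, enrich) and the toy similarity measure are hypothetical placeholders, not the authors' API.

```python
# Hypothetical sketch of a parallel DEG-processing pipeline (not the authors' code).
from multiprocessing import Pool
from itertools import combinations

def split_degs(degs, n_groups):
    """Step i: split the DEG list into n_groups roughly equal chunks."""
    return [degs[i::n_groups] for i in range(n_groups)]

def similarity_matrix(group):
    """Step ii: toy pairwise similarity (placeholder for a real gene-similarity measure)."""
    return {(a, b): 1.0 if a[0] == b[0] else 0.0 for a, b in combinations(group, 2)}

def to_network(matrix, threshold=0.5):
    """Step iii: keep only edges above a similarity threshold."""
    return [pair for pair, score in matrix.items() if score >= threshold]

def enrich(edges):
    """Step iv: placeholder pathway-enrichment call for one DEG group."""
    return {"n_edges": len(edges)}

def process_group(group):
    return enrich(to_network(similarity_matrix(group)))

if __name__ == "__main__":
    degs = ["TP53", "BRCA1", "BRCA2", "EGFR", "KRAS", "MYC"]
    groups = split_degs(degs, n_groups=2)
    with Pool() as pool:                      # steps ii-iv run in parallel, one group per worker
        results = pool.map(process_group, groups)
    print(results)
```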
Citations: 1
Load Balancing of the Parallel Execution of Two Dimensional Partitioned Cellular Automata
Andrea Giordano, Francesca Amelia, Salvatore Gigliotti, R. Rongo, W. Spataro
Load balancing generally refers to the technique of properly partitioning computation among processing elements in order to achieve optimal resource usage and thus reduce computation time. In this paper, we present a dynamic load balancing application in the context of the parallel execution of cellular automata, where the domain space is partitioned into two-dimensional regions assigned to different processing elements. Starting from general closed-form expressions that compute the optimal workload assignment dynamically when partitioning takes place along only one dimension, we extend the procedure to allow partitioning and balancing along both dimensions. As confirmed by the experimental results, two-dimensional partitioning by itself speeds up the execution, and further improvements are obtained when the load balancing occurs along both dimensions.
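The paper's closed-form expressions are not reproduced in the abstract; as an illustration only, the sketch below cuts a 2D cell grid into a P x Q worker grid so that each strip carries roughly the same cumulative per-row or per-column load. The per-cell load model and the proportional-cut heuristic are assumptions.

```python
# Hypothetical 2D partitioning sketch: choose row and column cuts so that the
# cumulative per-row / per-column load is balanced across a P x Q worker grid.
import numpy as np

def balanced_cuts(load_1d, parts):
    """Return cut indices so each part holds roughly 1/parts of the total 1D load."""
    cum = np.cumsum(load_1d)
    targets = [(k + 1) * cum[-1] / parts for k in range(parts - 1)]
    return [int(np.searchsorted(cum, t)) + 1 for t in targets]

def partition_2d(cell_load, P, Q):
    """cell_load: 2D array of per-cell cost; returns row and column cut positions."""
    row_cuts = balanced_cuts(cell_load.sum(axis=1), P)
    col_cuts = balanced_cuts(cell_load.sum(axis=0), Q)
    return row_cuts, col_cuts

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    load = rng.random((100, 100))          # toy per-cell workload
    print(partition_2d(load, P=4, Q=2))    # where to cut rows and columns
```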
Citations: 0
Design and Evaluation of Multi-threaded Optimizations for Individual MPI I/O Operations
Raafat Feki, E. Gabriel
Today's high-end parallel clusters are architecturally very complex. Most large-scale applications nowadays use multiple parallel programming paradigms to achieve the required scalability, with MPI+threads being the most common approach. Yet, as of today, there is no parallel I/O library that matches this hybrid programming model: file I/O operations are typically executed by a single thread per process. This paper explores multi-threaded optimizations for individual MPI I/O operations, an important step towards matching the execution model of modern parallel applications. We describe the changes required to the internal processing in the MPI I/O library as well as to the file access phase, and demonstrate the performance improvement of the redesigned functions over the original, single-threaded version using multiple benchmarks, on multiple platforms, and for many scenarios.
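To make the idea concrete, here is a minimal, hypothetical sketch (using mpi4py, not the library modified in the paper) of one MPI process driving a single file write with several threads, each issuing an independent write on a disjoint byte range. It assumes an MPI installation that actually provides MPI_THREAD_MULTIPLE; the file name and sizes are made up.

```python
# Hypothetical sketch: one MPI process writes its buffer with several threads,
# each calling MPI_File_write_at on a disjoint slice. Not the paper's library.
import mpi4py
mpi4py.rc.thread_level = "multiple"        # request MPI_THREAD_MULTIPLE before init
from mpi4py import MPI
import numpy as np
from threading import Thread

N_THREADS = 4

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
data = np.full(1 << 20, rank, dtype=np.uint8)        # 1 MiB per process (toy size)

fh = MPI.File.Open(comm, "out.bin", MPI.MODE_CREATE | MPI.MODE_WRONLY)
chunk = data.size // N_THREADS
base = rank * data.size                               # byte offset of this rank's region

def write_slice(t):
    lo = t * chunk
    fh.Write_at(base + lo, data[lo:lo + chunk])       # each thread writes a disjoint range

threads = [Thread(target=write_slice, args=(t,)) for t in range(N_THREADS)]
for th in threads: th.start()
for th in threads: th.join()
fh.Close()
```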
Citations: 1
SeRSS: a storage mesh architecture to build serverless reliable storage services
D. Carrizales-Espinoza, Dante D. Sánchez-Gallegos, J. L. González-Compeán, J. Carretero, R. Marcelín-Jiménez
Cloud storage has been the solution for organizations to manage the exponential growth of data observed over the past few years. However, end-users still suffer from the side-effects of cloud service outages, which particularly affect edge-fog-cloud environments. This paper presents SeRSS, a storage mesh architecture to create and operate reliable, configurable, and flexible serverless storage services for heterogeneous infrastructures. A case study was conducted based on the on-the-fly building of storage services to manage medical imagery. The experimental evaluation revealed the efficiency of SeRSS in managing and storing data reliably on heterogeneous infrastructures.
Citations: 3
A Scalable Architecture Exploiting Elastic Stack and Meta Ensemble of Classifiers for Profiling User Behaviour
G. Folino, Carla Otranto Godano, F. S. Pisani
Large user and application logs are generated and stored by many organisations at a rate that makes them really hard to analyse, especially in real time. In particular, in the field of cybersecurity, it is of great interest to quickly analyse user logs coming from different and heterogeneous sources in order to prevent data breaches caused by user behaviour. In addition, part of the data, or some sources in their entirety, are often missing. To overcome these issues, we propose a framework based on the Elastic Stack (ELK) that processes and stores log data coming from different users and applications and generates an ensemble of classifiers in order to classify user behaviour and, eventually, detect anomalies. The system exploits the scalable architecture of ELK by running on top of a Kubernetes platform and adopts a distributed evolutionary algorithm to classify users on the basis of their digital footprints, derived from many data sources. Preliminary experiments show that the system is effective in classifying the behaviour of different users and that this can be considered an auxiliary task for detecting anomalies in their behaviour, helping to reduce the number of false alarms.
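The framework itself is not shown in the abstract; the small, hypothetical sketch below only illustrates the "meta ensemble" idea, with one classifier per log source voting and any source missing for a given user simply skipped. The per-source classifiers, thresholds, and majority-vote rule are all assumptions, not the authors' design.

```python
# Hypothetical meta-ensemble sketch: one classifier per log source, majority
# vote over the sources that are actually present for a given user.
from collections import Counter

def classify_web(features):  return "normal" if features.get("logins", 0) < 50 else "anomalous"
def classify_mail(features): return "normal" if features.get("sent", 0) < 200 else "anomalous"
def classify_vpn(features):  return "normal" if features.get("sessions", 0) < 10 else "anomalous"

SOURCE_CLASSIFIERS = {"web": classify_web, "mail": classify_mail, "vpn": classify_vpn}

def meta_classify(user_logs):
    """user_logs: dict source -> feature dict; missing sources are ignored."""
    votes = [clf(user_logs[src]) for src, clf in SOURCE_CLASSIFIERS.items() if src in user_logs]
    if not votes:
        return "unknown"
    return Counter(votes).most_common(1)[0][0]

if __name__ == "__main__":
    # The 'vpn' source is missing for this user, so only the web and mail classifiers vote.
    print(meta_classify({"web": {"logins": 12}, "mail": {"sent": 320}}))
```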
Citations: 1
A Neural Network to Estimate Isolated Performance from Multi-Program Execution
Manel Lurbe, Josué Feliu, S. Petit, M. E. Gómez, J. Sahuquillo
When multiple applications run on a platform with shared resources, such as a multicore CPU, the behaviour of each running application can be altered by its co-runners. In this case, the system resources need to be managed (e.g., by repartitioning the cache space, re-scheduling applications on distinct cores, or modifying the prefetcher configuration) to reduce inter-application interference and thus minimize the performance loss with respect to isolated execution. In this context, a main challenge in computing scenarios such as the public cloud or soft real-time systems is knowing the performance impact of a given management action on each application relative to its isolated execution. With this aim, in this work we present a neural network-based approach that estimates, from multi-program executions, the performance an application would have had in isolation. Experimental results show that the proposal dynamically adapts to changes in application behavior. On average, the predicted performance shows an error of 11.7% (MAPE) and 2.3% (MSE).
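For reference, the two error metrics quoted (MAPE and MSE) compare predicted and measured isolated performance as sketched below; the IPC values are invented for illustration and are not the paper's data.

```python
# Illustrative computation of the two reported error metrics (toy data, not the paper's).
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))

def mse(y_true, y_pred):
    """Mean squared error."""
    return np.mean((y_true - y_pred) ** 2)

if __name__ == "__main__":
    isolated_ipc  = np.array([1.80, 2.10, 0.95, 1.40])   # measured alone (toy values)
    predicted_ipc = np.array([1.70, 2.25, 1.00, 1.35])   # predicted from co-run execution
    print(f"MAPE = {mape(isolated_ipc, predicted_ipc):.1f}%  MSE = {mse(isolated_ipc, predicted_ipc):.4f}")
```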
Citations: 0
NAS Parallel Benchmark Kernels with Python: A performance and programming effort analysis focusing on GPUs
D. D. Domenico, G. H. Cavalheiro, J. F. Lima
GPU devices are currently seen as one of the trending topics in parallel computing. Commonly, GPU applications are developed with programming tools based on compiled languages, such as C/C++ and Fortran. This paper presents a performance and programming effort analysis that employs the Python high-level language to implement the NAS Parallel Benchmark kernels targeting GPUs. We used the Numba environment to enable CUDA support in Python, a tool that allows us to implement a GPU application with pure Python code. Our experimental results show that, for most NPB kernels, the Python applications reached performance similar to C++ programs using CUDA and better than C++ using OpenACC. Furthermore, the Python codes required fewer GPU-framework-related operations than CUDA, mainly because Python needs fewer statements to manage memory allocations and data transfers. However, our Python versions demanded more operations than the OpenACC implementations.
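As a flavour of what "pure Python code" with Numba's CUDA support looks like, here is a generic vector-add kernel, not one of the actual NPB kernels from the paper. It requires a CUDA-capable GPU and the numba package; the launch configuration is arbitrary.

```python
# Minimal Numba-CUDA example in the spirit of the paper (generic vector add,
# not an actual NPB kernel). Requires a CUDA-capable GPU and the numba package.
import numpy as np
from numba import cuda

@cuda.jit
def vec_add(a, b, out):
    i = cuda.grid(1)                      # global thread index
    if i < out.size:
        out[i] = a[i] + b[i]

n = 1 << 20
a = np.random.rand(n).astype(np.float32)
b = np.random.rand(n).astype(np.float32)

d_a, d_b = cuda.to_device(a), cuda.to_device(b)   # explicit host-to-device transfers
d_out = cuda.device_array_like(a)

threads = 256
blocks = (n + threads - 1) // threads
vec_add[blocks, threads](d_a, d_b, d_out)         # kernel launch configuration

print(d_out.copy_to_host()[:4], (a + b)[:4])      # device-to-host transfer and check
```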
Citations: 4
Mitigating Transceiver and Token Controller Permanent Faults in Wireless Network-on-Chip
Navonil Chatterjee, Marcelo Ruaro, Kevin J. M. Martin, J. Diguet
Conventional wired Network-on-Chip (NoC) designs suffer from performance degradation due to multi-hop long-distance communication. To address this problem, over the past decade researchers have focused on Wireless NoC (WiNoC), which has evolved into a viable solution that mitigates this communication bottleneck by using single-hop long-range wireless links. However, many researchers have reported that these interconnects may fail due to the complexity of their implementation. Although a few works in the literature tackle faults in WiNoC, none of them provides a comprehensive study of channel access mechanisms in the presence of faults. To fill this gap, we propose a fault-aware WiNoC architecture. We discuss two types of faults in wireless interconnects, namely transceiver faults and token controller faults, and provide different fault-tolerant techniques to deal with them. The proposed FTWiNoC achieves, on average, 17.8% and 8.9% improvements in latency compared with two different fault mitigation strategies from the literature.
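The fault-tolerance mechanisms themselves are not detailed in the abstract; the toy simulation below only illustrates the general idea of token-based channel access in which hubs whose token controller is marked faulty are bypassed. The skip-on-fault policy and the round-robin order are assumptions, not the FTWiNoC design.

```python
# Toy token-passing sketch: wireless hubs take turns on the shared channel,
# and hubs whose token controller is marked faulty are bypassed.
# Purely illustrative; not the FTWiNoC mechanism from the paper.

def token_schedule(n_hubs, faulty, rounds):
    """Yield the hub allowed to transmit at each step, skipping faulty hubs."""
    order = [h for h in range(n_hubs) if h not in faulty]
    for step in range(rounds * len(order)):
        yield order[step % len(order)]

if __name__ == "__main__":
    # 4 wireless hubs, hub 2 has a permanent token-controller fault.
    for hub in token_schedule(n_hubs=4, faulty={2}, rounds=2):
        print(f"token at hub {hub}")
```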
Citations: 1
Parallel integer multiplication
Vivien Samuel
Multiplication is a fundamental step in many algorithms. If the multiplication of two integers of n words has complexity M(n), divisions and squares can also be computed in O(M(n)), and the greatest common divisor can be computed in O(M(n) log n). Thus, keeping M(n) small is extremely important. To this day, the best known algorithm for practically reachable sizes is the Schönhage-Strassen algorithm, which is implemented by a few arithmetic libraries. Asymptotically faster algorithms exist, but no computer can hold numbers big enough for those algorithms to outrun Schönhage-Strassen. The GNU Multiple Precision (GMP) library has a sequential-only implementation of Schönhage-Strassen. However, some algorithms contain a step which is a single big multiplication; when trying to parallelize such an algorithm, one needs a parallel multiplication algorithm. An example is the batch factorization in the Number Field Sieve. People trying to implement a parallel version of such algorithms therefore need an arithmetic library that implements a parallel integer multiplication. One example is the FLINT (Fast Library for Number Theory) library, which contains a parallel implementation of Schönhage-Strassen. In this article we present an implementation of Schönhage-Strassen that reaches a speedup of 20 for the multiplication of two integers of 10^7 words of 64 bits on a Xeon Gold with 32 cores.
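The sketch below is not Schönhage-Strassen (no FFT); it only illustrates how one large product can be split into independent sub-products, here a single Karatsuba step whose three partial products run in separate processes, relying on Python's built-in big integers. It is an assumption-laden toy, not the paper's implementation.

```python
# Illustrative parallel big-integer multiply: one Karatsuba split whose three
# sub-products run in separate processes. Not Schönhage-Strassen, just a sketch
# of parallelising a single large multiplication.
from multiprocessing import Pool
from operator import mul
import random

def karatsuba_parallel(x, y, bits):
    half = bits // 2
    mask = (1 << half) - 1
    x_hi, x_lo = x >> half, x & mask
    y_hi, y_lo = y >> half, y & mask
    with Pool(3) as pool:                 # the three partial products run concurrently
        hi, lo, mid = pool.starmap(mul, [(x_hi, y_hi), (x_lo, y_lo), (x_hi + x_lo, y_hi + y_lo)])
    return (hi << (2 * half)) + ((mid - hi - lo) << half) + lo

if __name__ == "__main__":
    bits = 1 << 16
    a, b = random.getrandbits(bits), random.getrandbits(bits)
    assert karatsuba_parallel(a, b, bits) == a * b
    print("ok")
```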
Citations: 0
An efficient compilation of coarse-grained reconfigurable architectures utilizing pre-optimized sub-graph mappings
Ayaka Ohwada, Takuya Kojima, H. Amano
In recent years, IoT devices have become widespread, and energy-efficient coarse-grained reconfigurable architectures (CGRAs) have attracted attention. CGRAs comprise several processing units, called processing elements (PEs), arranged in a two-dimensional array. The operations of the PEs and the interconnections between them are adaptively changed depending on the target application, which yields higher energy efficiency than general-purpose processors. The application kernel executed on a CGRA is represented as a data flow graph (DFG), and CGRA compilers are responsible for mapping the DFG onto the PE array. Mapping algorithms therefore significantly influence the performance and power efficiency of CGRAs as well as the compile time. This paper proposes POCOCO, a compiler framework for CGRAs that can use pre-optimized subgraph mappings, reducing the compiler's optimization effort. To leverage the subgraph mappings, we extend an existing mapping method based on a genetic algorithm. Experiments on three architectures demonstrated that the proposed method reduces the optimization time by 48% on average for the best case of the three architectures.
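The mapping algorithm itself is not given in the abstract; the deliberately small, hypothetical genetic-algorithm sketch below maps DFG nodes onto a 2D PE array by minimising total Manhattan distance between communicating nodes. The fitness function, GA parameters, and toy DFG are assumptions and do not reflect the paper's cost model or its use of pre-optimized sub-graph mappings.

```python
# Hypothetical GA sketch: map DFG nodes onto a 2D PE array, minimising the
# total Manhattan distance between communicating nodes (toy fitness function).
import random

ROWS, COLS = 4, 4
PES = [(r, c) for r in range(ROWS) for c in range(COLS)]
DFG_EDGES = [(0, 1), (1, 2), (2, 3), (0, 4), (4, 3)]   # toy data-flow graph
N_NODES = 5

def cost(mapping):
    return sum(abs(mapping[a][0] - mapping[b][0]) + abs(mapping[a][1] - mapping[b][1])
               for a, b in DFG_EDGES)

def random_mapping():
    return random.sample(PES, N_NODES)      # one distinct PE per DFG node

def mutate(mapping):
    child = list(mapping)
    i = random.randrange(N_NODES)
    child[i] = random.choice([pe for pe in PES if pe not in child])   # move one node to a free PE
    return child

def evolve(pop_size=30, generations=100):
    population = [random_mapping() for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=cost)                        # best (lowest-cost) mappings first
        survivors = population[: pop_size // 2]
        population = survivors + [mutate(random.choice(survivors)) for _ in survivors]
    return min(population, key=cost)

if __name__ == "__main__":
    best = evolve()
    print(cost(best), best)
```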
Citations: 2