Pub Date: 2022-03-01, DOI: 10.1109/pdp55904.2022.00041
Giuseppe Agapito, M. Cannataro
The continuous development of experimental omics technologies such as microarrays makes it possible to perform large-scale genomics studies. After the initial enthusiasm, it became clear that the results provided by microarrays, in the form of lists of differentially expressed genes (DEGs), were nearly as enigmatic as the first genome sequence, because these lists are detached from the biological mechanisms they influence. Pathway enrichment analysis (PEA) gives researchers the clues needed to link DEGs to the affected biological pathways and, consequently, to the underlying biological mechanisms and processes. Putting DEG data sets into a suitable format for PEA can be a tedious, error-prone, and laborious process even for bioinformaticians, who must perform it manually before PEA can start. To fill this gap, we present a parallel software pipeline that takes a list of DEGs as input and automatically returns the enriched pathways. The pipeline is implemented in Python and provides the following automated actions: i) parallel splitting of DEGs into groups; ii) parallel building of the similarity matrices for the DEG groups; iii) parallel mapping of the similarity matrices to networks; iv) parallel pathway enrichment analysis for each group of identified DEGs. Preliminary results show that the pipeline can help to analyze DEGs and generate, in a few minutes, a list of pathway enrichment results that would otherwise require many hours of manual work and several different scripts. The pipeline provides a two-fold benefit: first, it speeds up the computation of pathway enrichment by automating several steps currently performed manually; second, it provides a more specific list of DEGs for calculating pathway enrichment, which helps improve the relevance and significance of the enriched pathways.
Title: A parallel software pipeline to select relevant genes for pathway enrichment. Venue: 2022 30th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP).
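The four automated actions listed in the abstract can be sketched with the Python standard library. This is only an illustration under stated assumptions, not the authors' pipeline: the round-robin grouping, the Jaccard similarity over hypothetical per-gene annotation sets, the 0.5 edge threshold, and the hypergeometric enrichment test are all choices made here for brevity, and every function name is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations
from math import comb


def split_degs(degs, n_groups):
    """Step i: split the DEG list into n_groups round-robin groups (assumed rule)."""
    return [degs[i::n_groups] for i in range(n_groups)]


def similarity_matrix(group, annotations):
    """Step ii: pairwise Jaccard similarity of hypothetical annotation sets."""
    sim = {}
    for a, b in combinations(group, 2):
        sa, sb = annotations.get(a, set()), annotations.get(b, set())
        union = sa | sb
        sim[(a, b)] = len(sa & sb) / len(union) if union else 0.0
    return sim


def to_network(sim, threshold=0.5):
    """Step iii: keep gene pairs whose similarity reaches the threshold as edges."""
    return {pair for pair, score in sim.items() if score >= threshold}


def enrichment_p(genes, pathway, universe_size):
    """Hypergeometric P(X >= k): chance that `genes` overlap `pathway` this much."""
    n, big_k = len(genes), len(pathway)
    k = len(set(genes) & pathway)
    tail = sum(comb(big_k, i) * comb(universe_size - big_k, n - i)
               for i in range(k, min(n, big_k) + 1))
    return tail / comb(universe_size, n)


def enrich_network(net, pathways, universe_size):
    """Step iv: score every pathway against the genes that survived in the network."""
    genes = sorted({g for edge in net for g in edge})
    if not genes:
        return {}
    return {name: enrichment_p(genes, members, universe_size)
            for name, members in pathways.items()}


def run_pipeline(degs, annotations, pathways, universe_size, n_groups=2):
    groups = split_degs(degs, n_groups)
    with ThreadPoolExecutor() as pool:
        # Each stage is mapped over the groups; groups are independent of each other.
        sims = list(pool.map(lambda g: similarity_matrix(g, annotations), groups))
        nets = list(pool.map(to_network, sims))
        return list(pool.map(lambda n: enrich_network(n, pathways, universe_size), nets))
```

A thread pool keeps the sketch short; for CPU-bound work in CPython a process pool would more likely be needed to obtain real parallel speedup.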
Pub Date: 2022-03-01, DOI: 10.1109/pdp55904.2022.00039
Andrea Giordano, Francesca Amelia, Salvatore Gigliotti, R. Rongo, W. Spataro
Load balancing generally refers to the technique of partitioning computation among processing elements so as to achieve optimal resource usage and thus reduce computation time. In this paper, we present a dynamic load balancing application in the context of the parallel execution of Cellular Automata, where the domain space is partitioned into two-dimensional regions that are assigned to different processing elements. Starting from general closed-form expressions that give the optimal workload assignment dynamically when partitioning takes place along only one dimension, we extend the procedure to allow partitioning and balancing along both dimensions. As confirmed by the experimental results, two-dimensional partitioning by itself speeds up the execution, and further improvements are obtained when load balancing occurs along both dimensions.
Title: Load Balancing of the Parallel Execution of Two Dimensional Partitioned Cellular Automata. Venue: 2022 30th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP).
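The idea of rebalancing along one dimension and then along both can be illustrated with a small sketch. It does not reproduce the closed-form expressions the abstract refers to; instead it assumes a measured cost per cell and uses a simple greedy prefix-sum rule, and the names `balance_1d` and `balance_2d` are hypothetical.

```python
from itertools import accumulate


def balance_1d(costs, parts):
    """Return `parts` contiguous (start, end) slices of `costs` with near-equal sums."""
    prefix = list(accumulate(costs))
    total = prefix[-1]
    bounds, start = [], 0
    for p in range(1, parts):
        target = total * p / parts
        end = start + 1  # every part gets at least one column
        # Grow the slice until the running sum reaches the target,
        # leaving at least one column for each remaining part.
        while end < len(costs) - (parts - p) and prefix[end - 1] < target:
            end += 1
        bounds.append((start, end))
        start = end
    bounds.append((start, len(costs)))
    return bounds


def balance_2d(grid, row_parts, col_parts):
    """Balance rows first, then balance columns inside each row band."""
    row_costs = [sum(row) for row in grid]
    regions = []
    for r0, r1 in balance_1d(row_costs, row_parts):
        col_costs = [sum(grid[r][c] for r in range(r0, r1))
                     for c in range(len(grid[0]))]
        for c0, c1 in balance_1d(col_costs, col_parts):
            regions.append((r0, r1, c0, c1))
    return regions
```

Calling `balance_1d([3, 1, 1, 1, 3, 3], 2)` gives two slices of cost 6 each. In a dynamic setting the cost array would be refreshed from timing measurements between Cellular Automata steps, which is an assumption of this sketch.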
Pub Date: 2022-03-01, DOI: 10.1109/pdp55904.2022.00027
Raafat Feki, E. Gabriel
Today's high-end parallel clusters are architecturally very complex. Most large-scale applications now use multiple parallel programming paradigms to achieve the required scalability, with MPI+threads being the most common approach. Yet, as of today, there is no parallel I/O library that matches this hybrid programming model: file I/O operations are typically executed by a single thread in each process. This paper explores multi-threaded optimizations for individual MPI I/O operations, an important step towards matching the execution model of modern parallel applications. We describe the changes necessary to the internal processing in the MPI I/O library as well as to the file access phase. We demonstrate the performance improvement of the redesigned functions over the original, single-threaded version using multiple benchmarks, on multiple platforms, and in many scenarios.
Title: Design and Evaluation of Multi-threaded Optimizations for Individual MPI I/O Operations. Venue: 2022 30th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP).
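The general idea of letting several threads serve one individual write can be sketched outside MPI. The example below is not the MPI I/O library change the paper describes; it is a plain POSIX illustration, assuming a Unix-like system where `os.pwrite` is available, and `threaded_write` is a hypothetical helper name.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def threaded_write(path, data, offset=0, n_threads=4):
    """Write `data` at `offset`, giving each thread one contiguous slice."""
    chunk = -(-len(data) // n_threads)  # ceiling division
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        def write_slice(i):
            piece = data[i * chunk:(i + 1) * chunk]
            # os.pwrite takes an explicit offset, so the threads never
            # compete for a shared file position.
            return os.pwrite(fd, piece, offset + i * chunk) if piece else 0

        with ThreadPoolExecutor(max_workers=n_threads) as pool:
            # Sum of bytes written; a robust version would retry short writes.
            return sum(pool.map(write_slice, range(n_threads)))
    finally:
        os.close(fd)
```

Positional writes are the design choice that makes the slices independent: each thread owns a disjoint byte range of the same request.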
Pub Date: 2022-03-01, DOI: 10.1109/pdp55904.2022.00022
D. Carrizales-Espinoza, Dante D. Sánchez-Gallegos, J. L. González-Compeán, J. Carretero, R. Marcelín-Jiménez
Cloud storage has been the solution for organizations to manage the exponential growth of data observed over the past few years. However, end-users still suffer from the side effects of cloud service outages, which particularly affect edge-fog-cloud environments. This paper presents SeRSS, a storage mesh architecture to create and operate reliable, configurable, and flexible serverless storage services for heterogeneous infrastructures. A case study was conducted based on the on-the-fly building of storage services to manage medical imagery. The experimental evaluation revealed the efficiency of SeRSS in managing and storing data reliably on heterogeneous infrastructures.
Title: SeRSS: a storage mesh architecture to build serverless reliable storage services. Venue: 2022 30th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP).
Pub Date: 2022-03-01, DOI: 10.1109/pdp55904.2022.00037
G. Folino, Carla Otranto Godano, F. S. Pisani
Many organisations generate and store large user and application logs at a rate that makes them hard to analyse, especially in real time. In the field of cybersecurity in particular, it is of great interest to quickly analyse user logs coming from different, heterogeneous sources in order to prevent data breaches caused by user behaviour. In addition, part of the data, or even entire sources, are often missing. To overcome these issues, we propose a framework based on the Elastic Stack (ELK) that processes and stores log data coming from different users and applications and generates an ensemble of classifiers to classify user behaviour and, eventually, to detect anomalies. The system exploits the scalable architecture of ELK by running on top of a Kubernetes platform and adopts a distributed evolutionary algorithm to classify users on the basis of their digital footprints, derived from many sources of data. Preliminary experiments show that the system is effective in classifying the behaviour of different users and that this classification can serve as an auxiliary task for detecting anomalies in their behaviour, helping to reduce the number of false alarms.
Title: A Scalable Architecture Exploiting Elastic Stack and Meta Ensemble of Classifiers for Profiling User Behaviour. Venue: 2022 30th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP).
Pub Date: 2022-03-01, DOI: 10.1109/pdp55904.2022.00018
Manel Lurbe, Josué Feliu, S. Petit, M. E. Gómez, J. Sahuquillo
When multiple applications run on a platform with shared resources, such as a multicore CPU, the behaviour of a running application can be altered by its co-runners. In this case, the system resources need to be managed (e.g. by repartitioning the cache space, rescheduling applications onto different cores, or modifying the prefetcher configuration) to reduce inter-application interference and so minimize the performance loss relative to isolated execution. In this context, a main challenge in computing scenarios such as the public cloud or soft real-time systems is knowing the performance impact of a given management action on each application with respect to its isolated execution. To this end, we present a neural network-based approach that estimates, from multi-program executions, the performance an application would have had in isolation. Experimental results show that the proposal dynamically adapts to changes in application behaviour. On average, the predicted performance shows an error of 11.7% in terms of MAPE and 2.3% in terms of MSE.
Title: A Neural Network to Estimate Isolated Performance from Multi-Program Execution. Venue: 2022 30th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP).
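For readers unfamiliar with the two error metrics quoted in the abstract, the sketch below computes them with their textbook definitions. The abstract does not say how the MSE was normalized to a percentage, so that step is left out, and the sample values are hypothetical rather than taken from the paper.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent (assumes no actual value is 0)."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def mse(actual, predicted):
    """Mean squared error, in squared units of the measured quantity."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical isolated-performance values and model predictions.
isolated = [2.0, 1.0, 4.0]
estimated = [1.8, 1.1, 4.0]
print(round(mape(isolated, estimated), 2), round(mse(isolated, estimated), 4))
```

MAPE is scale-free, which makes it convenient for comparing applications with very different performance levels, while MSE penalizes large individual errors more heavily.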
Pub Date: 2022-03-01, DOI: 10.1109/pdp55904.2022.00013
D. D. Domenico, G. H. Cavalheiro, J. F. Lima
GPU devices are currently one of the trending topics in parallel computing. GPU applications are commonly developed with programming tools based on compiled languages such as C/C++ and Fortran. This paper presents a performance and programming effort analysis that uses the high-level language Python to implement the NAS Parallel Benchmark (NPB) kernels targeting GPUs. We used the Numba environment to enable CUDA support in Python; this tool allows a GPU application to be implemented in pure Python code. Our experimental results showed that the Python applications reached performance similar to C++ programs using CUDA and better than C++ using OpenACC for most NPB kernels. Furthermore, the Python codes required fewer operations related to the GPU framework than CUDA, mainly because Python needs fewer statements to manage memory allocations and data transfers. However, our Python versions demanded more operations than the OpenACC implementations.
Title: NAS Parallel Benchmark Kernels with Python: A performance and programming effort analysis focusing on GPUs. Venue: 2022 30th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP).
Pub Date: 2022-03-01, DOI: 10.1109/pdp55904.2022.00045
Navonil Chatterjee, Marcelo Ruaro, Kevin J. M. Martin, J. Diguet
Conventional wired Network-on-Chip (NoC) designs suffer from performance degradation due to multi-hop long-distance communication. To address this problem, over the past decade researchers have investigated Wireless NoC (WiNoC), which has evolved as a viable solution that mitigates this communication bottleneck by using single-hop long-range wireless links. However, many researchers have reported that these interconnects may fail because of their implementation complexity. Although a few works in the literature tackle faults in WiNoC, none of them provides a comprehensive study of channel access mechanisms in the presence of faults. To fill this gap, we propose a fault-aware WiNoC architecture. We discuss two types of faults in wireless interconnects, namely transceiver faults and token controller faults, and provide different fault-tolerant techniques to deal with them. The proposed FTWiNoC presents, on average, 17.8% and 8.9% improvement in latency compared to two different fault mitigation strategies from the literature.
Title: Mitigating Transceiver and Token Controller Permanent Faults in Wireless Network-on-Chip. Venue: 2022 30th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP).
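The abstract does not detail its fault-tolerant techniques, so the following is only a generic illustration of token-based channel access in the presence of permanent transceiver faults: the token moves around a ring of wireless hubs and skips any hub marked faulty. The ring order, the fault set, and the name `next_token_holder` are all assumptions of this sketch, not the FTWiNoC design.

```python
def next_token_holder(current, n_hubs, faulty):
    """Return the next healthy hub after `current` in ring order, or None if none is left."""
    for step in range(1, n_hubs + 1):
        candidate = (current + step) % n_hubs
        if candidate not in faulty:
            return candidate
    return None


# Hypothetical 4-hub ring in which hubs 1 and 2 have failed transceivers.
holder = 0
for _ in range(3):
    holder = next_token_holder(holder, 4, faulty={1, 2})
    print(holder)
```

With hubs 1 and 2 faulty, the token alternates between hubs 3 and 0, so the healthy hubs keep access to the wireless channel.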
Pub Date: 2022-03-01, DOI: 10.1109/pdp55904.2022.00024
Vivien Samuel
Multiplication is a fundamental step in many algorithms. If the multiplication of two integers of n words has a complexity of M(n), then divisions and squares can also be computed in O(M(n)), and the greatest common divisor can be computed in O(M(n) log n). Keeping M(n) small is therefore extremely important. To this day, the best known algorithm for practically reachable sizes is the Schönhage-Strassen algorithm, which is implemented by a few arithmetic libraries. Asymptotically faster algorithms exist; however, no computer is able to hold numbers big enough for those algorithms to outrun Schönhage-Strassen. The GNU Multiple Precision (GMP) library has a sequential-only implementation of Schönhage-Strassen. However, some algorithms contain a step that is a single big multiplication, so parallelizing such an algorithm requires a parallel algorithm for multiplication. One example is batch factorization for the Number Field Sieve. People implementing a parallel version of such algorithms therefore need an arithmetic library that implements parallel integer multiplication. One such library is Flint (Fast LIbrary for Number Theory), which contains a parallel implementation of Schönhage-Strassen. In this article we present an implementation of Schönhage-Strassen that reaches a speedup of 20 for the multiplication of two integers of 10^7 words of 64 bits on a 32-core Xeon Gold.
Title: Parallel integer multiplication. Venue: 2022 30th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP).
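To show why a single big multiplication can be parallelized at all, the sketch below splits one operand into limbs and computes the shifted partial products independently. This is deliberately the simple schoolbook decomposition, not Schönhage-Strassen and not the implementation presented in the paper; the limb size and the name `parallel_multiply` are assumptions, and CPython threads will not deliver a real speedup here because of the global interpreter lock.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_multiply(a, b, limb_bits=1 << 16, workers=4):
    """Multiply non-negative ints by summing independently computed partial products."""
    mask = (1 << limb_bits) - 1
    limbs, shift = [], 0
    while b:
        # Each limb is a slice of b together with its bit position.
        limbs.append((b & mask, shift))
        b >>= limb_bits
        shift += limb_bits
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # a * limb, shifted to the limb's position; the tasks share no state.
        partials = pool.map(lambda ls: (a * ls[0]) << ls[1], limbs)
        return sum(partials)
```

The partial products are independent, which is the property a parallel multiplication exploits; algorithms such as Schönhage-Strassen obtain their independent subproblems from an FFT-based decomposition instead.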
Pub Date: 2022-03-01, DOI: 10.1109/pdp55904.2022.00010
Ayaka Ohwada, Takuya Kojima, H. Amano
In recent years, IoT devices have become widespread, and energy-efficient coarse-grained reconfigurable architectures (CGRAs) have attracted attention. CGRAs comprise several processing units called processing elements (PEs) arranged in a two-dimensional array. The operations of the PEs and the interconnections between them are adaptively changed depending on the target application, which yields higher energy efficiency than general-purpose processors. The application kernel executed on a CGRA is represented as a data flow graph (DFG), and CGRA compilers are responsible for mapping the DFG onto the PE array. Thus, mapping algorithms significantly influence the performance and power efficiency of CGRAs as well as the compile time. This paper proposes POCOCO, a compiler framework for CGRAs that can use pre-optimized subgraph mappings, which reduces the compiler's optimization workload. To leverage the subgraph mappings, we extend an existing mapping method based on a genetic algorithm. Experiments on three architectures demonstrated that the proposed method reduces the optimization time by 48% on average in the best case among the three architectures.
Title: An efficient compilation of coarse-grained reconfigurable architectures utilizing pre-optimized sub-graph mappings. Venue: 2022 30th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP).