Pub Date : 2018-11-15DOI: 10.1109/PDP2018.2018.00036
Manuel Pedrero, E. Gutiérrez, O. Plata
Barriers are a very common synchronization method in parallel programming. They are typically used to enforce a partial thread execution order, since there may be dependences between the code sections before and after the barrier. This work proposes TMbarrier, a new barrier design intended for transactional applications. TMbarrier allows threads to continue executing speculatively past the barrier, under the assumption that there are no dependences with safe threads that have not yet reached it. Our design leverages transactional memory (TM) (specifically, the implementation offered by the IBM POWER8 processor) to hold the speculative updates and to detect possible conflicts between speculative and safe threads. Despite the limitations of the best-effort hardware TM implementations in current processors, experiments show a reduction in the time wasted on synchronization compared to standard barriers.
Title: TMbarrier: Speculative Barriers Using Hardware Transactional Memory
Published in: 2018 26th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP)
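The core mechanism can be sketched in software (a hypothetical emulation of the idea; the paper relies on POWER8 hardware transactions, and all names here are illustrative): a thread that passes the barrier speculatively buffers its writes and tracks its reads, and a write by a lagging safe thread to an address in that read set aborts the speculation.

```python
# Software emulation of the TMbarrier idea: speculative execution past a
# barrier with conflict detection. All names are illustrative; the paper
# uses POWER8 hardware transactions, not this software model.

class SpeculativeContext:
    def __init__(self, memory):
        self.memory = memory          # shared "safe" memory
        self.read_set = set()
        self.write_buffer = {}        # speculative updates, not yet visible
        self.aborted = False

    def read(self, addr):
        self.read_set.add(addr)
        return self.write_buffer.get(addr, self.memory.get(addr, 0))

    def write(self, addr, value):
        self.write_buffer[addr] = value   # held back until commit

    def conflict(self, addr):
        # A safe (pre-barrier) thread wrote addr: abort if we read it.
        if addr in self.read_set:
            self.aborted = True
        return self.aborted

    def commit(self):
        if self.aborted:
            return False
        self.memory.update(self.write_buffer)  # make updates visible
        return True


memory = {"x": 1, "y": 10}
spec = SpeculativeContext(memory)
spec.write("y", spec.read("x") + 1)   # speculative post-barrier work reads x
spec.conflict("x")                    # lagging safe thread writes x: conflict
committed = spec.commit()             # speculation must be discarded
```

In the real design the read/write sets and the conflict check are provided by the HTM hardware, and an abort triggers re-execution of the post-barrier code once all threads are safe.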
Pub Date : 2018-06-07DOI: 10.1109/PDP2018.2018.00010
Christina Herzog, J. Pierson
This paper proposes an agent-based approach to scheduling jobs in data centers under thermal constraints. The model encompasses both the temporal and spatial aspects of the temperature evolution in a unified way, taking into account the dynamics of heat production and dissipation. Agents coordinate to eventually move jobs to the most suitable place and to dynamically adapt the nodes' frequency settings to the best combination. Several agent objectives are compared under different circumstances through an extensive set of experiments.
Title: A Generic Learning Multi-agent-System Approach for Spatio-Temporal-, Thermal- and Energy-Aware Scheduling
Pub Date : 2018-06-07DOI: 10.1109/PDP2018.2018.00047
A. Adewojo, J. Bass
Multi-tenancy in cloud computing describes the extent to which resources can be shared while guaranteeing isolation among the components (tenants) using them. There are three multi-tenancy patterns: the shared, tenant-isolated and dedicated component patterns. These patterns have not previously been formally specified. In order to create a precise definition and verify each pattern, we formally specify each one using the Z language. To validate the interpretation of our formal description, we empirically evaluate each pattern using the data tier of a cloud-hosted distributed content management application, WordPress, deployed in a Docker container. Experimental results show that the dedicated pattern successfully managed larger numbers of tenants with fewer unhandled request errors. The shared and tenant-isolated patterns exhibited larger numbers of unhandled request errors as the number of tenants increased. We present a selection algorithm for choosing a suitable multi-tenancy pattern for cloud deployment of content management systems.
Title: Evaluating the Effect of Multi-Tenancy Patterns in Containerized Cloud-Hosted Content Management System
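The difference between the three patterns can be illustrated at the data tier as a routing function from tenant and pattern to storage component (this sketch is ours, not the paper's Z specification; the component names are hypothetical):

```python
# Illustrative sketch of the three multi-tenancy patterns: how a tenant's
# request is mapped onto shared or dedicated storage components.

def storage_for(tenant, pattern):
    if pattern == "shared":
        # All tenants share one component; rows carry a tenant id.
        return ("db0", "schema0", tenant)
    if pattern == "tenant-isolated":
        # One shared component, but an isolated schema per tenant.
        return ("db0", f"schema_{tenant}", tenant)
    if pattern == "dedicated":
        # A dedicated component (database) per tenant.
        return (f"db_{tenant}", "schema0", tenant)
    raise ValueError(f"unknown pattern: {pattern}")
```

The isolation/cost trade-off the paper measures follows directly from this mapping: the dedicated pattern gives each tenant its own component (best isolation, highest resource cost), while the shared pattern concentrates all tenants on one component.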
Pub Date : 2018-06-06DOI: 10.1109/PDP2018.2018.00018
A. Owenson, Steven A. Wright, Richard A. Bunt, S. Jarvis, Y. Ho, Matthew J. Street
Achieving high performance in large scientific codes is a difficult task. This has led to the development of numerous mini-applications that are more tractable to analyse while retaining the performance characteristics of their full-sized counterparts. These "mini-apps" also enable faster hardware evaluation and, for sensitive codes, allow evaluation of systems outside of access approval processes. In this paper we develop a mini-application of a geometric multigrid, unstructured grid Computational Fluid Dynamics (CFD) code, designed to exhibit similar performance characteristics without sharing code. We detail our experiences developing this application using guidelines from existing research, and contribute further additions to aid future mini-application developers. Our application is validated against the inviscid flux routine of HYDRA, a CFD code developed by Rolls-Royce, which confirms that the parent kernel and the mini-application share fundamental causes of parallel inefficiency. We then use the mini-application to assess the impact of Intel's Knights Landing (KNL) on performance. We find that the mini-app and the parent kernel continue to share scaling characteristics; however, a comparison with Broadwell performance exposed significant differences between the kernels that went undetected by the validation.
Title: Developing and Using a Geometric Multigrid, Unstructured Grid Mini-Application to Assess Many-Core Architectures
Pub Date : 2018-03-21DOI: 10.1109/PDP2018.2018.00052
V. Buravlev, R. Nicola, Alberto Lluch-Lafuente, C. A. Mezzina
Data availability is a key aspect of modern distributed systems. We discuss an extension of tuple-space-based coordination languages with programming abstractions for sharing data and guaranteeing its availability under different consistency guarantees. Data can be spread over the system according to user-specified replica placement strategies and user-specified consistency requirements. The framework then takes care of the low-level management of the replicas, so that the programmer can focus on the business logic of the application. We advocate that the proposed programming primitives benefit data-oriented applications in which different kinds of data have different availability and consistency needs.
Title: Improving Availability in Distributed Tuple Spaces Via Sharing Abstractions and Replication Strategies
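The division of labour described above can be sketched as follows (a minimal sketch under our own assumptions; the class and method names are hypothetical, not the paper's API): the programmer supplies a placement strategy, and the runtime, not the programmer, maintains the replicas.

```python
# Minimal sketch of a replicated tuple space: the user specifies where
# tuples are replicated, the framework manages the replicas, and reads
# are served from the local replica (availability over freshness).

class TupleSpace:
    def __init__(self, nodes, placement):
        self.replicas = {n: set() for n in nodes}
        self.placement = placement    # tuple -> nodes to replicate on

    def out(self, tup):
        # The runtime, not the programmer, fans the tuple out.
        for node in self.placement(tup):
            self.replicas[node].add(tup)

    def rd(self, node, pred):
        # Non-destructive read against the local replica only.
        return {t for t in self.replicas[node] if pred(t)}


# User-specified strategy: replicate every tuple on nodes "a" and "b".
space = TupleSpace(["a", "b", "c"], placement=lambda t: ["a", "b"])
space.out(("sensor", 42))
```

A stronger consistency requirement would correspond to `rd` consulting (or synchronizing with) remote replicas before answering, which is exactly the cost the per-datum consistency choice lets the application avoid when freshness does not matter.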
Pub Date : 2018-03-21DOI: 10.1109/PDP2018.2018.00011
Roussian R. A. Gaioso, V. Gil-Costa, H. Guardia, H. Senger
In this paper we propose and evaluate new strategies for parallel top-k query processing on GPUs. Our strategies are based on the document-at-a-time approach and have been implemented and tested with the WAND ranking algorithm. In our first strategy (named homogeneous), the posting lists are evenly partitioned among thread blocks. Our second algorithm, named heterogeneous, partitions the posting lists according to document identifier intervals, so partitions may have different sizes. We also propose three threshold sharing policies, named Local, Safe-R and Safe-WR, which emulate the WAND algorithm's global pruning technique. We evaluated our proposals using AND/OR queries, and the results show that the homogeneous algorithm allows better speedups through higher occupancy of the SMs, but at the cost of lower recall. The heterogeneous algorithm produces the exact top-k documents and shows promising speedups. In addition, the Safe-R and Safe-WR threshold propagation policies achieved better performance, provided there is enough work per thread block, which proved true for queries touching at least a few million documents.
Title: A Parallel Implementation of WAND on GPUs
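The pruning idea behind the threshold sharing policies can be sketched as follows (a simplified, sequential sketch under our own assumptions, not the paper's GPU code): each partition plays the role of a thread block, and documents whose score upper bound cannot beat the current top-k threshold are skipped without being fully scored.

```python
# WAND-style top-k with partitioned work and a shared pruning threshold.
# Sequential stand-in for the per-thread-block processing; names are
# illustrative, not the paper's implementation.

import heapq

def topk_partitioned(partitions, k, score, upper_bound):
    heap = []   # global top-k min-heap; heap[0] is the pruning threshold
    for part in partitions:                  # one "thread block" each
        # Threshold known when this block starts (cf. sharing policies).
        threshold = heap[0] if len(heap) == k else float("-inf")
        for doc in part:
            if upper_bound(doc) <= threshold:
                continue                     # WAND-style pruning: skip scoring
            s = score(doc)
            if len(heap) < k:
                heapq.heappush(heap, s)
            elif s > heap[0]:
                heapq.heapreplace(heap, s)
                threshold = heap[0]          # refresh threshold locally
    return sorted(heap, reverse=True)
```

The policies differ in how eagerly `threshold` is propagated between blocks: a stale (lower) threshold is always safe, since it only causes extra scoring, never a missed top-k document.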
Pub Date : 2018-03-21DOI: 10.1109/PDP2018.2018.00120
Dalvan Griebler, Junior Loff, G. Mencagli, M. Danelutto, L. G. Fernandes
Benchmarking is a way to study the performance of new architectures and parallel programming frameworks. Well-established benchmark suites such as the NAS Parallel Benchmarks (NPB) comprise legacy codes that still lack a port to the C++ language. As a consequence, a set of high-level and easy-to-use C++ parallel programming frameworks cannot be tested with NPB. Our goal is to describe a C++ port of the NPB kernels and to analyze the performance achieved by different parallel implementations written with the Intel TBB, OpenMP and FastFlow frameworks for multi-cores. The experiments show an efficient code port from Fortran to C++ and, on average, an efficient parallelization.
Title: Efficient NAS Benchmark Kernels with C++ Parallel Programming
Pub Date : 2018-03-21DOI: 10.1109/PDP2018.2018.00019
S. Aali, H. Shahhoseini, N. Bagherzadeh
This paper addresses the divisible load scheduling of image processing applications on a heterogeneous star network. In our platform, processors and links have different speeds, and computation and communication overheads are taken into account. A new genetic algorithm for minimizing the processing time of low-level image applications using divisible load theory is introduced. A closed-form solution for the processing time and for the image fractions that should be assigned to each processor is obtained. The optimum number of participating processors and the optimal sequence for load distribution are derived with the new genetic algorithm. The effect of different image and kernel sizes on processing time and speedup is investigated. Finally, several numerical experiments are presented to demonstrate the efficiency of our algorithm.
Title: Divisible Load Scheduling of Image Processing Applications on the Heterogeneous Star Network Using a new Genetic Algorithm
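The kind of closed form involved can be sketched from standard divisible load theory (this is the textbook DLT result for a star network under sequential distribution, not necessarily the paper's exact model; the paper's genetic algorithm then searches over the processor subset and distribution sequence to which such a formula applies). Requiring all processors to finish simultaneously gives alpha[i] * w[i] = alpha[i+1] * (z[i+1] + w[i+1]), plus the normalization sum(alpha) = 1:

```python
# Standard divisible-load fractions for a heterogeneous star network,
# assuming sequential load distribution and simultaneous completion.
# w[i]: time to compute one unit of load on processor i
# z[i]: time to transmit one unit of load to processor i (z[0] shifts
#       every finish time equally, so it cancels out of the ratios)

def dlt_fractions(w, z):
    # Build unnormalized fractions from the chain
    #   alpha[i] * w[i] = alpha[i+1] * (z[i+1] + w[i+1])
    ratios = [1.0]
    for i in range(1, len(w)):
        ratios.append(ratios[-1] * w[i - 1] / (z[i] + w[i]))
    total = sum(ratios)
    return [r / total for r in ratios]


# Processor 1 is twice as slow and pays a unit communication cost,
# so it receives a smaller fraction of the image.
fractions = dlt_fractions([1.0, 2.0], [1.0, 1.0])   # -> [0.75, 0.25]
```

The genetic algorithm's search space is the ordering and subset of processors; for each candidate, a closed form like the one above prices the schedule without simulation.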
Pub Date : 2018-03-21DOI: 10.1109/PDP2018.2018.00059
Richard Grunzke, Volker Hartmann, T. Jejkal, H. Kollai, C. Dressler, Julia Dolhoff, Julia Stanek, H. Herold, A. Hoffmann, R. Müller-Pfefferkorn, Torsten Schrade, S. Herres‐Pawlis, G. Meinel, W. Nagel
Research data is increasingly important for gaining scientific insights. To optimally foster this, the management of research data must be usable, customizable and fast. We enable this by building up the MASi research data management repository service, based on the KIT DM framework. The aim is to utilize a single repository instance to serve multiple arbitrary community use cases. Because these use cases have diverse data characteristics, the performance of the MASi service has to be adequate across all of them. We evaluate the performance along three initial heterogeneous use cases, investigating several aspects: first, the object insertion and query performance of the database as the object fill level grows; second and third, the ingest and download performance for digital objects using real-life data sets. Highly favorable performance characteristics are shown.
Title: Performance Evaluation of the Metadata-Driven MASi Research Data Management Repository Service
Pub Date : 2018-03-21DOI: 10.1109/PDP2018.2018.00042
N. Tanabe, Toshio Endo
Intel has announced a Xeon platform with high-latency main memory based on 3D XPoint, to be launched in 2018. This paper presents a performance evaluation of sparse matrix kernels on future supercomputers with such high-latency main memory. The authors propose a high-throughput evaluation methodology for exhaustive experiments, using the University of Florida sparse matrix collection and LIS (a Library of Iterative Solvers for linear systems), among others. The proposed methodology is very simple to use, highly flexible with respect to the environment, and high-throughput. Using it, the latency sensitivity of SpMV is measured with 208 sparse matrices and ten storage formats in only two days, an evaluation that would take about ten years with conventional simulators. We gained several interesting insights about latency-sensitive kernels, sparse matrices, storage formats, preconditioners, and more. We observed notable latency sensitivity in some applications, namely Graph500, HPCG and parts of the preconditioners of iterative solvers. We found that the latency sensitivity of SpMV is high for matrices larger than the capacity of the last-level cache. This suggests that 3D XPoint main memory must be combined with a large DRAM cache.
Title: Characterizing Memory-Latency Sensitivity of Sparse Matrix Kernels
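The latency-bound behavior is easy to see in a CSR sparse matrix-vector multiply sketch (illustrative, not the paper's code): the gather `x[col_idx[k]]` is an irregular, data-dependent access, and once the matrix and vector exceed the last-level cache these loads miss to main memory, so the kernel's runtime tracks memory latency rather than compute throughput.

```python
# Sparse matrix-vector multiply y = A*x with A in CSR format.
# values:  nonzero entries, row by row
# col_idx: column index of each nonzero
# row_ptr: row_ptr[r]..row_ptr[r+1] delimits row r's nonzeros

def spmv_csr(values, col_idx, row_ptr, x):
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        acc = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]   # irregular, latency-bound gather
        y[row] = acc
    return y
```

The alternative storage formats compared in the paper (ELL, SELL-type, blocked formats, and so on) mainly reorganize this gather to improve locality and vectorization, which is why their latency sensitivity differs.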