Cache Oblivious Strategies to Exploit Multi-Level Memory on Manycore Systems
Neil A. Butcher, Stephen L. Olivier, P. Kogge
Pub Date: 2020-11-01, DOI: 10.1109/MCHPC51950.2020.00011
Many-core systems are beginning to feature novel large, high-bandwidth intermediate memory as a visible part of the memory hierarchy. This paper discusses how to make use of intermediate memory when composing matrix multiply with transpose to compute $A \cdot A^T$. We re-purpose the cache-oblivious approach developed by Frigo et al. and apply it to the composition of a bandwidth-bound kernel (transpose) with a compute-bound kernel (matrix multiply). Particular focus is on matrix shapes far from square, which are not usually considered. Our codes are simpler than hand-optimized codes but come reasonably close in performance. Perhaps more importantly, we develop a paradigm for constructing other codes that use intermediate memories.
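As a rough illustration of the cache-oblivious style the abstract refers to, the sketch below recursively computes C += A·Bᵀ by always splitting the largest dimension, so calling it with B = A yields A·Aᵀ. It is a minimal serial sketch in C: the base-case size, the flat row-major layout, and the omission of the explicit transpose/staging through intermediate memory are all assumptions for illustration, not the authors' code.

```c
#include <stdio.h>
#include <string.h>

/* Cache-oblivious C += A * B^T (Frigo-style divide and conquer): halve the
 * largest of m, n, k at every level so the working set shrinks geometrically.
 * Row-major storage with leading dimensions lda, ldb, ldc. */
enum { BASE = 32 };   /* illustrative base-case size */

static void mm_abt(int m, int n, int k,
                   const double *A, int lda,
                   const double *B, int ldb,
                   double *C, int ldc)
{
    if (m <= BASE && n <= BASE && k <= BASE) {
        for (int i = 0; i < m; i++)
            for (int j = 0; j < n; j++) {
                double s = C[i * ldc + j];
                for (int l = 0; l < k; l++)
                    s += A[i * lda + l] * B[j * ldb + l];
                C[i * ldc + j] = s;
            }
        return;
    }
    if (m >= n && m >= k) {            /* split rows of A and C */
        int m1 = m / 2;
        mm_abt(m1, n, k, A, lda, B, ldb, C, ldc);
        mm_abt(m - m1, n, k, A + (size_t)m1 * lda, lda, B, ldb, C + (size_t)m1 * ldc, ldc);
    } else if (n >= k) {               /* split rows of B, columns of C */
        int n1 = n / 2;
        mm_abt(m, n1, k, A, lda, B, ldb, C, ldc);
        mm_abt(m, n - n1, k, A, lda, B + (size_t)n1 * ldb, ldb, C + n1, ldc);
    } else {                           /* split the shared k dimension */
        int k1 = k / 2;
        mm_abt(m, n, k1, A, lda, B, ldb, C, ldc);
        mm_abt(m, n, k - k1, A + k1, lda, B + k1, ldb, C, ldc);
    }
}

int main(void)
{
    enum { M = 3, K = 5 };
    double A[M * K], C[M * M];
    for (int i = 0; i < M * K; i++) A[i] = i + 1;
    memset(C, 0, sizeof C);
    mm_abt(M, M, K, A, K, A, K, C, M);             /* C = A * A^T */
    printf("C[0][0] = %g (expected 55)\n", C[0]);  /* 1+4+9+16+25 */
    return 0;
}
```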
{"title":"Cache Oblivious Strategies to Exploit Multi-Level Memory on Manycore Systems","authors":"Neil A. Butcher, Stephen L. Olivier, P. Kogge","doi":"10.1109/MCHPC51950.2020.00011","DOIUrl":"https://doi.org/10.1109/MCHPC51950.2020.00011","url":null,"abstract":"Many-core systems are beginning to feature novel large, high-bandwidth intermediate memory as a visible part of the memory hierarchy. This paper discusses how to make use of intermediate memory when composing matrix multiply with transpose to compute $A$ * AT. We re-purpose the cache-oblivious approach developed by Frigo et al. and apply it to the composition of a bandwidth-bound kernel (transpose) with a compute-bound kernel (matrix multiply). Particular focus is on regions of matrix shapes far from square that are not usually considered. Our codes are simpler than optimized codes, but reasonably close in performance. Also, perhaps of more importance is developing a paradigm for how to construct other codes using intermediate memories.","PeriodicalId":318919,"journal":{"name":"2020 IEEE/ACM Workshop on Memory Centric High Performance Computing (MCHPC)","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132988296","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Understanding the Impact of Memory Access Patterns in Intel Processors
Mohammad Alaul Haque Monil, Seyong Lee, J. Vetter, A. Malony
Pub Date: 2020-11-01, DOI: 10.1109/MCHPC51950.2020.00012
Because of increasing complexity in the memory hierarchy, predicting the performance of a given application on a given processor is becoming more difficult. The problem is worsened by the fact that the hardware needed to handle more complex memory traffic also affects energy consumption. Moreover, in a heterogeneous system with shared main memory, the memory traffic between the last level cache (LLC) and main memory creates contention among processors and accelerator devices. For these reasons, it is important to investigate and understand the impact of different memory access patterns on the memory system. This study investigates the interplay between Intel processors' memory hierarchy and different memory access patterns in applications. The authors explore sequential streaming and strided memory access patterns with the objective of predicting LLC-dynamic random access memory (DRAM) traffic for a given application on given Intel architectures. The impact of prefetching is also investigated. Experiments with different Intel micro-architectures uncover mechanisms to predict LLC-DRAM traffic that can yield up to 99% accuracy for sequential streaming access patterns and up to 95% accuracy for strided access patterns.
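For intuition about why the access pattern determines LLC-DRAM traffic, a naive first-order estimate can be written down directly (cold cache, 64-byte lines, prefetchers ignored). The paper's per-microarchitecture predictors are more involved; the function and constants below are illustrative assumptions only.

```c
#include <stdio.h>

/* First-order estimate of LLC-to-DRAM read traffic in bytes for n accesses of
 * elem_size-byte elements at a fixed stride (in elements), assuming a cold
 * cache, 64-byte lines, and no hardware prefetching. */
static long long est_dram_read_bytes(long long n, long long elem_size, long long stride)
{
    const long long LINE = 64;
    long long step = elem_size * stride;   /* bytes between consecutive accesses */
    if (step >= LINE)
        return n * LINE;                   /* every access pulls in a fresh line */
    /* dense sweep: roughly one transfer per line covered by the n accesses */
    return ((n * step + LINE - 1) / LINE) * LINE;
}

int main(void)
{
    long long n = 1LL << 20;               /* one million 8-byte loads */
    printf("sequential: %lld MiB\n", est_dram_read_bytes(n, 8, 1)  >> 20);
    printf("stride 16 : %lld MiB\n", est_dram_read_bytes(n, 8, 16) >> 20);
    return 0;
}
```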
{"title":"Understanding the Impact of Memory Access Patterns in Intel Processors","authors":"Mohammad Alaul Haque Monil, Seyong Lee, J. Vetter, A. Malony","doi":"10.1109/MCHPC51950.2020.00012","DOIUrl":"https://doi.org/10.1109/MCHPC51950.2020.00012","url":null,"abstract":"Because of increasing complexity in the memory hierarchy, predicting the performance of a given application in a given processor is becoming more difficult. The problem is worsened by the fact that the hardware needed to deal with more complex memory traffic also affects energy consumption. Moreover, in a heterogeneous system with shared main memory, the memory traffic between the last level cache (LLC) and the memory creates contention between other processors and accelerator devices. For these reasons, it is important to investigate and understand the impact of different memory access patterns on the memory system. This study investigates the interplay between Intel processors' memory hierarchy and different memory access patterns in applications. The authors explore sequential streaming and strided memory access patterns with the objective of predicting LLC-dynamic random access memory (DRAM) traffic for a given application in given Intel architectures. Moreover, the impact of prefetching is also investigated in this study. Experiments with different Intel micro-architectures uncover mechanisms to predict LLC-DRAM traffic that can yield up to 99% accuracy for sequential streaming access patterns and up to 95% accuracy for strided access patterns.","PeriodicalId":318919,"journal":{"name":"2020 IEEE/ACM Workshop on Memory Centric High Performance Computing (MCHPC)","volume":"213 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132128609","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Performance Potential of Mixed Data Management Modes for Heterogeneous Memory Systems
T. Effler, Michael R. Jantz, T. Jones
Pub Date: 2020-11-01, DOI: 10.1109/MCHPC51950.2020.00007
Many high-performance systems now include different types of memory devices within the same compute platform to meet strict performance and cost constraints. Such heterogeneous memory systems often include an upper-level tier with better performance but limited capacity, and lower-level tiers with higher capacity but less bandwidth and longer latencies for reads and writes. To utilize the different memory layers efficiently, current systems rely on hardware-directed, memory-side caching or provide facilities in the operating system (OS) that allow applications to make their own data-tier assignments. Since these data management options each come with their own set of trade-offs, many systems also include mixed data management configurations that allow applications to employ hardware- and software-directed management simultaneously, but for different portions of their address space. Despite the opportunity to address limitations of stand-alone data management options, such mixed management modes are under-utilized in practice and have not been evaluated in prior studies of complex memory hardware. In this work, we develop custom program profiling, configurations, and policies to study the potential of mixed data management modes to outperform hardware- or software-based management schemes alone. Our experiments, conducted on an Intel Knights Landing platform with high-bandwidth memory, demonstrate that the mixed data management mode achieves the same or better performance than the best stand-alone option for five memory-intensive benchmark applications (run separately and in isolation), with an average speedup of over 10% compared to the best stand-alone policy.
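A mixed configuration of the kind studied here can be expressed on Knights Landing with the memkind/hbwmalloc interface: a bandwidth-critical array is placed in MCDRAM explicitly while the rest of the heap stays under hardware-directed management. The sketch below is a generic example of that split, not the authors' profiling-driven policy; the array sizes and the choice of which array is "hot" are assumptions.

```c
#include <hbwmalloc.h>   /* memkind's high-bandwidth memory allocator; link with -lmemkind */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    size_t n = 1UL << 26;   /* assumed array length: ~512 MiB of doubles each */

    if (hbw_check_available() != 0)
        fprintf(stderr, "warning: no high-bandwidth memory detected\n");

    /* Software-directed placement for the bandwidth-critical array ... */
    double *hot = hbw_malloc(n * sizeof(double));
    /* ... while the rest of the heap stays hardware-managed (served through
     * the MCDRAM cache when the node boots in cache or hybrid mode). */
    double *cold = malloc(n * sizeof(double));
    if (hot == NULL || cold == NULL) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }

    for (size_t i = 0; i < n; i++)
        hot[i] = cold[i] = (double)i;   /* stand-in for the real workload */

    hbw_free(hot);
    free(cold);
    return 0;
}
```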
{"title":"Performance Potential of Mixed Data Management Modes for Heterogeneous Memory Systems","authors":"T. Effler, Michael R. Jantz, T. Jones","doi":"10.1109/MCHPC51950.2020.00007","DOIUrl":"https://doi.org/10.1109/MCHPC51950.2020.00007","url":null,"abstract":"Many high-performance systems now include different types of memory devices within the same compute platform to meet strict performance and cost constraints. Such heterogeneous memory systems often include an upper-level tier with better performance, but limited capacity, and lower-level tiers with higher capacity, but less bandwidth and longer latencies for reads and writes. To utilize the different memory layers efficiently, current systems rely on hardware-directed, memory -side caching or they provide facilities in the operating system (OS) that allow applications to make their own data-tier assignments. Since these data management options each come with their own set of trade-offs, many systems also include mixed data management configurations that allow applications to employ hardware- and software-directed management simultaneously, but for different portions of their address space. Despite the opportunity to address limitations of stand-alone data management options, such mixed management modes are under-utilized in practice, and have not been evaluated in prior studies of complex memory hardware. In this work, we develop custom program profiling, configurations, and policies to study the potential of mixed data management modes to outperform hardware- or software-based management schemes alone. Our experiments, conducted on an Intel ® Knights Landing platform with high-bandwidth memory, demonstrate that the mixed data management mode achieves the same or better performance than the best stand-alone option for five memory intensive benchmark applications (run separately and in isolation), resulting in an average speedup compared to the best stand-alone policy of over 10 %, on average.","PeriodicalId":318919,"journal":{"name":"2020 IEEE/ACM Workshop on Memory Centric High Performance Computing (MCHPC)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115034020","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Message from the Workshop Chairs
NMIC
Pub Date: 2020-11-01, DOI: 10.1109/mchpc51950.2020.00004
New computation technologies, such as big data analytics, modern machine learning, artificial intelligence (AI), blockchain, and security processing, have great potential to be embedded into the network to make it intelligent and trustworthy. On the other hand, Information-Centric Networking (ICN), software-defined networking (SDN), network function virtualization (NFV), Delay Tolerant Networks (DTN), Vehicular Ad Hoc Networks (VANET), network slicing, and data center networks have emerged as novel networking paradigms for fast and efficient data delivery and retrieval. Against this backdrop, there is a strong trend to move computation from the cloud not only to the edge but also to resource-sufficient networking nodes, which drives the convergence of these emerging networking concepts with the new computation technologies. The NMIC 2019 workshop brings together researchers to discuss the technical challenges and applications of distributed computation for networking, intelligent computation supported by novel networking technologies, and the enforcement of series of computations. The accepted papers combine these computation technologies with ICN, SDN, DTN, VANET, and mobile networks to make them more intelligent, trustworthy, and efficient. We would especially like to thank the ICDCS 2019 organization team and all the TPC members of the NMIC 2019 workshop. Without their kind help, the NMIC workshop would not have been possible.
{"title":"Message from the Workshop Chairs","authors":"Nmic","doi":"10.1109/mchpc51950.2020.00004","DOIUrl":"https://doi.org/10.1109/mchpc51950.2020.00004","url":null,"abstract":"The new computation technologies, such as big data analytics, modern machine learning technology, artificial intelligence (AI), blockchain, and security processing, have the great potential to be embedded into network to enable it to be intelligent and trustworthy. On the other hand, Information-Centric Networking (ICN), software-defined network (SDN), network function virtualization (NFV), Delay Tolerant Network (DTN), Vehicular Ad Hoc NETwork (VANET), network slicing, and data center network have emerged as the novel networking paradigms for fast and efficient delivering and retrieving data. Against this backdrop, there is a strong trend to move the computations from the cloud to not only the edges but also the resource-sufficient networking nodes, which triggers the convergence between the emerging networking concepts and the new computation technologies. The NMIC workshop 2019 brings together researchers to discuss the technical challenges and applications of the distributed computations for networking, the intelligent computations supported by the novel networking technologies, and the enforcement of series of computations. The accepted papers combine the computation technologies with ICN, SDN, DTN, VANET, and mobile network to enable them to be more intelligent, trustworthy, and efficient. We would like to especially thank the ICDCS 2019 organization team and all the TPC members of NMIC workshop 2019. Without their kind help, the NMIC workshop would not be possible.","PeriodicalId":318919,"journal":{"name":"2020 IEEE/ACM Workshop on Memory Centric High Performance Computing (MCHPC)","volume":"137 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122434497","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Leveraging a Heterogeneous Memory System for a Legacy Fortran Code: The Interplay of Storage Class Memory, DRAM and OS
Steffen Christgau, T. Steinke
Pub Date: 2020-11-01, DOI: 10.1109/MCHPC51950.2020.00008
Large-capacity Storage Class Memory (SCM) opens new possibilities for workloads requiring a large memory footprint. We examine optimization strategies for a legacy Fortran application on systems with a heterogeneous memory configuration comprising SCM and DRAM. We present a performance study for the multigrid solver component of the large-eddy simulation framework PALM for different memory configurations with large-capacity SCM. An important optimization approach is the explicit assignment of storage locations, depending on the data access characteristics, to take advantage of the heterogeneous memory configuration. We demonstrate that explicit control over memory locations provides better performance than transparent hardware settings. Since page management by the OS appears to be a critical performance factor on such systems, we also study the impact of different huge page settings.
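One common way to express explicit placement of this kind is to treat the SCM DIMMs as a memory-only NUMA node and bind allocations to it. The C sketch below assumes hypothetical node numbers and array sizes and only illustrates the DRAM-versus-SCM assignment idea; it is not the PALM implementation, which is a Fortran code.

```c
#include <numa.h>    /* libnuma; link with -lnuma */
#include <stdio.h>

/* Hypothetical node numbering: 0 = DRAM, 2 = SCM exposed as a memory-only
 * NUMA node (e.g., Optane DC in App Direct mode via KMEM DAX). */
enum { DRAM_NODE = 0, SCM_NODE = 2 };

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "libnuma not available\n");
        return 1;
    }

    size_t big_bytes = 16UL << 30;   /* large, read-mostly field: SCM  */
    size_t hot_bytes = 1UL  << 30;   /* frequently updated array: DRAM */

    double *field = numa_alloc_onnode(big_bytes, SCM_NODE);
    double *work  = numa_alloc_onnode(hot_bytes, DRAM_NODE);
    if (field == NULL || work == NULL) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }

    /* ... solver sweeps would read 'field' occasionally and update 'work'
     *     in every iteration ... */

    numa_free(work, hot_bytes);
    numa_free(field, big_bytes);
    return 0;
}
```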
{"title":"Leveraging a Heterogeneous Memory System for a Legacy Fortran Code: The Interplay of Storage Class Memory, DRAM and OS","authors":"Steffen Christgau, T. Steinke","doi":"10.1109/MCHPC51950.2020.00008","DOIUrl":"https://doi.org/10.1109/MCHPC51950.2020.00008","url":null,"abstract":"Large capacity Storage Class Memory (SCM) opens new possibilities for workloads requiring a large memory footprint. We examine optimization strategies for a legacy Fortran application on systems with an heterogeneous memory configuration comprising SCM and DRAM. We present a performance study for the multigrid solver component of the large-eddy simulation framework PALM for different memory configurations with large capacity SCM. An important optimization approach is the explicit assignment of storage locations depending on the data access characteristic to take advantage of the heterogeneous memory configuration. We are able to demonstrate that an explicit control over memory locations provides better performance compared to transparent hardware settings. As on aforementioned systems the page management by the OS appears as critical performance factor, we study the impact of different huge page settings.","PeriodicalId":318919,"journal":{"name":"2020 IEEE/ACM Workshop on Memory Centric High Performance Computing (MCHPC)","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116214584","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Architecting Heterogeneous Memory Systems with DRAM Technology Only: A Case Study on Relational Database
Yifan Qiao, Xubin Chen, Jingpeng Hao, Tong Zhang, C. Xie, Fei Wu
Pub Date: 2020-11-01, DOI: 10.1109/MCHPC51950.2020.00009
This paper advocates a DRAM-only design strategy for architecting high-performance, low-cost heterogeneous memory systems in future computing systems, and demonstrates its potential in the context of relational databases. In particular, we envision a heterogeneous DRAM fabric consisting of convenient but expensive byte-addressable DRAM and large-capacity, low-cost DRAM with coarse access granularity (e.g., 1K-byte). Regardless of the specific memory technology, one can reduce manufacturing cost by sacrificing raw memory reliability and apply an error correction code (ECC) to restore data storage integrity. The efficiency of ECC improves significantly as the codeword length increases, which enlarges the memory access granularity. This leads to a fundamental trade-off between memory cost and access granularity. Following this principle, the Intel 3DXP-based Optane memory DIMM internally operates with a 256-byte ECC codeword length (hence 256-byte access granularity), and Hynix recently demonstrated a low-cost DRAM DIMM with a 64-byte access granularity. This paper presents a design approach that enables relational databases to take full advantage of the envisioned low-cost heterogeneous DRAM fabric to improve performance with only minimal database source code modification. Using MySQL as a test vehicle, we implemented a prototype system and demonstrated its effectiveness under the TPC-C and Sysbench OLTP benchmarks.
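The cost/granularity trade-off can be made concrete with a back-of-the-envelope calculation: even for a simple single-error-correcting Hamming code, the relative parity overhead shrinks as the protected block grows. Real memory ECC uses stronger codes (e.g., symbol-based or BCH), so the numbers below are only an illustration of the trend the paper builds on.

```c
#include <stdio.h>

/* Smallest r with 2^r >= m + r + 1: parity bits of a single-error-correcting
 * Hamming code over m data bits. */
static int hamming_parity_bits(long m)
{
    int r = 1;
    while ((1L << r) < m + r + 1)
        r++;
    return r;
}

int main(void)
{
    long data_bytes[] = { 64, 256, 1024 };   /* cache line, Optane-like, 1K-byte */
    for (int i = 0; i < 3; i++) {
        long m = data_bytes[i] * 8;
        int  r = hamming_parity_bits(m);
        printf("%4ld-byte data block: %2d parity bits (%.2f%% overhead)\n",
               data_bytes[i], r, 100.0 * r / m);
    }
    return 0;
}
```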
{"title":"Architecting Heterogeneous Memory Systems with DRAM Technology Only: A Case Study on Relational Database","authors":"Yifan Qiao, Xubin Chen, Jingpeng Hao, Tong Zhang, C. Xie, Fei Wu","doi":"10.1109/MCHPC51950.2020.00009","DOIUrl":"https://doi.org/10.1109/MCHPC51950.2020.00009","url":null,"abstract":"This paper advocates a DRAM-only design strategy to architect high-performance low-cost heterogeneous memory systems in future computing systems, and demonstrates its potential in the context of relational database. In particular, we envision a heterogeneous DRAM fabric consisting of convenient but expensive byte-addressable DRAM and large-capacity low-cost DRAM with coarse access granularity (e.g., 1K-byte). Regardless of specific memory technology, one can reduce the manufacturing cost by sacrificing the memory raw reliability, and apply error correction code (ECC) to restore the data storage integrity. The efficiency of ECC significantly improves as the codeword length increases, which enlarges the memory access granularity. This leads to a fundamental trade-off between memory cost and access granularity. Following this principle, Intel 3DXP-based Optane memory DIMM internally operates with a 256-byte ECC codeword length (hence 256-byte access granularity), and Hynix recently demonstrated low-cost DRAM DIMM with a 64-byte access granularity. This paper presents a design approach that enables relational database to take full advantage of the envisioned low-cost heterogeneous DRAM fabric to improve performance with only minimal database source code modification. Using MySQL as a test vehicle, we implemented a prototyping system, on which we have demonstrated its effectiveness under TPC-C and Sysbench OLTP benchmarks.","PeriodicalId":318919,"journal":{"name":"2020 IEEE/ACM Workshop on Memory Centric High Performance Computing (MCHPC)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129297386","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Persistent Memory Object Storage and Indexing for Scientific Computing
Awais Khan, Hyogi Sim, Sudharshan S. Vazhkudai, Jinseok Ma, Myeonghoon Oh, Youngjae Kim
Pub Date: 2020-11-01, DOI: 10.1109/MCHPC51950.2020.00006
This paper presents Mosiqs, a persistent memory object storage framework with metadata indexing and querying for scientific computing. We design Mosiqs based on the key idea that memory objects on a shared PM pool can live beyond the application lifetime and become the sharing currency for applications and scientists. Mosiqs provides an aggregate memory pool atop an array of persistent memory devices to store and access memory objects. Mosiqs uses a lightweight persistent memory key-value store to manage the metadata of memory objects, such as persistent pointer mappings, which enables memory object sharing for effective scientific collaboration. Mosiqs is implemented atop PMDK. We evaluate the proposed approach on a many-core server with an array of real PM devices. The preliminary evaluation shows a 100% improvement in write performance and a 30% improvement in read performance over a PM-aware file system approach.
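Since Mosiqs is built atop PMDK, a flavor of what "memory objects that outlive the application" means can be conveyed with plain libpmemobj calls. The pool path, layout name, and metadata record below are hypothetical and the sketch is non-transactional; it is not the Mosiqs API.

```c
#include <libpmemobj.h>   /* PMDK; link with -lpmemobj */
#include <stdio.h>
#include <string.h>

/* Hypothetical metadata record for one shareable memory object; the real
 * Mosiqs schema and key-value layout are not described in the abstract. */
struct obj_meta {
    PMEMoid data;      /* persistent pointer to the payload */
    size_t  size;
    char    name[64];  /* key under which collaborators find the object */
};

int main(void)
{
    /* Pool path and layout name are assumptions for this sketch. */
    PMEMobjpool *pop = pmemobj_create("/mnt/pmem/mosiqs_demo.pool",
                                      "demo", PMEMOBJ_MIN_POOL, 0666);
    if (pop == NULL) {
        perror("pmemobj_create");
        return 1;
    }

    PMEMoid root = pmemobj_root(pop, sizeof(struct obj_meta));
    struct obj_meta *meta = pmemobj_direct(root);

    /* Allocate a payload that survives process exit and record it. */
    if (pmemobj_zalloc(pop, &meta->data, 1 << 20, 0) != 0) {
        perror("pmemobj_zalloc");
        pmemobj_close(pop);
        return 1;
    }
    meta->size = 1 << 20;
    strcpy(meta->name, "checkpoint_0042");
    /* Simplified, non-transactional: just flush the metadata to PM. */
    pmemobj_persist(pop, meta, sizeof(*meta));

    pmemobj_close(pop);
    return 0;
}
```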
{"title":"Persistent Memory Object Storage and Indexing for Scientific Computing","authors":"Awais Khan, Hyogi Sim, Sudharshan S. Vazhkudai, Jinseok Ma, Myeonghoon Oh, Youngjae Kim","doi":"10.1109/MCHPC51950.2020.00006","DOIUrl":"https://doi.org/10.1109/MCHPC51950.2020.00006","url":null,"abstract":"This paper presents Mosiqs, a persistent memory object storage framework with metadata indexing and querying for scientific computing. We design Mosiqs based on the key idea that memory objects on shared PM pool can live beyond the application lifetime and can become the sharing currency for applications and scientists. Mosiqs provides an aggregate memory pool atop an array of persistent memory devices to store and access memory objects. Mosiqs uses a lightweight persistent memory key-value store to manage the metadata of memory objects such as persistent pointer mappings, which enables memory object sharing for effective scientific collaborations. Mosiqs is implemented atop PMDK. We evaluate the proposed approach on many-core server with an array of real PM devices. The preliminary evaluation confirms a 100% improvement for write and 30% in read performance against a PM-aware file system approach.","PeriodicalId":318919,"journal":{"name":"2020 IEEE/ACM Workshop on Memory Centric High Performance Computing (MCHPC)","volume":"114 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116520062","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hostile Cache Implications for Small, Dense Linear Solves
Tom Deakin, J. Cownie, Simon McIntosh-Smith, J. Lovegrove, R. Smedley-Stevenson
Pub Date: 2020-10-02, DOI: 10.1109/MCHPC51950.2020.00010
Full assembly of the stiffness matrix in finite element codes can be prohibitive in terms of the memory footprint required to store that enormous matrix. An optimisation and workaround, particularly effective for discontinuous Galerkin-based approaches, is to construct and solve small dense linear systems locally within each element, avoiding the global assembly entirely. The independent linear systems can be solved concurrently in a batched manner; however, we have found that the memory subsystem can show destructive behaviour in this paradigm, severely affecting performance. In this paper we demonstrate the range of performance that can be obtained by allocating the local systems differently, along with evidence attributing the reasons for these differences.
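One plausible way an allocation choice can provoke such destructive behaviour is when every element's local system starts at the same large power-of-two stride, so corresponding rows all map to the same cache sets. The sketch below contrasts a packed layout with a padded one; the sizes and the pad are assumptions for illustration, not the configurations measured in the paper.

```c
#include <stdio.h>
#include <stdlib.h>

/* With N = 32 doubles, one N x N local system is exactly 8 KiB, so packing
 * systems back-to-back separates row i of every element by a large power of
 * two, which can collide in the same cache sets.  A small per-system pad
 * staggers that mapping. */
enum { N = 32, NELEMS = 4096, PAD = 8 };

static double *alloc_local_systems(int padded)
{
    size_t per_sys = (size_t)N * N + (padded ? PAD : 0);  /* doubles per system */
    /* 64-byte alignment matches the cache-line size; the total size stays a
     * multiple of the alignment, as aligned_alloc (C11) requires. */
    return aligned_alloc(64, NELEMS * per_sys * sizeof(double));
}

int main(void)
{
    double *packed = alloc_local_systems(0);
    double *padded = alloc_local_systems(1);
    if (packed == NULL || padded == NULL) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }
    printf("per-system stride: packed %zu bytes, padded %zu bytes\n",
           (size_t)N * N * sizeof(double), ((size_t)N * N + PAD) * sizeof(double));
    free(packed);
    free(padded);
    return 0;
}
```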
{"title":"Hostile Cache Implications for Small, Dense Linear Solves","authors":"Tom Deakin, J. Cownie, Simon McIntosh-Smith, J. Lovegrove, R. Smedley-Stevenson","doi":"10.1109/MCHPC51950.2020.00010","DOIUrl":"https://doi.org/10.1109/MCHPC51950.2020.00010","url":null,"abstract":"The full assembly of the stiffness matrix in finite element codes can be prohibitive in terms of memory footprint resulting from storing that enormous matrix. An optimisation and work around, particularly effective for discontinuous Galerkin based approaches, is to construct and solve the small dense linear systems locally within each element and avoid the global assembly entirely. The different independent linear systems can be solved concurrently in a batched manner, however we have found that the memory subsystem can show destructive behaviour in this paradigm, severely affecting the performance. In this paper we demonstrate the range of performance that can be obtained by allocating the local systems differently, along with evidence to attribute the reasons behind these differences.","PeriodicalId":318919,"journal":{"name":"2020 IEEE/ACM Workshop on Memory Centric High Performance Computing (MCHPC)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129869141","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}