
Latest Publications from the 2020 IEEE 27th International Conference on High Performance Computing, Data, and Analytics (HiPC)

Exposing data locality in HPC-based systems by using the HDFS backend
José Rivadeneira, Félix García Carballeira, J. Carretero, Francisco Javier García Blas
Nowadays, there are two main approaches for dealing with data-intensive applications: parallel file systems in classical High-Performance Computing (HPC) centers, and Big Data file systems built around a data-centric vision. Furthermore, there is a growing overlap between HPC and Big Data applications, given that the Big Data paradigm is a growing consumer of HPC resources. HDFS is one of the most important file systems for data-intensive applications, while, from the parallel file systems point of view, MPI-IO is the most widely used interface for parallel I/O. In this paper, we propose a novel solution for taking advantage of HDFS through MPI-based parallel applications. To demonstrate its feasibility, we have integrated our approach into MIMIR, a MapReduce framework for MPI-based applications, and optimized MIMIR with the data locality features our approach provides. The experimental evaluation demonstrates that our solution offers around 25% better performance in the map phase compared with the MIMIR baseline.
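As an illustration of the locality idea, the following minimal Python sketch assigns HDFS blocks to MPI ranks, preferring ranks that run on a replica host. The block metadata, host names, and round-robin fallback are illustrative assumptions, not MIMIR's actual API.

```python
# A minimal sketch of locality-aware block assignment, assuming each HDFS
# block exposes the hostnames of its replicas (hypothetical metadata; the
# paper's actual implementation plugs into MIMIR's MPI-based I/O layer).

def assign_blocks(blocks, rank_hosts):
    """Map each block to an MPI rank, preferring ranks on a replica host."""
    assignment = {}
    next_rr = 0  # round-robin fallback for blocks with no local rank
    for block_id, replica_hosts in blocks.items():
        local_ranks = [r for r, h in enumerate(rank_hosts) if h in replica_hosts]
        if local_ranks:
            # pick the local rank with the fewest blocks assigned so far
            assignment[block_id] = min(
                local_ranks,
                key=lambda r: sum(1 for v in assignment.values() if v == r))
        else:
            assignment[block_id] = next_rr % len(rank_hosts)
            next_rr += 1
    return assignment

blocks = {"blk_0": {"node1", "node2"}, "blk_1": {"node3"}, "blk_2": {"node4"}}
rank_hosts = ["node1", "node2", "node3"]  # rank i runs on rank_hosts[i]
print(assign_blocks(blocks, rank_hosts))  # {'blk_0': 0, 'blk_1': 2, 'blk_2': 0}
```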
Citations: 0
Fair Allocation of Asymmetric Operations in Storage Systems
Thomas Keller, P. Varman
Managing the trade-off between efficiency and fairness in a storage system is challenging due to high variability in workload behavior. Most workloads are made up of a mix of asymmetric operations (e.g., read/write, sequential/random, or striped/isolated I/Os) in different proportions, which places different resource demands on the storage device. The problem is to allocate device resources to the heterogeneous workloads fairly while maintaining high device throughput. In this paper, we present a new model for fair allocation among heterogeneous workloads with different ratios of asymmetric operations. We propose an adaptive scheme that chooses, based on workload characteristics, between two policies: the traditional Time-Balanced Allocation (TBA) and our proposed Bottleneck-Balanced Allocation (BBA). The fairness and throughput of these allocation policies are established through formal analysis. Our algorithms are tested with an adaptive, dynamic scheduler implemented in a simulation testbed, and the results validate the performance benefits of our approach.
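To see why time-balanced allocation alone cannot equalize throughput across asymmetric operation mixes, consider the small Python sketch below; the read/write costs and the TBA formula are illustrative assumptions, not the paper's formal model.

```python
# Illustrative arithmetic only (not the paper's model): two workloads with
# different mixes of cheap reads (1 time unit) and expensive writes (3 units).
READ_COST, WRITE_COST = 1.0, 3.0

def time_per_io(read_frac):
    """Average device time consumed by one I/O of a given read/write mix."""
    return read_frac * READ_COST + (1 - read_frac) * WRITE_COST

def tba_throughputs(read_fracs, device_time=100.0):
    """Time-Balanced Allocation: each workload gets an equal slice of device time."""
    slice_ = device_time / len(read_fracs)
    return [slice_ / time_per_io(f) for f in read_fracs]

# Workload A is read-heavy, workload B is write-heavy: equal device time
# yields very unequal IOPS, which is the asymmetry BBA-style policies target.
print(tba_throughputs([0.9, 0.1]))  # [~41.7, ~17.9]
```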
Citations: 1
SimGQ: Simultaneously Evaluating Iterative Graph Queries
Chengshuo Xu, Abbas Mazloumi, Xiaolin Jiang, Rajiv Gupta
Graph processing frameworks are typically designed to optimize the evaluation of a single graph query. However, in practice, we often need to respond to multiple graph queries, either from different users or from a single user performing a complex analytics task. Therefore, in this paper we develop SimGQ, a system that optimizes the simultaneous evaluation of a group of vertex queries originating at different source vertices (e.g., multiple shortest-path queries with different sources) and delivers substantial speedups over a conventional framework that evaluates and responds to queries one by one. The performance benefits are achieved via batching and sharing. Batching fully utilizes system resources to evaluate a batch of queries and amortizes runtime overheads incurred by fetching vertices and edge lists, synchronizing threads, and maintaining computation frontiers. Sharing dynamically identifies shared queries that substantially represent subcomputations in the evaluation of different queries in a batch, evaluates the shared queries, and then uses their results to accelerate the evaluation of all queries in the batch. With four input power-law graphs and four graph algorithms, SimGQ achieves speedups of up to 45.67× with batch sizes of up to 512 queries over a baseline implementation that evaluates the queries one by one using the state-of-the-art Ligra system. Moreover, both batching and sharing contribute substantially to the speedups.
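A flavor of the batching idea can be given with a multi-source BFS sketch in Python, where one pass over the graph advances the frontiers of several queries at once using per-vertex bitmasks. This is a common batching technique, not SimGQ's exact mechanism, which additionally shares sub-query results across the batch.

```python
# A minimal sketch of batched query evaluation: several BFS queries advance
# in one traversal by tracking, per vertex, a bitmask of the queries that
# have reached it (the multi-source-BFS trick).
from collections import defaultdict

def batched_bfs_levels(adj, sources):
    reached = defaultdict(int)            # vertex -> bitmask of queries
    frontier = {}
    for q, s in enumerate(sources):
        reached[s] |= 1 << q
        frontier[s] = frontier.get(s, 0) | 1 << q
    levels = [dict(frontier)]
    while frontier:
        nxt = defaultdict(int)
        for v, mask in frontier.items():
            for u in adj.get(v, ()):
                new = mask & ~reached[u]  # queries reaching u for the first time
                if new:
                    reached[u] |= new
                    nxt[u] |= new
        frontier = nxt
        if frontier:
            levels.append(dict(frontier))
    return levels

adj = {0: [1, 2], 1: [3], 2: [3], 3: []}
# Two BFS queries (from vertex 0 and vertex 2) evaluated in a single pass.
print(batched_bfs_levels(adj, sources=[0, 2]))
```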
Citations: 5
[Title page]
{"title":"[Title page]","authors":"","doi":"10.1109/hipc50609.2020.00001","DOIUrl":"https://doi.org/10.1109/hipc50609.2020.00001","url":null,"abstract":"","PeriodicalId":375004,"journal":{"name":"2020 IEEE 27th International Conference on High Performance Computing, Data, and Analytics (HiPC)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128891764","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Content-defined Merkle Trees for Efficient Container Delivery
Yuta Nakamura, Raza Ahmad, T. Malik
Containerization simplifies the sharing and deployment of applications when environments change in the software delivery chain. To deploy an application, container delivery methods push and pull container images. These methods operate at file and layer (set of files) granularity and introduce redundant data within a container. Several container operations, such as upgrading, installing, and maintaining, become inefficient because of copying and provisioning of redundant data. In this paper, we reestablish recent results showing that block-level deduplication reduces the size of individual containers, verifying the result using content-defined chunking. Block-level deduplication, however, does not improve the efficiency of push/pull operations, which must determine the specific blocks to transfer. We introduce a content-defined Merkle Tree (CDMT) over deduplicated storage in a container. CDMT indexes deduplicated blocks and determines changes to blocks in logarithmic time on the client. CDMT efficiently pushes and pulls container images from a registry, especially as containers are upgraded and (re-)provisioned on a client. We also describe how a registry can efficiently maintain the CDMT index as new image versions are pushed. We show the scalability of CDMT over Merkle Trees in terms of disk and network I/O savings using 15 container images and 233 image versions from Docker Hub.
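The two ingredients the paper builds on can be sketched in Python: content-defined chunking, where boundaries are chosen by a hash of the content itself (here a simplistic polynomial hash), and a Merkle tree over the chunk digests. The window size, boundary mask, and hash choices are illustrative, not the paper's parameters.

```python
# A minimal sketch: content-defined chunking plus a Merkle tree of chunk
# digests. Because chunk boundaries depend on content, an edit only changes
# the chunks (and tree paths) near it, which is what makes diffing cheap.
import hashlib

def cdc_chunks(data, window=16, mask=0x3F):
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = (h * 31 + b) & 0xFFFFFFFF     # toy rolling-style hash
        if i - start + 1 >= window and (h & mask) == 0:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks

def merkle_root(leaves):
    level = [hashlib.sha256(c).digest() for c in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])       # duplicate last node on odd levels
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0].hex()

data = b"example container layer contents " * 64
print(len(cdc_chunks(data)), merkle_root(cdc_chunks(data))[:16])
```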
Citations: 3
HiPC 2020 ORGANIZATION
{"title":"HiPC 2020 ORGANIZATION","authors":"","doi":"10.1109/hipc50609.2020.00007","DOIUrl":"https://doi.org/10.1109/hipc50609.2020.00007","url":null,"abstract":"","PeriodicalId":375004,"journal":{"name":"2020 IEEE 27th International Conference on High Performance Computing, Data, and Analytics (HiPC)","volume":"35 7","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114108943","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
HyPR: Hybrid Page Ranking on Evolving Graphs
Hemant Kumar Giri, Mridul Haque, D. Banerjee
PageRank (PR) is the standard metric used by the Google search engine to compute the importance of a web page by modeling the entire web as a first-order Markov chain. The challenge of computing PR efficiently and quickly has already been addressed by several prior works, with innovations both in algorithms and in the use of parallel computing. The standard method of computing PR models the web as a graph. The fast-growing internet adds several new web pages every day, and hence more nodes (representing the web pages) and edges (the hyperlinks) are added to this graph in an incremental fashion. Computing PR on this evolving graph is an emerging challenge, since computation from scratch on the massive graph is time-consuming and unscalable. In this work, we propose Hybrid Page Rank (HyPR), which computes PR on evolving graphs using collaborative execution on multi-core CPUs and massively parallel GPUs. We exploit data parallelism by efficiently partitioning the graph into regions that are affected and unaffected by the new updates. The different partitions are then processed in an overlapped manner for PR updates. The novelty of our technique lies in utilizing the hybrid platform to scale the solution to massive graphs. The technique also provides high performance through parallel processing of every batch of updates using a parallel algorithm. HyPR executes efficiently on an NVIDIA V100 GPU hosted on a 6th Gen Intel Xeon CPU and is able to update a graph with 640M edges with a single batch of 100,000 edges in 12 ms. HyPR outperforms other state-of-the-art techniques for computing PR on evolving graphs [1] by 4.8x. Additionally, HyPR provides 1.2x speedup over GPU-only executions and 95x speedup over CPU-only parallel executions.
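One simple way to exploit the evolving-graph setting, sketched below in Python, is to warm-start power iteration from the previous ranks after a batch of edge insertions, which converges in far fewer iterations than starting from scratch. HyPR goes much further (affected/unaffected partitioning, overlapped CPU/GPU execution), so this is only a baseline illustration of the incremental idea.

```python
# A simplified sketch: after a batch of edge insertions, resume power
# iteration from the previous ranks rather than from a uniform vector.
def pagerank(out_edges, ranks=None, d=0.85, iters=100, tol=1e-10):
    nodes = list(out_edges)
    n = len(nodes)
    r = dict(ranks) if ranks else {v: 1.0 / n for v in nodes}
    for v in nodes:
        r.setdefault(v, 1.0 / n)  # newly added vertices start at the base rank
    for _ in range(iters):
        nxt = {v: (1 - d) / n for v in nodes}
        for v in nodes:
            targets = out_edges[v]
            share = d * r[v] / len(targets) if targets else 0.0
            for u in targets:
                nxt[u] += share
        done = sum(abs(nxt[v] - r[v]) for v in nodes) < tol
        r = nxt
        if done:
            break
    return r

g = {0: [1], 1: [2], 2: [0]}
r0 = pagerank(g)              # initial computation from scratch
g[0].append(2)                # batch update: new edge 0 -> 2
r1 = pagerank(g, ranks=r0)    # warm start converges in fewer iterations
print({v: round(x, 4) for v, x in r1.items()})
```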
Citations: 0
Accelerating Force-directed Graph Layout with Processing-in-Memory Architecture
Ruihao Li, Shuang Song, Qinzhe Wu, L. John
In the big data domain, the visualization of graph systems provides users with more intuitive experiences, especially in the fields of social networks, transportation systems, and even medical and biological domains. Processing-in-Memory (PIM) has become a popular choice for deploying emerging applications as a result of its high parallelism and low energy consumption. Furthermore, memory cells of PIM platforms can serve as both compute units and storage units, making PIM solutions able to efficiently support visualizing graphs at different scales. In this paper, we focus on using the PIM platform to accelerate the Force-directed Graph Layout (FdGL) algorithm, one of the most fundamental algorithms in the field of visualization. We fully explore the parallelism inside the FdGL algorithm and integrate an algorithm-level optimization strategy into our PIM system. In addition, we use programmable instruction sets to achieve more flexibility in our PIM system. Our PIM architecture achieves an 8.07× speedup compared with a GPU platform of the same peak throughput. Compared with state-of-the-art CPU and GPU platforms, our PIM system achieves an average of 13.33× and 2.14× performance speedup with 74.51× and 14.30× energy consumption reduction on six real-world graphs.
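For reference, the per-iteration arithmetic of a classic Fruchterman-Reingold-style force-directed layout is sketched below in Python; the all-pairs repulsion loop is the hot spot that massively parallel hardware such as PIM can accelerate. The constants and the fixed step size are illustrative, not the paper's tuned parameters.

```python
# One iteration of a classic force-directed layout: repulsion between all
# vertex pairs, attraction along edges, then a small displacement step.
import math, random

def fdgl_step(pos, edges, k=1.0, step=0.05):
    disp = {v: [0.0, 0.0] for v in pos}
    verts = list(pos)
    for i, v in enumerate(verts):            # repulsive forces: all pairs
        for u in verts[i + 1:]:
            dx, dy = pos[v][0] - pos[u][0], pos[v][1] - pos[u][1]
            dist = math.hypot(dx, dy) or 1e-9
            f = k * k / dist                 # repulsion magnitude
            disp[v][0] += f * dx / dist; disp[v][1] += f * dy / dist
            disp[u][0] -= f * dx / dist; disp[u][1] -= f * dy / dist
    for v, u in edges:                       # attractive forces: along edges
        dx, dy = pos[v][0] - pos[u][0], pos[v][1] - pos[u][1]
        dist = math.hypot(dx, dy) or 1e-9
        f = dist * dist / k                  # attraction magnitude
        disp[v][0] -= f * dx / dist; disp[v][1] -= f * dy / dist
        disp[u][0] += f * dx / dist; disp[u][1] += f * dy / dist
    return {v: (pos[v][0] + step * d[0], pos[v][1] + step * d[1])
            for v, d in disp.items()}

random.seed(0)
pos = {v: (random.random(), random.random()) for v in range(4)}
for _ in range(50):
    pos = fdgl_step(pos, edges=[(0, 1), (1, 2), (2, 3), (3, 0)])
print({v: (round(x, 2), round(y, 2)) for v, (x, y) in pos.items()})
```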
Citations: 4
Batched Small Tensor-Matrix Multiplications on GPUs
Keke Zhai, Tania Banerjee-Mishra, A. Wijayasiri, S. Ranka
We present a fine-tuned library, ZTMM, for batched small tensor-matrix multiplication on GPU architectures. Libraries performing optimized matrix-matrix multiplications involving large matrices are available for many architectures, including GPUs. However, these libraries do not provide optimal performance for applications requiring efficient multiplication of a matrix with a batch of small matrices or tensors. There has been recent interest in developing fine-tuned libraries for batched small matrix-matrix multiplication, but these efforts are limited to square matrices. ZTMM supports both square and rectangular matrices. We experimentally demonstrate that our library has significantly higher performance than the cuBLAS and Magma libraries. We demonstrate our library's use in a spectral-element-based solver called CMT-nek that performs high-fidelity predictive simulations using the compressible Navier-Stokes equations. CMT-nek involves three-dimensional tensors, but it is possible to apply the same techniques to higher-dimensional tensors.
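The core operation can be stated in NumPy for clarity: contract one index of each small 3-D tensor in a batch with a shared, possibly rectangular, matrix, as arises in spectral-element kernels. The shapes below are illustrative assumptions; the library itself targets hand-tuned GPU kernels, not NumPy.

```python
# The batched tensor-matrix contraction, expressed with NumPy:
# V[b, i, j, k] = sum_l D[i, l] * U[b, l, j, k]
import numpy as np

rng = np.random.default_rng(0)
batch, n, m = 4096, 8, 10                   # many small tensors, rectangular D
U = rng.standard_normal((batch, n, n, n))   # batched 3-D tensor inputs
D = rng.standard_normal((m, n))             # shared operator matrix

V = np.einsum('il,bljk->bijk', D, U)

# Reference check against an explicit batched matmul on flattened tensors.
V_ref = (D @ U.reshape(batch, n, n * n)).reshape(batch, m, n, n)
print(V.shape, np.allclose(V, V_ref))       # (4096, 10, 8, 8) True
```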
Citations: 0
Temporal Based Intelligent LRU Cache Construction
Pavan Nittur, Anuradha Kanukotla, Narendra Mutyala
In the Android platform, cache-slots store applications upon their launch and are later used for prefetching. The Least Recently Used (LRU) caching algorithm that governs these cache-slots can fail to keep essential applications in the slots, especially in scenarios such as memory crunches, temporal bursts, or volatile environments. The construction of these cache-slots can be improved by selectively storing user-critical applications before their launch. This reform requires a successful forecast of the user's app-launch pattern using intelligent machine-learning agents, without hindering the smooth execution of parallel processes. In this paper, we propose a sophisticated Temporal based Intelligent Process Management (TIPM) system, which learns to predict a Smart Application List (SAL) based on the usage pattern. Using SAL, we construct intelligent LRU cache-slots that retain essential user applications in memory and provide improved launch rates. Our experimental results from testing TIPM with different users demonstrate a significantly improved cache-hit rate (95%), a 26% gain over the current LRU baseline, making it a valuable enhancement to the platform.
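The cache-slot policy can be sketched as an LRU cache that pins the entries of a predicted Smart Application List (SAL). The minimal Python class below assumes the SAL is already given; the prediction model itself is out of scope here, and the class name and fallback behavior are illustrative.

```python
# A minimal sketch of a SAL-aware LRU cache: plain LRU eviction, except that
# apps on the predicted Smart Application List are skipped as victims.
from collections import OrderedDict

class SALAwareLRU:
    def __init__(self, capacity, sal):
        self.capacity, self.sal = capacity, set(sal)
        self.slots = OrderedDict()        # app -> cached state, in LRU order

    def access(self, app):
        """Return True on a cache hit; insert (and maybe evict) on a miss."""
        if app in self.slots:
            self.slots.move_to_end(app)   # refresh recency on a hit
            return True
        self.slots[app] = object()        # placeholder for prefetched state
        if len(self.slots) > self.capacity:
            victim = next((a for a in self.slots if a not in self.sal), None)
            # evict the LRU non-SAL app; fall back to plain LRU if all pinned
            self.slots.pop(victim if victim is not None else next(iter(self.slots)))
        return False

cache = SALAwareLRU(capacity=3, sal={"maps", "chat"})
for app in ["maps", "chat", "mail", "game", "maps"]:
    print(app, cache.access(app))         # "mail" is evicted, "maps" survives
```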
Citations: 0