Latest Publications: 2019 IEEE International Conference on Networking, Architecture and Storage (NAS)
Load-aware Elastic Data Reduction and Re-computation for Adaptive Mesh Refinement
Pub Date : 2019-08-01 DOI: 10.1109/NAS.2019.8834727
Mengxiao Wang, Huizhang Luo, Qing Liu, Hong Jiang
The increasing performance gap between computation and I/O creates huge data management challenges for simulation-based scientific discovery. Data reduction, among others, is deemed a promising technique to bridge the gap by reducing the amount of data migrated to persistent storage. However, reduction performance is still far from what production applications demand. To this end, we propose a new methodology that aggressively reduces data despite the substantial loss of information, and re-computes the original accuracy on demand. As a result, our scheme creates the illusion of a fast and large storage medium with high-accuracy data available. We further design a load-aware data reduction strategy that monitors the I/O overhead at runtime and dynamically adjusts the reduction ratio. We verify the efficacy of our methodology through adaptive mesh refinement (AMR), a popular numerical technique for solving partial differential equations. We evaluate data reduction and selective data re-computation on Titan, using a real application in FLASH and mini-applications in Chombo. To clearly demonstrate the benefits of re-computation, we compare it with other state-of-the-art data reduction methods, including SZ, ZFP, FPC, and deduplication; it is shown to be superior in reduction ratio and in both write and read speeds, particularly when a small amount of data (e.g., 1%) needs to be retrieved. Our results confirm that data reduction and selective data re-computation can 1) reduce the performance gap between I/O and compute by aggressively reducing AMR levels, and more importantly 2) recover the target accuracy efficiently for AMR through re-computation.
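The load-aware strategy in the abstract can be pictured as a feedback loop: measure the share of runtime spent in I/O, then drop more or fewer fine AMR levels before writing. Below is a minimal sketch of that loop; all class names, thresholds, and the level-dropping policy are illustrative assumptions, not the paper's actual design.

```python
# A minimal sketch of load-aware reduction: monitor the fraction of wall time
# spent in I/O and raise or lower the number of AMR levels dropped before data
# is written out. Names and thresholds are illustrative, not from the paper.

class LoadAwareReducer:
    def __init__(self, max_drop_levels=3, target_io_fraction=0.10):
        self.max_drop_levels = max_drop_levels        # most aggressive reduction allowed
        self.target_io_fraction = target_io_fraction  # acceptable share of time in I/O
        self.drop_levels = 0                          # current reduction setting

    def update(self, io_time, compute_time):
        """Adjust the reduction ratio from the measured I/O overhead."""
        io_fraction = io_time / max(io_time + compute_time, 1e-9)
        if io_fraction > self.target_io_fraction:
            # I/O is the bottleneck: reduce more aggressively (drop finer levels).
            self.drop_levels = min(self.drop_levels + 1, self.max_drop_levels)
        elif io_fraction < 0.5 * self.target_io_fraction:
            # I/O is cheap right now: keep more accuracy on disk.
            self.drop_levels = max(self.drop_levels - 1, 0)
        return self.drop_levels

    def reduce(self, amr_levels):
        """Keep only the coarser AMR levels; finer ones are re-computed on read."""
        keep = len(amr_levels) - self.drop_levels
        return amr_levels[:max(keep, 1)]

reducer = LoadAwareReducer()
levels = ["L0_coarse", "L1", "L2", "L3_fine"]
reducer.update(io_time=4.0, compute_time=16.0)   # 20% I/O -> reduce harder
print(reducer.reduce(levels))                    # ['L0_coarse', 'L1', 'L2']
```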
Citations: 1
Optimizing Tail Latency of LDPC based Flash Memory Storage Systems Via Smart Refresh
Pub Date : 2019-08-01 DOI: 10.1109/NAS.2019.8834728
Yina Lv, Liang Shi, Qiao Li, Congming Gao, C. Xue, E. Sha
Flash memory has been developed with bit density improvement, technology scaling, and 3D stacking. With this trend, its reliability has degraded significantly. Low-density parity-check (LDPC) codes, an error correction code with strong correction capability, have been employed to address this issue. However, one critical issue of LDPC is that it introduces a long decoding latency on devices with low reliability. In such cases, long tail latency arises, which significantly impacts quality of service (QoS). In this work, a set of smart refresh schemes is proposed to optimize the tail latency. The basic idea is to refresh data when the accessed data has a long decoding latency. Two smart refresh schemes are proposed: the first refreshes long-access-latency data once it has been accessed several times, optimizing access performance; the second periodically detects data with extremely long access latency and refreshes it, optimizing tail latency. Experiment results show that the proposed schemes significantly improve tail latency and access performance with little overhead.
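The two policies are simple to state in code: one triggers on repeated reads of slow-decoding data, the other on a periodic sweep. Here is a minimal sketch under assumed latency thresholds and a stubbed refresh() operation; none of these constants or field names come from the paper.

```python
# A minimal sketch of the two refresh policies, assuming a per-block record of
# decode latency and access count. Thresholds and the refresh() hook are
# illustrative placeholders, not the paper's parameters.

LONG_LATENCY_US = 500        # "long" LDPC decode latency (assumed)
EXTREME_LATENCY_US = 2000    # "extremely long" latency for the periodic scan (assumed)
HOT_ACCESS_COUNT = 3         # accesses before scheme 1 triggers (assumed)

def refresh(block):
    """Rewrite the block so subsequent reads decode fast again (stub)."""
    block["latency_us"] = 50
    block["accesses"] = 0

def on_read(block):
    """Scheme 1: refresh frequently read data whose decode latency is long."""
    block["accesses"] += 1
    if block["latency_us"] > LONG_LATENCY_US and block["accesses"] >= HOT_ACCESS_COUNT:
        refresh(block)

def periodic_scan(blocks):
    """Scheme 2: periodically detect and refresh extreme-latency (tail) blocks."""
    for block in blocks:
        if block["latency_us"] > EXTREME_LATENCY_US:
            refresh(block)

blocks = [{"latency_us": 800, "accesses": 0}, {"latency_us": 2500, "accesses": 0}]
for _ in range(3):
    on_read(blocks[0])        # third read of a long-latency block triggers refresh
periodic_scan(blocks)         # the rarely read 2500us block is caught by the scan
print([b["latency_us"] for b in blocks])  # [50, 50]
```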
Citations: 2
NAS 2019 Messages
Pub Date : 2019-08-01 DOI: 10.1109/nas.2019.8834712
{"title":"NAS 2019 Messages","authors":"","doi":"10.1109/nas.2019.8834712","DOIUrl":"https://doi.org/10.1109/nas.2019.8834712","url":null,"abstract":"","PeriodicalId":230796,"journal":{"name":"2019 IEEE International Conference on Networking, Architecture and Storage (NAS)","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133614356","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Thermo-GC: Reducing Write Amplification by Tagging Migrated Pages during Garbage Collection
Pub Date : 2019-08-01 DOI: 10.1109/NAS.2019.8834722
Jing Yang, Shuyi Pei
Flash-memory-based solid-state drives (SSDs) have been deployed in various systems because of their significant advantages over hard disk drives in terms of throughput and IOPS. One inherent operation necessary in an SSD is garbage collection (GC), a procedure that selects an erasure candidate block and moves the valid data on the selected candidate to another block. The performance of an SSD is greatly influenced by GC. While existing studies have made advances in minimizing GC cost, few take advantage of the GC procedure itself. As GC proceeds, valid pages in an erasure candidate block tend to have similar lifetimes, which can be exploited to minimize page movements. In this paper, we introduce Thermo-GC. The idea is to identify data hotness during GC operations and group data with similar lifetimes into the same block. By clustering valid pages based on their hotness, Thermo-GC can minimize valid page movements and reduce GC cost. Experiment results show that Thermo-GC reduces data movement during GC by 78% and the write amplification factor by 29.7% on average, implying extended SSD lifetimes.
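The tagging idea lends itself to a short sketch: every time a valid page survives a GC migration, its tag is incremented, and the tag selects the destination block, so pages with similar lifetimes end up together. The data layout and the two-bucket hot/cold split below are assumptions for illustration, not Thermo-GC's actual grouping.

```python
# A minimal sketch of tagging migrated pages during GC. Pages that keep
# surviving GC are likely cold; segregating them from hot data means future
# GCs of hot blocks move fewer valid pages, cutting write amplification.

from collections import defaultdict

def garbage_collect(victim_block, destinations):
    """Migrate valid pages from the victim, grouping them by migration count."""
    for page in victim_block["pages"]:
        if not page["valid"]:
            continue                      # invalid pages are simply erased
        page["gc_tag"] += 1               # survived one more GC: likely colder
        bucket = "cold" if page["gc_tag"] >= 2 else "hot"   # assumed 2-way split
        destinations[bucket].append(page)

victim = {"pages": [
    {"id": 1, "valid": True,  "gc_tag": 0},   # recently written, still hot
    {"id": 2, "valid": False, "gc_tag": 0},   # overwritten: no migration needed
    {"id": 3, "valid": True,  "gc_tag": 2},   # survived two GCs already: cold
]}
destinations = defaultdict(list)
garbage_collect(victim, destinations)
print({k: [p["id"] for p in v] for k, v in destinations.items()})
# {'hot': [1], 'cold': [3]}
```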
Citations: 3
HCMA: Supporting High Concurrency of Memory Accesses with Scratchpad Memory in FPGAs
Pub Date : 2019-08-01 DOI: 10.1109/NAS.2019.8834726
Yangyang Zhao, Yuhang Liu, Wei Li, Mingyu Chen
Currently, much research focuses on new methods of accelerating memory accesses between the memory controller and memory modules. However, the absence of an accelerator for memory accesses between the CPU and the memory controller wastes the performance benefits of these new methods. Therefore, we propose a coordinated batch method to support high concurrency of memory accesses (HCMA). Compared to the conventional method of holding outstanding memory access requests in miss status handling registers (MSHRs), HCMA takes advantage of scratchpad memory in FPGAs or SoCs to circumvent the limit on MSHR entries. The concurrency of requests is limited only by the capacity of the scratchpad memory. Moreover, to avoid higher latency when searching more entries, we design an efficient coordinating mechanism based on circular queues. We evaluate the performance of HCMA on an MP-SoC FPGA platform. Compared to conventional MSHR-based methods, HCMA supports ten times as many concurrent memory accesses (from 10 to 128 entries on our evaluation platform). HCMA achieves up to 2.72× memory bandwidth utilization for applications that access memory with massive fine-grained random requests, and up to 3.46× for stream-based memory accesses. For real applications such as CG, our method improves speedup performance by 29.87%.
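The circular-queue mechanism can be sketched as a bounded ring of outstanding requests whose capacity is set by the scratchpad region rather than by a handful of MSHR entries. The sketch below assumes in-order completion for brevity; the request fields and back-pressure behavior are illustrative, with the 128-entry capacity taken from the evaluation platform above.

```python
# A minimal sketch of tracking outstanding requests in a circular queue backed
# by a fixed region (standing in for the FPGA scratchpad), instead of a small
# fixed set of MSHR entries.

class OutstandingRequestQueue:
    def __init__(self, capacity=128):          # bounded only by scratchpad size
        self.buf = [None] * capacity
        self.capacity = capacity
        self.head = 0                           # next request to retire
        self.tail = 0                           # next free slot

    def full(self):
        return self.tail - self.head == self.capacity

    def issue(self, addr):
        """Enqueue a new outstanding access; fails only when the queue is full."""
        if self.full():
            return False                        # back-pressure to the issue logic
        self.buf[self.tail % self.capacity] = addr
        self.tail += 1
        return True

    def retire(self):
        """Dequeue the oldest completed access (in-order completion assumed)."""
        if self.head == self.tail:
            return None
        addr = self.buf[self.head % self.capacity]
        self.head += 1
        return addr

q = OutstandingRequestQueue()
issued = sum(q.issue(0x1000 + 64 * i) for i in range(200))
print(issued)          # 128 concurrent requests accepted, 72 back-pressured
print(hex(q.retire())) # 0x1000
```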
Citations: 0
Ares: A Scalable High-Performance Passive Measurement Tool Using a Multicore System
Pub Date : 2019-08-01 DOI: 10.1109/NAS.2019.8834734
Xiaoban Wu, Yan Luo, Jeronimo Bezerra, Liang-Min Wang
Network measurement tools must support the collection of fine-grained flow statistics and scale well to increasing line rates. However, conventional network measurement software tools are inadequate for high-speed networks at current scale. In this paper, we present Ares, a scalable, high-performance passive network measurement tool that collects accurate per-flow metrics. Ares is built on a multicore platform and consists of an effective hierarchical core assignment strategy, an efficient hash table for keeping flow statistics, a novel lockless flow statistics management scheme, and cache-friendly prefetching. Our extensive performance evaluation shows that Ares brings about a 19x speedup for 64-byte packets over existing approaches and can sustain a line rate of up to 100Gbps while delivering the same level of fine-grained flow metrics.
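One common way to make per-flow accounting lockless on a multicore system is to give each core a private table keyed by the 5-tuple and merge the tables off the fast path; the sketch below illustrates that general pattern, not Ares's actual data structures. The tuple layout and merge step are assumptions.

```python
# A minimal sketch of lockless per-flow accounting: each core updates a
# core-private hash table keyed by the 5-tuple (no locks on the fast path);
# a slow-path merge folds the private tables into one global view.

def flow_key(pkt):
    """5-tuple key: (src IP, dst IP, src port, dst port, protocol)."""
    return (pkt["src"], pkt["dst"], pkt["sport"], pkt["dport"], pkt["proto"])

def fast_path(packets, stats):
    """Per-core loop: update the core-private table, no synchronization."""
    for pkt in packets:
        k = flow_key(pkt)
        b, n = stats.get(k, (0, 0))
        stats[k] = (b + pkt["len"], n + 1)      # bytes and packets per flow

def merge(per_core_tables):
    """Slow path: fold all core-private tables into one global view."""
    total = {}
    for table in per_core_tables:
        for k, (b, n) in table.items():
            tb, tn = total.get(k, (0, 0))
            total[k] = (tb + b, tn + n)
    return total

core0, core1 = {}, {}
pkt = {"src": "10.0.0.1", "dst": "10.0.0.2", "sport": 1234, "dport": 80,
       "proto": 6, "len": 64}
fast_path([pkt, pkt], core0)
fast_path([pkt], core1)
print(merge([core0, core1]))   # the flow shows (192, 3): 192 bytes, 3 packets
```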
Citations: 3
Learning Workflow Scheduling on Multi-Resource Clusters
Pub Date : 2019-08-01 DOI: 10.1109/NAS.2019.8834720
Yang Hu, C. D. Laat, Zhiming Zhao
Workflow scheduling is one of the key issues in the management of workflow execution. Typically, a workflow application can be modeled as a directed acyclic graph (DAG). In this paper, we present GoDAG, an approach that can learn to schedule workflows well on multi-resource clusters. GoDAG learns the scheduling policy directly from experience through deep reinforcement learning. To adapt deep reinforcement learning methods, we propose a novel state representation, a practical action space, and a corresponding reward definition for the workflow scheduling problem. We implement a GoDAG prototype and a simulator to simulate tasks running on multi-resource clusters. In the evaluation, we compare GoDAG with three state-of-the-art heuristics. The results show that GoDAG outperforms the baseline heuristics, leading to lower average makespan across different workflow structures.
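An RL scheduler of this kind needs three ingredients: a state vector (remaining multi-resource capacity plus the demands of ready DAG tasks), an action (which ready task to place, or wait), and a reward that penalizes elapsed time so that minimizing return minimizes makespan. The sketch below illustrates one plausible encoding; the paper's actual state representation, action space, and reward definition differ, and everything here is an assumption.

```python
# A minimal sketch of state and reward for DAG scheduling with RL.
# dag: task -> set of predecessor tasks; demands: task -> (cpu, mem).

def ready_tasks(dag, done):
    """Tasks whose predecessors have all completed."""
    return [t for t, deps in dag.items() if t not in done and deps <= done]

def state(capacity, dag, done, demands, k=2):
    """Fixed-size state: free capacity + demands of up to k ready tasks."""
    ready = ready_tasks(dag, done)[:k]
    vec = list(capacity)
    for t in ready:
        vec.extend(demands[t])
    vec.extend([0, 0] * (k - len(ready)))      # zero-pad missing slots
    return vec, ready

def step_reward(running):
    """-1 per unfinished task per timestep: pushes the policy to finish early."""
    return -len(running)

dag = {"a": set(), "b": set(), "c": {"a", "b"}}
demands = {"a": (2, 4), "b": (1, 2), "c": (3, 1)}
vec, ready = state(capacity=(4, 8), dag=dag, done=set(), demands=demands)
print(ready)   # ['a', 'b'] are schedulable; 'c' waits on both
print(vec)     # [4, 8, 2, 4, 1, 2]
```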
Citations: 9
Contention Aware Workload and Resource Co-Scheduling on Power-Bounded Systems
Pub Date : 2019-08-01 DOI: 10.1109/NAS.2019.8834721
Pengfei Zou, Xizhou Feng, Rong Ge
As power becomes a top challenge in HPC systems and data centers, how to sustain system performance growth under limited available or permissible power becomes an important research topic. Traditionally, researchers have explored collocating non-interfering jobs on the same nodes to improve system performance. Nevertheless, power limits reduce the capacity of components, nodes, and systems, and induce or aggravate contention between jobs. Using prior power-oblivious job collocation strategies on power-limited systems can adversely degrade system throughput. In this paper, we quantitatively estimate the contention induced by power limits and propose Contention-Aware Power-bounded Scheduling (CAPS) for systems with finite power budgets. CAPS chooses to collocate jobs that are complementary when power is limited, and distributes the available power to nodes and components to minimize their interference. Experimental results show that, depending on the available power, CAPS improves system throughput and power efficiency by 10% or more over power-oblivious job collocation strategies for hybrid MPI/OpenMP benchmarks on a 192-core, 8-node cluster.
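The core selection step can be pictured as: prefer pairing a compute-bound job with a memory-bound one (complementary resource pressure), and only collocate when the pair fits under the node's power budget. The sketch below illustrates that idea with made-up job profiles, an assumed contention score, and an assumed power cap; it is not the CAPS model itself.

```python
# A minimal sketch of contention-aware collocation under a node power budget.
# Per-job profile: cpu/mem intensity in [0,1] and estimated power in watts
# (all values illustrative).

from itertools import combinations

jobs = {
    "lu": {"cpu": 0.9, "mem": 0.2, "watts": 140},   # compute-bound
    "cg": {"cpu": 0.3, "mem": 0.9, "watts": 120},   # memory-bound
    "ep": {"cpu": 0.8, "mem": 0.1, "watts": 160},   # compute-bound, power-hungry
}

def contention(a, b):
    """Overlap in resource pressure: lower means more complementary."""
    return min(a["cpu"], b["cpu"]) + min(a["mem"], b["mem"])

def best_pair(jobs, node_power_budget):
    """Pick the most complementary pair that fits under the node power cap."""
    candidates = [
        (contention(jobs[x], jobs[y]), x, y)
        for x, y in combinations(jobs, 2)
        if jobs[x]["watts"] + jobs[y]["watts"] <= node_power_budget
    ]
    return min(candidates, default=None)

print(best_pair(jobs, node_power_budget=270))
# (0.5, 'lu', 'cg'): compute-bound lu pairs with memory-bound cg within budget;
# pairs involving ep exceed the 270W cap and are rejected.
```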
Citations: 3
Leveraging Array Mapped Tries in KSM for Lightweight Memory Deduplication
Pub Date : 2019-08-01 DOI: 10.1109/NAS.2019.8834730
Lingjing You, Yongkun Li, Fan Guo, Yinlong Xu, Jinzhong Chen, Liu Yuan
In cloud computing, how to use limited hardware resources to meet increasing demands has become a major issue. KSM (Kernel Same-page Merging) is a content-based page sharing mechanism used in Linux that merges identical memory pages, thereby significantly reducing memory usage and increasing the density of virtual machines or containers. However, KSM introduces a large overhead in CPU and memory bandwidth usage due to its use of red-black trees and content-based page comparison. To reduce the deduplication overhead, in this paper we propose a new design called AMT-KSM, which leverages array mapped tries to realize lightweight memory deduplication. The basic idea is to divide each memory page into multiple segments and use the concatenated string of the segments' hash values as the index key in the tries. By doing this, we can significantly reduce the time required for searching duplicate pages as well as the number of page comparisons. We conduct experiments to evaluate the performance of our design, and the results show that, compared with conventional KSM, AMT-KSM can reduce CPU usage by up to 44.9% and memory bandwidth usage by up to 31.6%.
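The indexing scheme described in the abstract is concrete enough to sketch: split a page into segments, hash each segment, and concatenate the hashes into the lookup key, so duplicate candidates are found by key lookup rather than by tree walks with full-page comparisons at every node. In the sketch below a plain dict stands in for the array mapped trie, and the segment count and hash function are illustrative choices.

```python
# A minimal sketch of keying pages by concatenated per-segment hashes.
# Only pages whose keys collide need a final byte-wise comparison.

import hashlib

PAGE_SIZE = 4096
SEGMENTS = 4   # assumed segment count

def page_key(page: bytes) -> str:
    """Concatenated per-segment hashes: the key used to index candidate pages."""
    seg_len = PAGE_SIZE // SEGMENTS
    return "".join(
        hashlib.sha1(page[i * seg_len:(i + 1) * seg_len]).hexdigest()
        for i in range(SEGMENTS)
    )

def try_merge(page: bytes, index: dict) -> bool:
    """Merge if a page with the same key and identical content already exists."""
    key = page_key(page)
    existing = index.get(key)
    if existing is not None and existing == page:   # final byte-wise check
        return True                                  # share the existing copy
    index[key] = page
    return False

index = {}
a = b"x" * PAGE_SIZE
b = b"x" * PAGE_SIZE
c = b"y" * PAGE_SIZE
print(try_merge(a, index), try_merge(b, index), try_merge(c, index))
# False True False: the second identical page is deduplicated
```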
Citations: 5
NAS 2019 Keynotes
Pub Date : 2019-08-01 DOI: 10.1109/nas.2019.8834717
{"title":"NAS 2019 Keynotes","authors":"","doi":"10.1109/nas.2019.8834717","DOIUrl":"https://doi.org/10.1109/nas.2019.8834717","url":null,"abstract":"","PeriodicalId":230796,"journal":{"name":"2019 IEEE International Conference on Networking, Architecture and Storage (NAS)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133031125","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0