
Proceedings of the 40th Annual International Symposium on Computer Architecture: Latest Publications

Triggered instructions: a control paradigm for spatially-programmed architectures
Pub Date: 2013-06-23 | DOI: 10.1145/2485922.2485935
A. Parashar, Michael Pellauer, Michael Adler, Bushra Ahsan, N. Crago, Daniel Lustig, Vladimir Pavlov, Antonia Zhai, M. Gambhir, A. Jaleel, R. Allmon, Rachid Rayess, S. Maresh, J. Emer
In this paper, we present triggered instructions, a novel control paradigm for arrays of processing elements (PEs) aimed at exploiting spatial parallelism. Triggered instructions completely eliminate the program counter and allow programs to transition concisely between states without explicit branch instructions. They also allow efficient reactivity to inter-PE communication traffic. The approach provides a unified mechanism to avoid over-serialized execution, essentially achieving the effect of techniques such as dynamic instruction reordering and multithreading, which each require distinct hardware mechanisms in a traditional sequential architecture. Our analysis shows that a triggered-instruction based spatial accelerator can achieve 8X greater area-normalized performance than a traditional general-purpose processor. Further analysis shows that triggered control reduces the number of static and dynamic instructions in the critical paths by 62% and 64% respectively over a program-counter style spatial baseline, resulting in a speedup of 2.0X.
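
To make the control paradigm concrete, here is a minimal Python sketch of a PE whose scheduler fires whichever instruction's trigger predicate currently holds over the predicate registers and input-channel state, with no program counter. All names and the three-instruction program are invented for illustration; the paper's actual ISA and microarchitecture differ.

```python
# Minimal sketch of a triggered-instruction PE: each instruction fires when its
# trigger predicate over predicate registers and input-channel state is true.
# All names (Instruction, PE, ...) are illustrative, not the paper's ISA.

class Instruction:
    def __init__(self, name, trigger, action):
        self.name = name
        self.trigger = trigger   # predicate over PE state -> bool
        self.action = action     # mutates PE state when fired

class PE:
    def __init__(self, instructions):
        self.instructions = instructions
        self.preds = {"got_a": False, "got_b": False}
        self.in_q = [3, 5]       # pretend inter-PE input channel
        self.regs = {}
        self.out = []

    def step(self):
        # No program counter: fire any instruction whose trigger holds.
        for inst in self.instructions:
            if inst.trigger(self):
                inst.action(self)
                return inst.name
        return None              # nothing ready (e.g., waiting on a channel)

prog = [
    Instruction("recv_a",
                lambda pe: not pe.preds["got_a"] and pe.in_q,
                lambda pe: (pe.regs.__setitem__("a", pe.in_q.pop(0)),
                            pe.preds.__setitem__("got_a", True))),
    Instruction("recv_b",
                lambda pe: pe.preds["got_a"] and not pe.preds["got_b"] and pe.in_q,
                lambda pe: (pe.regs.__setitem__("b", pe.in_q.pop(0)),
                            pe.preds.__setitem__("got_b", True))),
    Instruction("add_and_send",
                lambda pe: pe.preds["got_a"] and pe.preds["got_b"],
                lambda pe: (pe.out.append(pe.regs["a"] + pe.regs["b"]),
                            pe.preds.update(got_a=False, got_b=False))),
]

pe = PE(prog)
while pe.step():
    pass
print(pe.out)  # [8]
```

Note how state transitions are encoded entirely in the predicates, so the PE reacts to channel occupancy without explicit branches.
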
Citations: 114
Protozoa: adaptive granularity cache coherence
Pub Date: 2013-06-23 | DOI: 10.1145/2485922.2485969
Hongzhou Zhao, Arrvindh Shriraman, Snehasish Kumar, S. Dwarkadas
State-of-the-art multiprocessor cache hierarchies propagate the use of a fixed granularity in the cache organization to the design of the coherence protocol. Unfortunately, the fixed granularity, generally chosen to match average spatial locality across a range of applications, not only results in wasted bandwidth to serve an individual thread's access needs, but also results in unnecessary coherence traffic for shared data. The additional bandwidth has a direct impact on both the scalability of parallel applications and overall energy consumption. In this paper, we present the design of Protozoa, a family of coherence protocols that eliminate unnecessary coherence traffic and match data movement to an application's spatial locality. Protozoa continues to maintain metadata at a conventional fixed cache line granularity while 1) supporting variable read and write caching granularity so that data transfer matches application spatial granularity, 2) invalidating at the granularity of the write miss request so that readers of disjoint data can co-exist with writers, and 3) potentially supporting multiple non-overlapping writers within the cache line, thereby avoiding the traditional ping-pong effect of both read-write and write-write false sharing. Our evaluation demonstrates that Protozoa consistently reduces miss rate and improves the fraction of transmitted data that is actually utilized.
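
The following toy sketch, with invented names, illustrates the key mechanism on a single 8-word line: the directory records which words each sharer actually cached, and a write invalidates only sharers whose words overlap the written range, so disjoint readers and writers coexist.

```python
# Illustrative sketch of adaptive-granularity coherence on one 8-word cache
# line: sharers record the words they actually cached, and a write invalidates
# only the sharers whose words overlap the written range.

LINE_WORDS = 8

class Directory:
    def __init__(self):
        self.sharers = {}   # core -> set of word indices cached from this line

    def read(self, core, words):
        # Reader fetches only the words it needs (variable granularity).
        self.sharers.setdefault(core, set()).update(words)

    def write(self, core, words):
        invalidated = []
        for other, held in list(self.sharers.items()):
            if other != core and held & set(words):
                invalidated.append(other)   # true sharing: must invalidate
                del self.sharers[other]
            # disjoint sharers keep their words: no false-sharing ping-pong
        self.sharers.setdefault(core, set()).update(words)
        return invalidated

d = Directory()
d.read(0, {0, 1})        # core 0 reads words 0-1
d.read(1, {6, 7})        # core 1 reads words 6-7 of the same line
print(d.write(2, {6}))   # -> [1]: only the overlapping sharer is invalidated
print(d.write(3, {2}))   # -> []: disjoint write, cores 0 and 2 keep their words
```
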
Citations: 32
A hardware evaluation of cache partitioning to improve utilization and energy-efficiency while preserving responsiveness
Pub Date: 2013-06-23 | DOI: 10.1145/2485922.2485949
Henry Cook, Miquel Moretó, Sarah Bird, Khanh Dao, D. Patterson, K. Asanović
Computing workloads often contain a mix of interactive, latency-sensitive foreground applications and recurring background computations. To guarantee responsiveness, interactive and batch applications are often run on disjoint sets of resources, but this incurs additional energy, power, and capital costs. In this paper, we evaluate the potential of hardware cache partitioning mechanisms and policies to improve efficiency by allowing background applications to run simultaneously with interactive foreground applications, while avoiding degradation in interactive responsiveness. We evaluate these tradeoffs using commercial x86 multicore hardware that supports cache partitioning, and find that real hardware measurements with full applications provide different observations than past simulation-based evaluations. Co-scheduling applications without LLC partitioning leads to a 10% energy improvement and average throughput improvement of 54% compared to running tasks separately, but can result in foreground performance degradation of up to 34% with an average of 6%. With optimal static LLC partitioning, the average energy improvement increases to 12% and the average throughput improvement to 60%, while the worst case slowdown is reduced noticeably to 7% with an average slowdown of only 2%. We also evaluate a practical low-overhead dynamic algorithm to control partition sizes, and are able to realize the potential performance guarantees of the optimal static approach, while increasing background throughput by an additional 19%.
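
A hedged sketch of a dynamic way-partitioning policy in the spirit the paper evaluates: shrink the background partition when the foreground misses its latency target, grow it when there is slack. The controller and all numbers are invented for illustration, not the paper's algorithm.

```python
# Toy dynamic LLC way-partitioning controller (invented, illustrative):
# protect foreground responsiveness first, then reclaim ways for throughput.

TOTAL_WAYS = 16
MIN_FG_WAYS = 2

def adjust(bg_ways, fg_latency, fg_target):
    if fg_latency > fg_target and bg_ways > 1:
        return bg_ways - 1      # foreground is suffering: shrink background
    if fg_latency < 0.9 * fg_target and bg_ways < TOTAL_WAYS - MIN_FG_WAYS:
        return bg_ways + 1      # slack available: grow background throughput
    return bg_ways

bg = 8
for latency in [1.2, 1.1, 0.95, 0.7, 0.7, 1.3]:  # foreground latency / target
    bg = adjust(bg, fg_latency=latency, fg_target=1.0)
    print(f"observed {latency:.2f}x target -> background gets {bg} ways")
```
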
Citations: 135
On the feasibility of online malware detection with performance counters
Pub Date: 2013-06-23 | DOI: 10.1145/2485922.2485970
J. Demme, Matthew Maycock, J. Schmitz, Adrian Tang, A. Waksman, S. Sethumadhavan, S. Stolfo
The proliferation of computers in any domain is followed by the proliferation of malware in that domain. Systems, including the latest mobile platforms, are laden with viruses, rootkits, spyware, adware and other classes of malware. Despite the existence of anti-virus software, malware threats persist and are growing as there exist a myriad of ways to subvert anti-virus (AV) software. In fact, attackers today exploit bugs in the AV software to break into systems. In this paper, we examine the feasibility of building a malware detector in hardware using existing performance counters. We find that data from performance counters can be used to identify malware and that our detection techniques are robust to minor variations in malware programs. As a result, after examining a small set of variations within a family of malware on Android ARM and Intel Linux platforms, we can detect many variations within that family. Further, our proposed hardware modifications allow the malware detector to run securely beneath the system software, thus setting the stage for AV implementations that are simpler and less buggy than software AV. Combined, the robustness and security of hardware AV techniques have the potential to advance state-of-the-art online malware detection.
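
The core idea can be sketched as treating a sampled vector of performance-counter values as a feature vector and classifying it against labeled samples. The data and the 1-NN classifier below are invented placeholders; the paper evaluates richer classifiers on real counter traces.

```python
# Illustrative 1-NN classification of performance-counter vectors.
# Counter values and labels are fabricated for the sketch.

import math

# (branch_misses, icache_misses, loads, stores) per sampling interval
training = [
    ((120, 30, 900, 400), "benign"),
    ((115, 28, 880, 390), "benign"),
    ((300, 90, 500, 800), "malware"),
    ((310, 95, 520, 790), "malware"),
]

def classify(sample):
    # Nearest neighbor by Euclidean distance (math.dist, Python 3.8+).
    nearest = min(training, key=lambda t: math.dist(t[0], sample))
    return nearest[1]

print(classify((118, 29, 905, 410)))   # -> benign
print(classify((290, 88, 510, 805)))   # -> malware
```
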
Citations: 409
Thin servers with smart pipes: designing SoC accelerators for memcached
Pub Date: 2013-06-23 | DOI: 10.1145/2485922.2485926
Kevin T. Lim, David Meisner, A. Saidi, Parthasarathy Ranganathan, T. Wenisch
Distributed in-memory key-value stores, such as memcached, are central to the scalability of modern internet services. Current deployments use commodity servers with high-end processors. However, given the cost-sensitivity of internet services and the recent proliferation of volume low-power System-on-Chip (SoC) designs, we see an opportunity for alternative architectures. We undertake a detailed characterization of memcached to reveal performance and power inefficiencies. Our study considers both high-performance and low-power CPUs and NICs across a variety of carefully-designed benchmarks that exercise the range of memcached behavior. We discover that, regardless of CPU microarchitecture, memcached execution is remarkably inefficient, saturating neither network links nor available memory bandwidth. Instead, we find performance is typically limited by the per-packet processing overheads in the NIC and OS kernel---long code paths limit CPU performance due to poor branch predictability and instruction fetch bottlenecks. Our insights suggest that neither high-performance nor low-power cores provide a satisfactory power-performance trade-off, and point to a need for tighter integration of the network interface. Hence, we argue for an alternate architecture---Thin Servers with Smart Pipes (TSSP)---for cost-effective high-performance memcached deployment. TSSP couples an embedded-class low-power core to a memcached accelerator that can process GET requests entirely in hardware, offloading both network handling and data lookup. We demonstrate the potential benefits of our TSSP architecture through an FPGA prototyping platform, and show the potential for a 6X-16X power-performance improvement over conventional server baselines.
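
The division of labor TSSP argues for can be sketched in software: GET requests are parsed and served entirely on the fast path, and everything else falls back to the general-purpose core. The request format below is the standard memcached ASCII protocol; the dispatch split and in-memory store are illustrative, not the paper's hardware.

```python
# Sketch of the TSSP-style fast/slow path split for memcached requests.
# `store` stands in for the accelerator-accessible hash table.

store = {b"user:42": b"alice"}

def smart_pipe(packet: bytes) -> bytes:
    # Fast path: fully handle "get <key>\r\n" in the accelerator.
    if packet.startswith(b"get "):
        key = packet[4:].rstrip(b"\r\n")
        value = store.get(key)
        if value is None:
            return b"END\r\n"
        return b"VALUE %s 0 %d\r\n%s\r\nEND\r\n" % (key, len(value), value)
    return slow_path(packet)

def slow_path(packet: bytes) -> bytes:
    # SET, DELETE, stats, ... would go to the wimpy core's software stack.
    return b"SERVER_ERROR software path not modeled\r\n"

print(smart_pipe(b"get user:42\r\n").decode(), end="")
```
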
Citations: 197
An experimental study of data retention behavior in modern DRAM devices: implications for retention time profiling mechanisms
Pub Date: 2013-06-23 | DOI: 10.1145/2485922.2485928
Jamie Liu, Ben Jaiyen, Yoongu Kim, C. Wilkerson, O. Mutlu
DRAM cells store data in the form of charge on a capacitor. This charge leaks off over time, eventually causing data to be lost. To prevent this data loss from occurring, DRAM cells must be periodically refreshed. Unfortunately, DRAM refresh operations waste energy and also degrade system performance by interfering with memory requests. These problems are expected to worsen as DRAM density increases. The amount of time that a DRAM cell can safely retain data without being refreshed is called the cell's retention time. In current systems, all DRAM cells are refreshed at the rate required to guarantee the integrity of the cell with the shortest retention time, resulting in unnecessary refreshes for cells with longer retention times. Prior work has proposed to reduce unnecessary refreshes by exploiting differences in retention time among DRAM cells; however, such mechanisms require knowledge of each cell's retention time. In this paper, we present a comprehensive quantitative study of retention behavior in modern DRAMs. Using a temperature-controlled FPGA-based testing platform, we collect retention time information from 248 commodity DDR3 DRAM chips from five major DRAM vendors. We observe two significant phenomena: data pattern dependence, where the retention time of each DRAM cell is significantly affected by the data stored in other DRAM cells, and variable retention time, where the retention time of some DRAM cells changes unpredictably over time. We discuss possible physical explanations for these phenomena, how their magnitude may be affected by DRAM technology scaling, and their ramifications for DRAM retention time profiling mechanisms.
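
A toy model (all numbers invented) of why the two measured phenomena break naive retention profiling: data pattern dependence (DPD) makes a benign test pattern overestimate a cell's retention, and variable retention time (VRT) lets a cell switch to a leakier state after profiling has finished.

```python
# Invented toy model of DPD and VRT effects on one weak DRAM cell.

import random
random.seed(1)

def observed_retention_ms(base_ms, vrt=False, worst_pattern=False):
    t = base_ms
    if not worst_pattern:
        t *= 1.5     # DPD: friendly neighboring data inflates the measurement
    if vrt and random.random() < 0.3:
        t *= 0.4     # VRT: cell randomly enters a high-leakage state
    return t

base = 100.0                                   # true worst-case retention (ms)
profiled = observed_retention_ms(base)         # one-shot profiling pass
refresh_interval = profiled * 0.5              # naive 2x guard band

failures = sum(
    observed_retention_ms(base, vrt=True, worst_pattern=True) < refresh_interval
    for _ in range(10_000)
)
print(f"profiled {profiled:.0f} ms, refreshing every {refresh_interval:.0f} ms, "
      f"{failures}/10000 intervals would still lose data")
```

Even with a 2x guard band over the profiled value, the VRT transitions keep causing failures, which is the paper's argument that one-shot profiling is insufficient.
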
Citations: 317
GPUWattch: enabling energy optimizations in GPGPUs
Pub Date: 2013-06-23 | DOI: 10.1145/2485922.2485964
Jingwen Leng, Tayler H. Hetherington, Ahmed Eltantawy, S. Gilani, N. Kim, Tor M. Aamodt, V. Reddi
General-purpose GPUs (GPGPUs) are becoming prevalent in mainstream computing, and performance per watt has emerged as a more crucial evaluation metric than peak performance. As such, GPU architects require robust tools that will enable them to quickly explore new ways to optimize GPGPUs for energy efficiency. We propose a new GPGPU power model that is configurable, capable of cycle-level calculations, and carefully validated against real hardware measurements. To achieve configurability, we use a bottom-up methodology and abstract parameters from the microarchitectural components as the model's inputs. We developed a rigorous suite of 80 microbenchmarks that we use to bound any modeling uncertainties and inaccuracies. The power model is comprehensively validated against measurements of two commercially available GPUs, and the measured error is within 9.9% and 13.4% for the two target GPUs (GTX 480 and Quadro FX5600). The model also accurately tracks the power consumption trend over time. We integrated the power model with the cycle-level simulator GPGPU-Sim and demonstrate the energy savings by utilizing dynamic voltage and frequency scaling (DVFS) and clock gating. Traditional DVFS reduces GPU energy consumption by 14.4% by leveraging within-kernel runtime variations. More finer-grained SM cluster-level DVFS improves the energy savings from 6.6% to 13.6% for those benchmarks that show clustered execution behavior. We also show that clock gating inactive lanes during divergence reduces dynamic power by 11.2%.
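
The bottom-up structure of such a power model is simple to sketch: dynamic power is per-event energy times activity counts from a cycle-level simulator, plus static leakage. The component energies below are invented placeholders, not GPUWattch's calibrated values.

```python
# Minimal sketch of a bottom-up GPU power model: P = leakage + dynamic,
# where dynamic energy is per-access energy times simulator activity counts.

PER_ACCESS_ENERGY_NJ = {    # energy per event, nanojoules (illustrative)
    "alu_op":       0.05,
    "regfile_read": 0.02,
    "shared_mem":   0.08,
    "dram_access":  2.50,
}
LEAKAGE_W = 25.0

def power_watts(activity_counts, elapsed_s):
    dynamic_nj = sum(PER_ACCESS_ENERGY_NJ[ev] * n
                     for ev, n in activity_counts.items())
    return LEAKAGE_W + dynamic_nj * 1e-9 / elapsed_s

counts = {"alu_op": 4e9, "regfile_read": 9e9,
          "shared_mem": 1e9, "dram_access": 2e8}
print(f"{power_watts(counts, elapsed_s=0.01):.1f} W")   # ~121 W
```
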
Citations: 551
Understanding and mitigating refresh overheads in high-density DDR4 DRAM systems
Pub Date: 2013-06-23 | DOI: 10.1145/2485922.2485927
Janani Mukundan, H. Hunter, Kyu-hyoun Kim, Jeffrey Stuecheli, José F. Martínez
Recent DRAM specifications exhibit increasing refresh latencies. A refresh command blocks a full rank, decreasing available parallelism in the memory subsystem significantly, thus decreasing performance. Fine Granularity Refresh (FGR) is a feature recently announced as part of JEDEC's DDR4 DRAM specification that attempts to tackle this problem by creating a range of refresh options that provide a trade-off between refresh latency and frequency. In this paper, we first conduct an analysis of DDR4 DRAM's FGR feature, and show that there is no one-size-fits-all option across a variety of applications. We then present Adaptive Refresh (AR), a simple yet effective mechanism that dynamically chooses the best FGR mode for each application and phase within the application. When looking at the refresh problem more closely, we identify in high-density DRAM systems a phenomenon that we call command queue seizure, whereby the memory controller's command queue seizes up temporarily because it is full with commands to a rank that is being refreshed. To attack this problem, we propose two complementary mechanisms called Delayed Command Expansion (DCE) and Preemptive Command Drain (PCD). Our results show that AR does exploit DDR4's FGR effectively. However, once our proposed DCE and PCD mechanisms are added, DDR4's FGR becomes redundant in most cases, except in a few highly memory-sensitive applications, where the use of AR does provide some additional benefit. In all, our simulations show that the proposed mechanisms yield 8% (14%) mean speedup with respect to traditional refresh, at normal (extended) DRAM operating temperatures, for a set of diverse parallel applications.
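
The FGR trade-off is easy to see numerically: finer modes refresh more often (tREFI divided by 2 or 4) with a shorter per-refresh blocking time (tRFC), trading total refresh overhead against worst-case latency behind a refresh. The timings below are representative 8 Gb DDR4 values; consult a real part's datasheet before relying on them.

```python
# Rank-unavailable fraction and worst-case stall for DDR4 FGR modes.
# Timings are representative 8 Gb DDR4 values (illustrative, not a spec quote).

MODES = {            # mode: (tREFI_ns, tRFC_ns)
    "1x": (7800,     350),
    "2x": (7800 / 2, 260),
    "4x": (7800 / 4, 160),
}

for mode, (trefi, trfc) in MODES.items():
    busy = trfc / trefi            # fraction of time the rank is blocked
    print(f"FGR {mode}: rank busy refreshing {busy:.1%}, "
          f"worst stall behind a refresh {trfc} ns")
```

Finer modes spend more total time refreshing but cap the worst-case stall, which is why no single mode wins everywhere and an adaptive per-phase choice like the paper's AR makes sense.
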
Citations: 103
An energy-efficient and scalable eDRAM-based register file architecture for GPGPU
Pub Date: 2013-06-23 | DOI: 10.1145/2485922.2485952
Naifeng Jing, Yao Shen, Yao Lu, Shrikanth Ganapathy, Zhigang Mao, M. Guo, R. Canal, Xiaoyao Liang
The heavily-threaded data processing demands of streaming multiprocessors (SM) in a GPGPU require a large register file (RF). The rapidly increasing size of the RF makes the area cost and power consumption of traditional SRAM designs unaffordable in future technologies. In this paper, we propose to use embedded DRAM (eDRAM) as an alternative in future GPGPUs. Compared with SRAM, eDRAM provides higher density and lower leakage power. However, the limited data retention time of eDRAM poses new challenges: periodic refresh operations are needed to maintain data integrity, a problem exacerbated by the scaling of eDRAM density, process variations, and temperature. Unlike conventional CPUs, which use multi-ported RFs, most RFs in modern GPGPUs are heavily banked but not multi-ported, to reduce hardware cost. This provides a unique opportunity to hide the refresh overhead. We propose two different eDRAM implementations based on 3T1D and 1T1C memory cells. To mitigate the impact of periodic refresh, we propose two novel refresh solutions using bank bubbles and bank walk-through. In addition, for the 1T1C RF, we design an interleaved bank organization together with an intelligent warp scheduling strategy to reduce the impact of destructive reads. Our analysis shows that these schemes offer better energy efficiency, scalability, and variation tolerance than traditional SRAM-based designs.
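
The bank-bubble idea can be sketched with an invented toy model: the RF is heavily banked, so on most cycles some banks have no operand access scheduled, and those idle slots can absorb eDRAM refreshes without stalling any warp.

```python
# Invented toy model of refresh hidden in "bank bubbles" of a banked RF.

import itertools

NUM_BANKS, ROWS_PER_BANK, RETENTION_CYCLES = 4, 8, 64

deadline = [[RETENTION_CYCLES] * ROWS_PER_BANK for _ in range(NUM_BANKS)]
hidden = stalls = 0
accesses = itertools.cycle([{0, 1}, {1, 2}, {0, 3}, set()])  # busy banks/cycle

for cycle in range(1000):
    busy = next(accesses)
    for b in range(NUM_BANKS):
        urgent = min(range(ROWS_PER_BANK), key=lambda r: deadline[b][r])
        if b not in busy and deadline[b][urgent] < RETENTION_CYCLES // 2:
            deadline[b][urgent] = RETENTION_CYCLES   # refresh hidden in a bubble
            hidden += 1
        elif deadline[b][urgent] <= 0:
            deadline[b][urgent] = RETENTION_CYCLES   # forced refresh: stall
            stalls += 1
        for r in range(ROWS_PER_BANK):
            deadline[b][r] -= 1

print(f"refreshes hidden in idle slots: {hidden}, forced stalls: {stalls}")
```

With even modest bank idleness, every refresh lands in a bubble and the forced-stall count stays at zero, which is the effect the paper exploits.
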
Citations: 71
Virtualizing power distribution in datacenters
Pub Date: 2013-06-23 | DOI: 10.1145/2485922.2485973
Di Wang, Chuangang Ren, A. Sivasubramaniam
Power infrastructure contributes a significant portion of datacenter expenditures. Overbooking this infrastructure, i.e., provisioning for a high percentile of needs rather than for occasional peaks, is becoming increasingly attractive. Several computing knobs exist to cap the power draw within such under-provisioned capacity. Recently, batteries and other energy storage devices have been proposed as a complementary alternative to these knobs; when decentralized (or hierarchically placed), they can temporarily take the load to suppress power peaks propagating up the hierarchy. With aggressive under-provisioning, the power hierarchy becomes as central a datacenter resource as other computing resources, making it imperative to carefully allocate, isolate, and manage this resource (including batteries) across applications. Towards this goal, we present vPower, a software system to virtualize power distribution. vPower includes mechanisms and policies to provide a virtual power hierarchy for each application. It leverages traditional computing knobs as well as batteries to apportion and manage the infrastructure among co-existing applications in the hierarchy. vPower allows applications to specify their power needs, performs admission control and placement, dynamically monitors power usage, and enforces allocations for fairness and system efficiency. Using several datacenter applications, and a 2-level power hierarchy prototype containing batteries at both levels, we demonstrate the effectiveness of vPower when working in an under-provisioned power infrastructure, using the right computing knobs and the right batteries at the right time. Results show over 50% improved system utilization and scale-out for vPower's overbooking, and 12-28% better application performance than traditional power-capping control knobs. It also ensures isolation between applications competing for power.
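
A toy sketch of vPower-style admission control on a two-level power hierarchy, with all names and numbers invented: a new application's sustained draw must fit under every ancestor's provisioned capacity, while short peaks may exceed a level's capacity only up to the energy its battery can supply.

```python
# Invented sketch of hierarchical power admission control with batteries.

class Node:
    def __init__(self, capacity_w, battery_wh):
        self.capacity_w = capacity_w
        self.battery_wh = battery_wh
        self.sustained_w = 0.0

    def admit(self, sustained_w, peak_w, peak_hours):
        if self.sustained_w + sustained_w > self.capacity_w:
            return False                        # sustained draw must always fit
        peak_total = self.sustained_w + peak_w
        overdraw_wh = max(0.0, peak_total - self.capacity_w) * peak_hours
        return overdraw_wh <= self.battery_wh   # battery absorbs the peak

rack = Node(capacity_w=800, battery_wh=50)      # leaf level
pdu = Node(capacity_w=1500, battery_wh=200)     # parent level

# Application asking for 500 W sustained with 900 W peaks lasting ~0.2 h,
# placed on `rack`: admission checks the leaf and every ancestor.
levels = (rack, pdu)
if all(n.admit(500, 900, 0.2) for n in levels):
    for n in levels:
        n.sustained_w += 500
    print("admitted")
else:
    print("rejected")
```
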
Citations: 45