
2014 47th Annual IEEE/ACM International Symposium on Microarchitecture: Latest Publications

Skewed Compressed Caches
Pub Date : 2014-12-13 DOI: 10.1109/MICRO.2014.41
S. Sardashti, André Seznec, D. Wood
Cache compression seeks the benefits of a larger cache with the area and power of a smaller cache. Ideally, a compressed cache increases effective capacity by tightly compacting compressed blocks, has low tag and metadata overheads, and allows fast lookups. Previous compressed cache designs, however, fail to achieve all these goals. In this paper, we propose the Skewed Compressed Cache (SCC), a new hardware compressed cache that lowers overheads and increases performance. SCC tracks super blocks to reduce tag overhead, compacts blocks into a variable number of sub-blocks to reduce internal fragmentation, but retains a direct tag-data mapping to find blocks quickly and eliminate extra metadata (i.e., no backward pointers). SCC does this using novel sparse super-block tags and a skewed associative mapping that takes compressed size into account. In our experiments, SCC provides on average 8% (up to 22%) higher performance, and on average 6% (up to 20%) lower total energy, achieving the benefits of the recent Decoupled Compressed Cache [26] with a factor of 4 lower area overhead and lower design complexity.
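The heart of the design is a set-index function that is both skewed (each way hashes the super-block address differently) and size-aware (the compressed size class is folded into the hash so a block lands where sub-blocks of the right granularity live). The following is a minimal Python sketch of such an index; the set count, per-way seeds, 16-byte sub-block granularity, and the hash itself are illustrative assumptions, not the functions used in the paper.

```python
# Illustrative skewed, size-aware set index (not the paper's actual hash functions).
# Each way hashes the super-block address with a different seed, and the compressed
# size class is mixed in so that blocks of different compressibility spread out.

NUM_SETS = 1024                                               # sets per way (assumed)
WAY_SEEDS = [0x9E3779B1, 0x85EBCA77, 0xC2B2AE3D, 0x27D4EB2F]  # one seed per way (assumed)

def size_class(compressed_bytes: int) -> int:
    """Map a compressed 64B block to the number of 16B sub-blocks it needs (1..4)."""
    return min(4, max(1, (compressed_bytes + 15) // 16))

def skewed_index(super_block_addr: int, way: int, cls: int) -> int:
    """Different hash per way (skewing), with the size class folded in."""
    x = (super_block_addr ^ (cls * 0x61C88647)) * WAY_SEEDS[way]
    return (x >> 16) % NUM_SETS

# A 20-byte compressed block needs 2 sub-blocks and maps to a different set in each way.
cls = size_class(20)
print(cls, [skewed_index(0x1A2B3, w, cls) for w in range(4)])
```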
Citations: 61
Using ECC Feedback to Guide Voltage Speculation in Low-Voltage Processors
Pub Date : 2014-12-13 DOI: 10.1109/MICRO.2014.54
Anys Bacha, R. Teodorescu
Low-voltage computing is emerging as a promising energy-efficient solution to power-constrained environments. Unfortunately, low-voltage operation presents significant reliability challenges, including increased sensitivity to static and dynamic variability. To prevent errors, safety guard bands can be added to the supply voltage. While these guard bands are feasible at higher supply voltages, they are prohibitively expensive at low voltages, to the point of negating most of the energy savings. Voltage speculation techniques have been proposed to dynamically reduce voltage margins. Most require additional hardware to be added to the chip to correct or prevent timing errors caused by excessively aggressive speculation. This paper presents a mechanism for safely guiding voltage speculation using direct feedback from ECC-protected cache lines. We conduct extensive testing of an Intel Itanium processor running at low voltages. We find that as voltage margins are reduced, certain ECC-protected cache lines consistently exhibit correctable errors. We propose a hardware mechanism for continuously probing these cache lines to fine tune supply voltage at core granularity within a chip. Moreover, we demonstrate that this mechanism is sufficiently sensitive to detect and adapt to voltage noise caused by fluctuations in chip activity. We evaluate a proof-of-concept implementation of this mechanism in an Itanium-based server. We show that this solution lowers supply voltage by 18% on average, reducing power consumption by an average of 33% while running a mix of benchmark applications.
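The proposal boils down to a per-core feedback loop: step the supply voltage down while probing cache lines known to produce correctable ECC errors first, and stop (plus a small guard band) as soon as such errors are reported. A minimal sketch of that loop is below; `read_correctable_errors` and `set_voltage` are hypothetical stand-ins for the firmware and hardware hooks, and the voltage constants are illustrative.

```python
# Hypothetical per-core voltage-tuning loop guided by ECC feedback. The two callbacks
# stand in for real firmware/hardware hooks; voltages are in millivolts.

NOMINAL_MV = 1100
MIN_MV = 800
STEP_MV = 10
GUARD_MV = 10   # margin restored above the last error-free point

def tune_core_voltage(read_correctable_errors, set_voltage) -> int:
    v = NOMINAL_MV
    while v - STEP_MV >= MIN_MV:
        candidate = v - STEP_MV
        set_voltage(candidate)
        if read_correctable_errors() > 0:   # probed ECC lines start reporting errors
            break                           # candidate is too aggressive; keep last good v
        v = candidate
    final = min(NOMINAL_MV, v + GUARD_MV)   # back off slightly for noise headroom
    set_voltage(final)
    return final

# Toy usage: pretend correctable errors appear below 980 mV.
level = {"v": NOMINAL_MV}
print(tune_core_voltage(lambda: int(level["v"] < 980), lambda mv: level.update(v=mv)))  # 990
```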
Citations: 46
COMP: Compiler Optimizations for Manycore Processors
Pub Date : 2014-12-13 DOI: 10.1109/MICRO.2014.30
Linhai Song, Min Feng, N. Ravi, Yi Yang, S. Chakradhar
Applications executing on multicore processors can now easily offload computations to manycore processors, such as Intel Xeon Phi coprocessors. However, it requires high levels of expertise and effort to tune such offloaded applications to realize high-performance execution. Previous efforts have focused on optimizing the execution of offloaded computations on manycore processors. However, we observe that the data transfer overhead between multicore and manycore processors, and the limited device memories of manycore processors, often constrain the performance gains that are possible by offloading computations. In this paper, we present three source-to-source compiler optimizations that can significantly improve the performance of applications that offload computations to manycore processors. The first optimization automatically transforms offloaded codes to enable data streaming, which overlaps data transfer between multicore and manycore processors with computations on these processors to hide data transfer overhead. This optimization is also designed to minimize the memory usage on manycore processors, while achieving the optimal performance. The second compiler optimization re-orders computations to regularize irregular memory accesses. It enables data streaming and factorization on manycore processors, even when the memory access patterns in the original source codes are irregular. Finally, our new shared memory mechanism provides efficient support for transferring large pointer-based data structures between hosts and manycore processors. Our evaluation shows that the proposed compiler optimizations benefit 9 out of 12 benchmarks. Compared with simply offloading the original parallel implementations of these benchmarks, we can achieve 1.16x-52.21x speedups.
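The first optimization (data streaming) is the easiest to picture: split the offloaded input into chunks and overlap the transfer of chunk i+1 with computation on chunk i, which hides copy latency and also bounds how much device memory is resident at once. Below is a minimal double-buffering sketch of that pattern in Python; `transfer` and `compute` are generic placeholders for the host-to-device copy and the offloaded kernel, not COMP-generated code.

```python
# Double-buffered streaming offload: copy the next chunk while computing on the
# current one. transfer() and compute() are placeholders for a device copy and an
# offloaded kernel launch.

from concurrent.futures import ThreadPoolExecutor

def stream_offload(data, chunk_size, transfer, compute):
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    results = []
    with ThreadPoolExecutor(max_workers=1) as copier:
        pending = copier.submit(transfer, chunks[0])      # prefetch the first chunk
        for nxt in chunks[1:]:
            on_device = pending.result()                  # copy of current chunk is done
            pending = copier.submit(transfer, nxt)        # start copying the next chunk...
            results.append(compute(on_device))            # ...while computing on this one
        results.append(compute(pending.result()))         # last chunk
    return results

# Trivial stand-ins: "transfer" is the identity, "compute" sums the chunk.
print(stream_offload(list(range(10)), 4, transfer=lambda c: c, compute=sum))  # [6, 22, 17]
```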
Citations: 10
Dodec: Random-Link, Low-Radix On-Chip Networks
Pub Date : 2014-12-13 DOI: 10.1109/MICRO.2014.19
Haofan Yang, J. Tripathi, Natalie D. Enright Jerger, Dan Gibson
Network topology plays a vital role in chip design: it largely determines network cost (power and area) and significantly impacts communication performance in many-core architectures. Conventional topologies such as a 2D mesh have drawbacks, including high diameter as the network scales and poor load balancing for the center nodes. We propose a methodology to design random topologies for on-chip networks. Random topologies provide better scalability in terms of network diameter and provide inherent load balancing. As a proof-of-concept for random on-chip topologies, we explore a novel set of networks -- dodecs -- and illustrate how they reduce network diameter with randomized low-radix router connections. While a 4 × 4 mesh has a diameter of 6, our dodec has a diameter of 4 with lower cost. By introducing randomness, dodec networks exhibit more uniform message latency. By using low-radix routers, dodec networks simplify the router microarchitecture and attain 20% area and 22% power reduction compared to mesh routers while delivering the same overall application performance for PARSEC.
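The diameter advantage of random low-radix wiring is easy to reproduce: a random degree-3 graph on 16 routers almost always has a much smaller diameter than a 4 × 4 mesh (diameter 6). The sketch below uses a generic random regular graph from networkx as a stand-in; it is not the paper's actual dodec construction, which also has to respect physical layout and link-length constraints.

```python
# Compare the diameter of a 4x4 mesh NoC with a random 3-regular topology on the
# same 16 routers. Uses a generic random regular graph, not the paper's dodec wiring.

import networkx as nx

mesh = nx.grid_2d_graph(4, 4)                 # conventional 4x4 mesh

g = nx.random_regular_graph(3, 16, seed=1)    # 16 routers, 3 links per router
while not nx.is_connected(g):                 # regenerate in the rare disconnected case
    g = nx.random_regular_graph(3, 16)

print("4x4 mesh diameter:        ", nx.diameter(mesh))   # 6
print("random 3-regular diameter:", nx.diameter(g))      # typically 3 or 4
```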
Citations: 17
Wormhole: Wisely Predicting Multidimensional Branches
Pub Date : 2014-12-13 DOI: 10.1109/MICRO.2014.40
Jorge Albericio, Joshua San Miguel, Natalie D. Enright Jerger, Andreas Moshovos
Improving branch prediction accuracy is essential in enabling high-performance processors to find more concurrency and to improve energy efficiency by reducing wrong-path instruction execution, a paramount concern in today's power-constrained computing landscape. Branch prediction traditionally considers past branch outcomes as a linear, continuous bit stream through which it searches for patterns and correlations. The state-of-the-art TAGE predictor and its variants follow this approach while varying the length of the global history fragments they consider. This work identifies a construct, inherent to several applications, that challenges existing linear-history-based branch prediction strategies. It finds that applications have branches that exhibit multi-dimensional correlations. These are branches with the following two attributes: 1) they are enclosed within nested loops, and 2) they exhibit correlation across iterations of the outer loops. Folding the branch history and interpreting it as a multidimensional piece of information exposes these cross-iteration correlations, allowing predictors to search for more complex correlations in the history space at lower cost. We present Wormhole, a new side-predictor that exploits these multidimensional histories. Wormhole is integrated alongside ISL-TAGE and leverages information from its existing side-predictors. Experiments show that the wormhole predictor improves accuracy more than existing side-predictors, some of which are commercially available, with a similar hardware cost. Considering 40 diverse application traces, the wormhole predictor reduces MPKI by an average of 2.53% and 3.15% on top of 4KB and 32KB ISL-TAGE predictors respectively. When considering the top four workloads that exhibit multi-dimensional history correlations, Wormhole achieves 22% and 20% MPKI average reductions over 4KB and 32KB ISL-TAGE.
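The multidimensional-history idea can be made concrete with a toy example: a branch inside a nested loop whose outcome at inner index j correlates with its own outcome at the same j in the previous outer iteration. A linear history of modest length never reaches that far back once the inner loop is long, but "folding" the history by the inner trip count does. The sketch below is only an illustration of that folding; the real Wormhole side-predictor sits alongside ISL-TAGE, discovers the folding width in hardware, and uses tagged saturating-counter tables rather than this simplified structure.

```python
# Toy "folded history" predictor: predict a branch from its own outcome at the same
# inner-loop position one outer iteration earlier. INNER is the folding width, which
# the real predictor would detect dynamically; here it is assumed known.

from collections import defaultdict

INNER = 64

class FoldedHistoryPredictor:
    def __init__(self):
        self.history = []                        # per-branch outcome history
        self.counters = defaultdict(int)         # small signed counters per folded context

    def _key(self):
        return self.history[-INNER] if len(self.history) >= INNER else None

    def predict(self) -> bool:
        return self.counters[self._key()] >= 0

    def update(self, taken: bool):
        k = self._key()
        self.counters[k] = max(-2, min(1, self.counters[k] + (1 if taken else -1)))
        self.history.append(taken)

# Outcome flips every outer iteration: outcome(i, j) = (i + j) % 2. A short linear
# history sees a near-random stream, but the folded key predicts it well.
pred, hits, total = FoldedHistoryPredictor(), 0, 0
for i in range(32):
    for j in range(INNER):
        outcome = bool((i + j) % 2)
        hits += (pred.predict() == outcome)
        total += 1
        pred.update(outcome)
print(f"folded-history accuracy: {hits / total:.1%}")   # well above 90% after warm-up
```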
Citations: 15
Random Fill Cache Architecture
Pub Date : 2014-12-13 DOI: 10.1109/MICRO.2014.28
Fangfei Liu, R. Lee
Correctly functioning caches have been shown to leak critical secrets like encryption keys through various types of cache side-channel attacks. This nullifies the security provided by strong encryption and allows confidentiality breaches, impersonation attacks and fake services. Hence, future cache designs must consider security, ideally without degrading performance and power efficiency. We introduce a new classification of cache side-channel attacks: contention-based attacks and reuse-based attacks. Previous secure cache designs target only contention-based attacks, and we show that they cannot defend against reuse-based attacks. We show the surprising insight that the fundamental demand fetch policy of a cache is a security vulnerability that causes the success of reuse-based attacks. We propose a novel random fill cache architecture that replaces demand fetch with random cache fill within a configurable neighborhood window. We show that our random fill cache does not degrade performance, and in fact, improves the performance for some types of applications. We also show that it provides information-theoretic security against reuse-based attacks.
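Functionally, the defense separates the demand access from the fill decision: a miss returns data to the core without caching that line, and a line chosen at random from a window around the missed address is filled instead, so the cache contents no longer pinpoint the victim's addresses. The sketch below is a minimal functional model of that policy; the window size, the fully-associative toy structure, and the eviction choice are illustrative, not the evaluated hardware design.

```python
# Functional sketch of a random fill policy: on a miss, the requested line is sent to
# the core but NOT cached; a random line within +/- WINDOW of it is filled instead.

import random

WINDOW = 8            # neighborhood half-width in cache lines (configurable)
CAPACITY = 512        # toy fully-associative capacity, in lines

class RandomFillCache:
    def __init__(self):
        self.lines = set()

    def access(self, line_addr, fetch_line):
        if line_addr in self.lines:
            return fetch_line(line_addr)              # hit (data would come from the cache)
        data = fetch_line(line_addr)                  # miss: data bypasses the cache
        neighbor = line_addr + random.randint(-WINDOW, WINDOW)
        if len(self.lines) >= CAPACITY:
            self.lines.pop()                          # arbitrary eviction, enough for a sketch
        self.lines.add(neighbor)                      # random fill within the window
        return data

cache = RandomFillCache()
cache.access(1000, fetch_line=lambda a: f"line {a}")
print(1000 in cache.lines)   # usually False: the demanded line itself was not cached
```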
Citations: 221
Equalizer: Dynamic Tuning of GPU Resources for Efficient Execution
Pub Date : 2014-12-13 DOI: 10.1109/MICRO.2014.16
Ankit Sethia, S. Mahlke
GPUs use thousands of threads to provide high performance and efficiency. In general, if one thread of a kernel uses one of the resources (compute, bandwidth, data cache) more heavily, there will be significant contention for that resource due to the large number of identical concurrent threads. This contention will eventually saturate the performance of the kernel due to contention for the bottleneck resource, while at the same time leaving other resources underutilized. To overcome this problem, a runtime system that can tune the hardware to match the characteristics of a kernel can effectively mitigate the imbalance between resource requirements of kernels and the hardware resources present on the GPU. We propose Equalizer, a low overhead hardware runtime system, that dynamically monitors the resource requirements of a kernel and manages the amount of on-chip concurrency, core frequency and memory frequency to adapt the hardware to best match the needs of the running kernel. Equalizer provides efficiency in two modes. Firstly, it can save energy without significant performance degradation by throttling under-utilized resources. Secondly, it can boost bottleneck resources to reduce contention and provide higher performance without significant energy increase. Across a spectrum of 27 kernels, Equalizer achieves 15% savings in energy mode and 22% speedup in performance mode.
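A rough picture of the runtime decision: each epoch, hardware counters identify the saturated resource, and the knobs (number of resident thread blocks, core frequency, memory frequency) are moved up or down depending on whether the goal is performance or energy. The sketch below is a hedged illustration of that decision step; the counter names, thresholds, and knob set are assumptions, not Equalizer's actual interface.

```python
# Illustrative per-epoch decision in an Equalizer-style runtime. Counter and knob
# names are hypothetical; deltas of -1/0/+1 mean lower/keep/raise a setting.

def equalizer_step(util, mode):
    """util: utilization in [0, 1] for 'compute', 'memory_bw' and 'cache'."""
    knobs = {"num_ctas": 0, "core_freq": 0, "mem_freq": 0}
    bottleneck = max(util, key=util.get)

    if mode == "performance":                  # boost whatever is saturated
        if bottleneck == "compute":
            knobs["core_freq"] = +1
        elif bottleneck == "memory_bw":
            knobs["mem_freq"] = +1
        else:                                  # cache-sensitive: fewer thread blocks thrash less
            knobs["num_ctas"] = -1
    else:                                      # energy mode: throttle under-used resources
        if util["compute"] < 0.5:
            knobs["core_freq"] = -1
        if util["memory_bw"] < 0.5:
            knobs["mem_freq"] = -1
    return knobs

print(equalizer_step({"compute": 0.9, "memory_bw": 0.4, "cache": 0.2}, "energy"))
# {'num_ctas': 0, 'core_freq': 0, 'mem_freq': -1}
```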
Citations: 63
Compiler Support for Optimizing Memory Bank-Level Parallelism
Pub Date : 2014-12-13 DOI: 10.1109/MICRO.2014.34
W. Ding, D. Guttman, M. Kandemir
Many prior compiler-based optimization schemes focused exclusively on cache data locality. However, cache locality is only one part of the overall performance of applications running on emerging multicore or manycore processors. For example, memory stalls could constitute a very large fraction of execution time even in cache-optimized codes, and one of the main reasons for this is lack of memory-level parallelism. Motivated by this, we propose a compiler-based Bank-Level Parallelism (BLP) optimization scheme that uses loop tile scheduling. More specifically, we first use Cache Miss Equations to predict where last-level cache misses will happen in each tile, and then identify the set of memory banks that will be accessed in each tile. Using this information, two tile scheduling algorithms are proposed to maximize BLP, each targeting a different scenario. We further discuss how our compiler-based scheme can be enhanced to consider memory controller-level parallelism and row-buffer locality. Our experimental evaluation using 11 multithreaded applications shows that the proposed BLP optimization can improve BLP by 17.1% on average, resulting in a 9.2% reduction in average memory access latency. Furthermore, considering memory controller-level parallelism and row-buffer locality (in addition to BLP) takes our average improvement in memory access latency to 22.2%.
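The scheduling input is simple: for each tile, the set of DRAM banks its predicted last-level misses will touch (the bank is a small bit-field of the physical address), and the goal is an ordering in which tiles running back-to-back cover different banks. Below is a minimal greedy sketch of that idea; the address-to-bank mapping and the heuristic are simplifying assumptions, not the two scheduling algorithms from the paper.

```python
# Greedy sketch of bank-aware tile scheduling: extract each tile's bank footprint from
# its predicted miss addresses, then order tiles so consecutive tiles overlap as little
# as possible in the banks they touch. The address-to-bank mapping is an assumption.

NUM_BANKS = 8
BANK_SHIFT = 13            # assumed: bank bits sit just above an 8 KB row offset

def bank_of(addr: int) -> int:
    return (addr >> BANK_SHIFT) % NUM_BANKS

def schedule_tiles(tile_misses):
    """tile_misses: dict tile_id -> iterable of predicted last-level miss addresses."""
    footprint = {t: {bank_of(a) for a in addrs} for t, addrs in tile_misses.items()}
    order, prev, remaining = [], set(), set(footprint)
    while remaining:
        nxt = min(remaining, key=lambda t: (len(footprint[t] & prev), t))  # least overlap first
        order.append(nxt)
        prev = footprint[nxt]
        remaining.remove(nxt)
    return order

tiles = {0: [0x00000, 0x02000], 1: [0x00010, 0x02020], 2: [0x04000, 0x06000]}
print(schedule_tiles(tiles))   # [0, 2, 1]: tile 2 touches different banks than tile 0
```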
Citations: 16
Hi-Rise: A High-Radix Switch for 3D Integration with Single-Cycle Arbitration
Pub Date : 2014-12-13 DOI: 10.1109/MICRO.2014.45
Supreet Jeloka, R. Das, R. Dreslinski, T. Mudge, D. Blaauw
This paper proposes a novel 3D switch, called 'Hi-Rise', that employs high-radix switches to efficiently route data across multiple stacked layers of dies. The proposed interconnect is hierarchical and composed of two switches per silicon layer and a set of dedicated layer-to-layer channels. However, a hierarchical 3D switch can lead to unfair arbitration across different layers. To address this, the paper proposes a unique class-based arbitration scheme that is fully integrated into the switching fabric and is easy to implement. It makes the 3D hierarchical switch's fairness comparable to that of a flat 2D switch with least-recently-granted arbitration. The 3D switch is evaluated for different radices, numbers of stacked layers, and different 3D integration technologies. A 64-radix, 128-bit-wide, 4-layer Hi-Rise evaluated in a 32nm technology has a throughput of 10.65 Tbps for uniform random traffic. Compared to a 2D design, this corresponds to a 15% improvement in throughput, a 33% area reduction, a 20% latency reduction, and a 38% reduction in energy per transaction.
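The fairness issue arises because a request from a far layer passes through more arbitration stages than a local one; tagging every request with a class (its source layer) and arbitrating over classes, rather than raw input ports, keeps grant rates even across layers. The sketch below shows a simple round-robin-over-classes arbiter as an illustration of that principle; the paper's single-cycle hardware arbiter is considerably more involved.

```python
# Toy class-based arbiter: requests carry a class (source layer), and the arbiter
# round-robins over classes so distant layers are not starved by local traffic.

from collections import deque

class ClassArbiter:
    def __init__(self, num_classes: int):
        self.priority = deque(range(num_classes))   # rotating class priority

    def grant(self, requests):
        """requests: dict class_id -> list of pending requests; returns (class, request)."""
        for _ in range(len(self.priority)):
            cls = self.priority[0]
            self.priority.rotate(-1)                # next grant, another class goes first
            if requests.get(cls):
                return cls, requests[cls][0]
        return None                                 # nothing to grant this cycle

arb = ClassArbiter(num_classes=4)
pending = {0: ["local flit"], 3: ["far-layer flit"]}
print(arb.grant(pending))   # (0, 'local flit')
print(arb.grant(pending))   # (3, 'far-layer flit'): class 0 no longer has top priority
```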
Citations: 14
DaDianNao: A Machine-Learning Supercomputer
Pub Date : 2014-12-13 DOI: 10.1109/MICRO.2014.58
Yunji Chen, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang, Ling Li, Tianshi Chen, Zhiwei Xu, Ninghui Sun, O. Temam
Many companies are deploying services, either for consumers or industry, which are largely based on machine-learning algorithms for sophisticated processing of large amounts of data. The state-of-the-art and most popular such machine-learning algorithms are Convolutional and Deep Neural Networks (CNNs and DNNs), which are known to be both computationally and memory intensive. A number of neural network accelerators have been recently proposed which can offer high computational capacity/area ratio, but which remain hampered by memory accesses. However, unlike the memory wall faced by processors on general-purpose workloads, the CNN and DNN memory footprint, while large, is not beyond the capability of the on-chip storage of a multi-chip system. This property, combined with the CNN/DNN algorithmic characteristics, can lead to high internal bandwidth and low external communications, which can in turn enable high-degree parallelism at a reasonable area cost. In this article, we introduce a custom multi-chip machine-learning architecture along those lines. We show that, on a subset of the largest known neural network layers, it is possible to achieve a speedup of 450.65x over a GPU, and reduce the energy by 150.31x on average for a 64-chip system. We implement the node down to place and route at 28nm, containing a combination of custom storage and computational units, with industry-grade interconnects.
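The bandwidth argument is worth a quick back-of-envelope check: if each node keeps its slice of a layer's synapses in on-chip storage, only neuron activations ever cross chip boundaries. The sketch below works this out for a fully connected layer partitioned over output neurons across 64 nodes; the layer size and 16-bit operands are illustrative, not the paper's exact configuration.

```python
# Back-of-envelope for "synapses stay on chip, only activations move": partition a
# fully connected layer's output neurons across the nodes of a 64-chip system.

NODES = 64
BYTES_PER_VALUE = 2                     # 16-bit fixed-point operands (assumed)

def fc_layer_footprint(n_in, n_out, nodes=NODES):
    total_weights = n_in * n_out * BYTES_PER_VALUE      # all synapses of the layer
    weights_per_node = total_weights / nodes             # resident in each node's storage
    activations_on_wire = n_in * BYTES_PER_VALUE         # inputs broadcast to every node
    return weights_per_node, activations_on_wire

w, a = fc_layer_footprint(4096, 4096)
print(f"weights per node: {w / 2**20:.2f} MiB, activations communicated: {a / 2**10:.0f} KiB")
# weights per node: 0.50 MiB, activations communicated: 8 KiB
```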
Citations: 1256