
Latest publications: 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)

Short-Circuit Dispatch: Accelerating Virtual Machine Interpreters on Embedded Processors
Channoh Kim, Sungmin Kim, Hyeonwoong Cho, Doo-Young Kim, Jaehyeok Kim, Young H. Oh, Hakbeom Jang, Jae W. Lee
Interpreters are widely used to implement high-level language virtual machines (VMs), especially on resource-constrained embedded platforms. Many scripting languages employ interpreter-based VMs for their advantages over native code compilers, such as portability, smaller resource footprint, and compact codes. For efficient interpretation a script (program) is first compiled into an intermediate representation, or bytecodes. The canonical interpreter then runs an infinite loop that fetches, decodes, and executes one bytecode at a time. This bytecode dispatch loop is a well-known source of inefficiency, typically featuring a large jump table with a hard-to-predict indirect jump. Most existing techniques to optimize this loop focus on reducing the misprediction rate of this indirect jump in both hardware and software. However, these techniques are much less effective on embedded processors with shallow pipelines and low IPCs. Instead, we tackle another source of inefficiency more prominent on embedded platforms - redundant computation in the dispatch loop. To this end, we propose Short-Circuit Dispatch (SCD), a low cost architectural extension that enables fast, hardware-based bytecode dispatch with fewer instructions. The key idea of SCD is to overlay the software-created bytecode jump table on a branch target buffer (BTB). Once a bytecode is fetched, the BTB is looked up using the bytecode, instead of PC, as key. If it hits, the interpreter directly jumps to the target address retrieved from the BTB, otherwise, it goes through the original dispatch path. This effectively eliminates redundant computation in the dispatcher code for decode, bound check, and target address calculation, thus significantly reducing total instruction count. Our simulation results demonstrate that SCD achieves geomean speedups of 19.9% and 14.1% for two production-grade script interpreters for Lua and JavaScript, respectively. 
Moreover, our fully synthesizable RTL design based on a RISC-V embedded processor shows that SCD improves the EDP of the Lua interpreter by 24.2%, while increasing the chip area by only 0.72% at a 40nm technology node.
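The dispatch loop the abstract describes, and the BTB-overlay idea, can be sketched as a toy Python model. This is purely illustrative (the opcode names and the BTB fill policy are invented for the example), not the paper's RTL design:

```python
# Illustrative model of a canonical bytecode interpreter and SCD's
# bytecode-keyed BTB lookup; opcodes and fill policy are hypothetical.
PUSH, ADD, MUL, HALT = range(4)   # a tiny stack-machine ISA

def run(bytecode):
    """Canonical interpreter: an infinite fetch/decode/execute loop whose
    per-opcode branch stands in for the hard-to-predict indirect jump."""
    stack, pc = [], 0
    while True:
        op = bytecode[pc]                      # fetch
        if op == PUSH:                         # decode + dispatch
            stack.append(bytecode[pc + 1]); pc += 2
        elif op == ADD:
            b, a = stack.pop(), stack.pop(); stack.append(a + b); pc += 1
        elif op == MUL:
            b, a = stack.pop(), stack.pop(); stack.append(a * b); pc += 1
        elif op == HALT:
            return stack.pop()

class BytecodeBTB:
    """Toy model of SCD: the BTB is looked up with the bytecode (not the PC)
    as key; a hit skips decode, bound check, and target-address calculation."""
    def __init__(self):
        self.table = {}                        # opcode -> handler target
        self.hits = self.misses = 0

    def dispatch(self, op, slow_path):
        if op in self.table:                   # BTB hit: jump directly
            self.hits += 1
            return self.table[op]
        self.misses += 1                       # miss: original dispatch path,
        self.table[op] = slow_path(op)         # then fill the BTB entry
        return self.table[op]
```

`run([PUSH, 6, PUSH, 7, MUL, HALT])` evaluates 6 × 7; with `BytecodeBTB`, repeated opcodes hit after their first miss, which is where the instruction-count savings come from.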
DOI: 10.1145/3007787.3001168 · pp. 291-303 · Published 2016-06-18
Citations: 5
XED: Exposing On-Die Error Detection Information for Strong Memory Reliability
Prashant J. Nair, Vilas Sridharan, Moinuddin K. Qureshi
Large-granularity memory failures continue to be a critical impediment to system reliability. To make matters worse, as DRAM scales to smaller nodes, the frequency of unreliable bits in DRAM chips continues to increase. To mitigate such scaling-related failures, memory vendors are planning to equip existing DRAM chips with On-Die ECC. For maintaining compatibility with memory standards, On-Die ECC is kept invisible from the memory controller. This paper explores how to design high reliability memory systems in presence of On-Die ECC. We show that if On-Die ECC is not exposed to the memory system, having a 9-chip ECC-DIMM (implementing SECDED) provides almost no reliability benefits compared to an 8-chip non-ECC DIMM. We also show that if the error detection of On-Die ECC can be exposed to the memory controller, then Chipkill-level reliability can be achieved even with a 9-chip ECC-DIMM. To this end, we propose eXposed On-Die Error Detection (XED), which exposes the On-Die error detection information without requiring changes to the memory standards or consuming bandwidth overheads. When the On-Die ECC detects an error, XED transmits a pre-defined “catch-word” instead of the corrected data value. On receiving the catch-word, the memory controller uses the parity stored in the 9-chip of the ECC-DIMM to correct the faulty chip (similar to RAID-3). Our studies show that XED provides Chipkill-level reliability (172× higher than SECDED), while incurring negligible overheads, with a 21% lower execution time than Chipkill. We also show that XED can enable Chipkill systems to provide Double-Chipkill level reliability while avoiding the associated storage, performance, and power overheads.
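The RAID-3-style correction step can be sketched as follows. This is a simplified model (chip outputs modeled as integers, and a data value that legitimately equals the catch-word would alias, which the real protocol must handle), not the paper's hardware:

```python
# Simplified model of XED's parity-based chip correction.
from functools import reduce

CATCH_WORD = 0xDEAD  # hypothetical pre-defined pattern sent on detection

def write_parity(data_chips):
    """The 9th chip of the ECC-DIMM stores the XOR parity of the 8 data chips."""
    return reduce(lambda a, b: a ^ b, data_chips)

def controller_read(chip_outputs, parity):
    """On-die ECC replaces a detected-faulty chip's output with CATCH_WORD;
    the controller rebuilds that chip from the parity of the other chips."""
    data = list(chip_outputs)
    faulty = [i for i, v in enumerate(data) if v == CATCH_WORD]
    if len(faulty) == 1:                       # single faulty chip: correct it
        i = faulty[0]
        others = data[:i] + data[i + 1:]
        data[i] = parity ^ reduce(lambda a, b: a ^ b, others)
    elif len(faulty) > 1:                      # beyond single-chip failure
        raise RuntimeError("uncorrectable: multiple chips signalled")
    return data
```

Because the catch-word consumes no extra pins or transfers, detection information reaches the controller without changing the memory standard, which is the point of the scheme.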
DOI: 10.1145/3007787.3001174 · pp. 341-353 · Published 2016-06-18
Citations: 56
Morpheus: Creating Application Objects Efficiently for Heterogeneous Computing
Hung-Wei Tseng, Qianchen Zhao, Yuxiao Zhou, Mark Gahagan, S. Swanson
In high performance computing systems, object deserialization can become a surprisingly important bottleneck-in our test, a set of general-purpose, highly parallelized applications spends 64% of total execution time deserializing data into objects. This paper presents the Morpheus model, which allows applications to move such computations to a storage device. We use this model to deserialize data into application objects inside storage devices, rather than in the host CPU. Using the Morpheus model for object deserialization avoids unnecessary system overheads, frees up scarce CPU and main memory resources for compute-intensive workloads, saves I/O bandwidth, and reduces power consumption. In heterogeneous, co-processor-equipped systems, Morpheus allows application objects to be sent directly from a storage device to a coprocessor (e.g., a GPU) by peer-to-peer transfer, further improving application performance as well as reducing the CPU and main memory utilizations. This paper implements Morpheus-SSD, an SSD supporting the Morpheus model. Morpheus-SSD improves the performance of object deserialization by 1.66×, reduces power consumption by 7%, uses 42% less energy, and speeds up the total execution time by 1.32×. By using NVMe-P2P that realizes peer-to-peer communication between Morpheus-SSD and a GPU, Morpheus-SSD can speed up the total execution time by 1.39× in a heterogeneous computing platform.
DOI: 10.1145/3007787.3001143 · pp. 53-65 · Published 2016-06-18
Citations: 66
Virtual Thread: Maximizing Thread-Level Parallelism beyond GPU Scheduling Limit
M. Yoon, Keunsoo Kim, Sangpil Lee, W. Ro, M. Annavaram
Modern GPUs require tens of thousands of concurrent threads to fully utilize the massive amount of processing resources. However, thread concurrency in GPUs can be diminished either due to shortage of thread scheduling structures (scheduling limit), such as available program counters and single instruction multiple thread stacks, or due to shortage of on-chip memory (capacity limit), such as register file and shared memory. Our evaluations show that in practice concurrency in many general purpose applications running on GPUs is curtailed by the scheduling limit rather than the capacity limit. Maximizing the utilization of on-chip memory resources without unduly increasing the scheduling complexity is a key goal of this paper. This paper proposes a Virtual Thread (VT) architecture which assigns Cooperative Thread Arrays (CTAs) up to the capacity limit, while ignoring the scheduling limit. However, to reduce the logic complexity of managing more threads concurrently, we propose to place CTAs into active and inactive states, such that the number of active CTAs still respects the scheduling limit. When all the warps in an active CTA hit a long latency stall, the active CTA is context switched out and the next ready CTA takes its place. We exploit the fact that both active and inactive CTAs still fit within the capacity limit which obviates the need to save and restore large amounts of CTA state. Thus VT significantly reduces performance penalties of CTA swapping. By swapping between active and inactive states, VT can exploit higher degree of thread level parallelism without increasing logic complexity. Our simulation results show that VT improves performance by 23.9% on average.
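The active/inactive CTA swap can be sketched as a toy scheduler. The class and limits here are illustrative stand-ins for the hardware mechanism, not the paper's design:

```python
# Toy model of Virtual Thread's active/inactive CTA management.
from collections import deque

class VirtualThreadScheduler:
    """CTAs are assigned up to the capacity limit (on-chip memory), but only
    `sched_limit` of them hold scheduling state ("active") at any time."""
    def __init__(self, sched_limit, capacity_limit, ctas):
        assert len(ctas) <= capacity_limit   # all resident CTAs fit on-chip
        self.active = deque(ctas[:sched_limit])
        self.inactive = deque(ctas[sched_limit:])
        self.swaps = 0

    def on_long_stall(self, cta):
        """All warps of an active CTA hit a long-latency stall: context-switch
        it out and activate the next ready inactive CTA. Because both states
        stay within the capacity limit, no CTA state is spilled off-chip."""
        if cta in self.active and self.inactive:
            self.active.remove(cta)
            self.active.append(self.inactive.popleft())
            self.inactive.append(cta)
            self.swaps += 1
```

The key property the model shows: swapping changes only which CTAs are schedulable, never how many are resident, so the swap is cheap.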
DOI: 10.1145/3007787.3001201 · pp. 609-621 · Published 2016-06-18
Citations: 50
Production-Run Software Failure Diagnosis via Adaptive Communication Tracking
Mohammad Mejbah Ul Alam, A. Muzahid
Software failure diagnosis techniques work either by sampling some events at production-run time or by using some bug detection algorithms. Some of the techniques require the failure to be reproduced multiple times. The ones that do not require such, are not adaptive enough when the execution platform, environment or code changes. We propose ACT, a diagnosis technique for production-run failures, that uses the machine intelligence of neural hardware. ACT learns some invariants (e.g., data communication invariants) on-the-fly using the neural hardware and records any potential violation of them. Since ACT can learn invariants on-the-fly, it can adapt to any change in execution setting or code. Since it records only the potentially violated invariants, the postprocessing phase can pinpoint the root cause fairly accurately without requiring to observe the failure again. ACT works seamlessly for many sequential and concurrency bugs. The paper provides a detailed design and implementation of ACT in a typical multiprocessor system. It uses a three stage pipeline for partially configurable one hidden layer neural networks. We have evaluated ACT on a variety of programs from popular benchmarks as well as open source programs. ACT diagnoses failures caused by 16 bugs from these programs with accurate ranking. Compared to existing learning and sampling based approaches, ACT has better diagnostic ability. For the default configuration, ACT has an average execution overhead of 8.2%.
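The learn-on-the-fly / record-violations workflow can be sketched as follows. Note the swap: ACT learns invariants with one-hidden-layer neural networks in hardware, whereas this illustrative sketch substitutes a simple per-site value-range invariant; all names are invented for the example:

```python
# Simplified stand-in for ACT's on-the-fly invariant learning: a [lo, hi]
# range per data-communication site replaces the paper's neural networks.
class InvariantTracker:
    def __init__(self, warmup=100):
        self.ranges = {}      # site -> [lo, hi] learned so far
        self.seen = {}        # site -> number of samples observed
        self.warmup = warmup
        self.violations = []  # (site, value) candidates for root-cause ranking

    def observe(self, site, value):
        lo_hi = self.ranges.setdefault(site, [value, value])
        n = self.seen.get(site, 0)
        if n < self.warmup or lo_hi[0] <= value <= lo_hi[1]:
            lo_hi[0] = min(lo_hi[0], value)   # still learning: widen invariant
            lo_hi[1] = max(lo_hi[1], value)
        else:
            # Record the potential violation; execution continues, and the
            # postprocessing phase ranks recorded violations later.
            self.violations.append((site, value))
        self.seen[site] = n + 1
```

Because only potentially violated invariants are logged, diagnosis does not require reproducing the failure, which matches the abstract's claim.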
DOI: 10.1145/3007787.3001175 · pp. 354-366 · Published 2016-06-18
Citations: 7
RelaxFault Memory Repair
Dong-Wan Kim, M. Erez
Memory system reliability is a serious concern in many systems today, and is becoming more worrisome as technology scales and system size grows. Stronger fault tolerance capability is therefore desirable, but often comes at high cost. In this paper, we propose a low-cost, fault-aware, hardware-only resilience mechanism, RelaxFault, that repairs the vast majority of memory faults using a small amount of the LLC to remap faulty memory locations. RelaxFault requires less than 100KiB of LLC capacity, has near-zero impact on performance and power. By repairing faults, RelaxFault relaxes the requirement for high fault tolerance of other mechanisms, such as ECC. A better tradeoff between resilience and overhead is made by exploiting an understanding of memory system architecture and fault characteristics. We show that RelaxFault provides better repair capability than prior work of similar cost, improves memory reliability to a greater extent, and significantly reduces the number of maintenance events and memory module replacements. We also propose a more refined memory fault model than prior work and demonstrate its importance.
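The remap-into-LLC idea can be sketched as a toy model. Line size, budget accounting, and method names are illustrative; the real mechanism is hardware-only and invisible to software:

```python
# Toy model of RelaxFault: a small reserved LLC region services accesses
# to known-faulty DRAM cache lines via a remap table.
class RelaxFaultRemap:
    LLC_BUDGET_BYTES = 100 * 1024   # paper: under 100 KiB of LLC capacity
    LINE = 64                       # illustrative cache-line size

    def __init__(self):
        self.remap = {}             # faulty line address -> LLC-backed copy

    def mark_faulty(self, addr, dram_read):
        """Repair a faulty location by migrating its line into the LLC region."""
        line = addr & ~(self.LINE - 1)
        if len(self.remap) * self.LINE >= self.LLC_BUDGET_BYTES:
            raise MemoryError("remap region full; rely on stronger ECC instead")
        self.remap[line] = bytearray(dram_read(line))

    def read(self, addr, dram_read):
        line = addr & ~(self.LINE - 1)
        if line in self.remap:      # repaired: served from the LLC region
            return bytes(self.remap[line])
        return dram_read(line)      # healthy: normal DRAM access
```

Because repaired lines never reach the faulty DRAM cells again, downstream ECC sees fewer errors, which is how RelaxFault "relaxes" the fault-tolerance requirement on other mechanisms.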
DOI: 10.1145/3007787.3001205 · pp. 645-657 · Published 2016-06-18
Citations: 11
Automatic Generation of Efficient Accelerators for Reconfigurable Hardware
D. Koeplinger, R. Prabhakar, Yaqi Zhang, Christina Delimitrou, C. Kozyrakis, K. Olukotun
Acceleration in the form of customized datapaths offer large performance and energy improvements over general purpose processors. Reconfigurable fabrics such as FPGAs are gaining popularity for use in implementing application-specific accelerators, thereby increasing the importance of having good high-level FPGA design tools. However, current tools for targeting FPGAs offer inadequate support for high-level programming, resource estimation, and rapid and automatic design space exploration. We describe a design framework that addresses these challenges. We introduce a new representation of hardware using parameterized templates that captures locality and parallelism information at multiple levels of nesting. This representation is designed to be automatically generated from high-level languages based on parallel patterns. We describe a hybrid area estimation technique which uses template-level models and design-level artificial neural networks to account for effects from hardware place-and-route tools, including routing overheads, register and block RAM duplication, and LUT packing. Our runtime estimation accounts for off-chip memory accesses. We use our estimation capabilities to rapidly explore a large space of designs across tile sizes, parallelization factors, and optional coarse-grained pipelining, all at multiple loop levels. We show that estimates average 4.8% error for logic resources, 6.1% error for runtimes, and are 279 to 6533 times faster than a commercial high-level synthesis tool. We compare the best-performing designs to optimized CPU code running on a server-grade 6 core processor and show speedups of up to 16.7×.
DOI: 10.1145/3007787.3001150 · pp. 115-127 · Published 2016-06-18
Citations: 98
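The hybrid estimation flow the abstract describes can be sketched in miniature. The toy Python model below is illustrative only: every cost number and the `par_correction` factor are invented, and the correction stands in for the paper's design-level neural network. It shows the shape of the idea — a fast analytic template estimate plus a post-place-and-route correction makes exhaustive sweeps over tile sizes and parallelization factors cheap:

```python
# Illustrative sketch (not the authors' tool): hybrid area estimation combines
# per-template analytic models with a learned correction for place-and-route
# effects such as LUT packing and register duplication. All numbers are made up.

def template_luts(tile_size: int, parallel_factor: int) -> int:
    """Analytic pre-P&R LUT count for a hypothetical reduction-tree template."""
    return 64 * parallel_factor + 8 * tile_size  # invented cost model

def par_correction(raw_luts: int) -> float:
    """Stand-in for the paper's design-level neural network: a simple
    piecewise factor modeling packing overhead growth at high utilization."""
    return 1.1 if raw_luts < 10_000 else 1.25

def estimate_luts(tile_size: int, parallel_factor: int) -> int:
    raw = template_luts(tile_size, parallel_factor)
    return int(raw * par_correction(raw))

# Because each estimate is microseconds rather than an HLS run, the whole
# design space can be swept exhaustively under a resource budget.
budget = 15_000
candidates = [(t, p) for t in (16, 32, 64) for p in (1, 2, 4, 8)]
feasible = [(t, p, estimate_luts(t, p)) for t, p in candidates
            if estimate_luts(t, p) <= budget]
best = max(feasible, key=lambda d: d[1])  # prefer the most parallel feasible design
print(best)
```

This is the economics the 279–6533× speedup over commercial HLS buys: per-design estimates become cheap enough that the sweep itself, not the estimator, dominates.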
Back to the Future: Leveraging Belady's Algorithm for Improved Cache Replacement
Akanksha Jain, Calvin Lin
Belady's algorithm is optimal but infeasible because it requires knowledge of the future. This paper explains how a cache replacement algorithm can nonetheless learn from Belady's algorithm by applying it to past cache accesses to inform future cache replacement decisions. We show that the implementation is surprisingly efficient, as we introduce a new method of efficiently simulating Belady's behavior, and we use known sampling techniques to compactly represent the long history information that is needed for high accuracy. For a 2MB LLC, our solution uses a 16KB hardware budget (excluding replacement state in the tag array). When applied to a memory-intensive subset of the SPEC 2006 CPU benchmarks, our solution improves performance over LRU by 8.4%, as opposed to 6.2% for the previous state-of-the-art. For a 4-core system with a shared 8MB LLC, our solution improves performance by 15.0%, compared to 12.0% for the previous state-of-the-art.
{"title":"Back to the Future: Leveraging Belady's Algorithm for Improved Cache Replacement","authors":"Akanksha Jain, Calvin Lin","doi":"10.1145/3007787.3001146","DOIUrl":"https://doi.org/10.1145/3007787.3001146","url":null,"abstract":"Belady's algorithm is optimal but infeasible because it requires knowledge of the future. This paper explains how a cache replacement algorithm can nonetheless learn from Belady's algorithm by applying it to past cache accesses to inform future cache replacement decisions. We show that the implementation is surprisingly efficient, as we introduce a new method of efficiently simulating Belady's behavior, and we use known sampling techniques to compactly represent the long history information that is needed for high accuracy. For a 2MB LLC, our solution uses a 16KB hardware budget (excluding replacement state in the tag array). When applied to a memory-intensive subset of the SPEC 2006 CPU benchmarks, our solution improves performance over LRU by 8.4%, as opposed to 6.2% for the previous state-of-the-art. For a 4-core system with a shared 8MB LLC, our solution improves performance by 15.0%, compared to 12.0% for the previous state-of-the-art.","PeriodicalId":6634,"journal":{"name":"2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)","volume":"85 1","pages":"78-89"},"PeriodicalIF":0.0,"publicationDate":"2016-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82187719","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 152
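The core trick — applying Belady's MIN policy to accesses that have *already* happened — can be sketched in a few lines. This is an offline software replay for illustration, not the paper's incremental hardware mechanism: it labels each past access with whether the optimal policy would have hit, and those labels are the kind of training signal a replacement predictor can learn from.

```python
# Minimal sketch (not the paper's hardware): replay a past access trace under
# Belady's MIN and record, per access, whether OPT would have hit. Such labels
# can train a predictor that classifies accesses as cache-friendly or -averse.

def belady_labels(trace, capacity):
    # Backward pass: next_use[i] = index of the next access to trace[i]'s
    # address, or infinity if it is never touched again.
    last_seen = {}
    next_use = [float('inf')] * len(trace)
    for i in range(len(trace) - 1, -1, -1):
        next_use[i] = last_seen.get(trace[i], float('inf'))
        last_seen[trace[i]] = i

    cache = {}   # resident address -> index of its next use
    labels = []  # True where OPT hits
    for i, addr in enumerate(trace):
        hit = addr in cache
        labels.append(hit)
        if hit or len(cache) < capacity:
            cache[addr] = next_use[i]
        else:
            # Evict the resident line reused farthest in the future --
            # unless the newcomer itself is reused even later (bypass).
            victim = max(cache, key=cache.get)
            if cache[victim] > next_use[i]:
                del cache[victim]
                cache[addr] = next_use[i]
    return labels

print(belady_labels(['A', 'B', 'C', 'A', 'B', 'D', 'A'], capacity=2))
```

The backward next-use pass is what makes this infeasible online — hence the paper's contribution of simulating Belady's behavior efficiently from history rather than from the future.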
ActivePointers: A Case for Software Address Translation on GPUs
Sagi Shahar, Shai Bergman, M. Silberstein
Modern discrete GPUs have been the processors of choice for accelerating compute-intensive applications, but using them in large-scale data processing is extremely challenging. Unfortunately, they do not provide important I/O abstractions long established in the CPU context, such as memory mapped files, which shield programmers from the complexity of buffer and I/O device management. However, implementing these abstractions on GPUs poses a problem: the limited GPU virtual memory system provides no address space management and page fault handling mechanisms to GPU developers, and does not allow modifications to memory mappings for running GPU programs. We implement ActivePointers, a software address translation layer and paging system that introduces native support for page faults and virtual address space management to GPU programs, and enables the implementation of fully functional memory mapped files on commodity GPUs. Files mapped into GPU memory are accessed using active pointers, which behave like regular pointers but access the GPU page cache under the hood, and trigger page faults which are handled on the GPU. We design and evaluate a number of novel mechanisms, including a translation cache in hardware registers and translation aggregation for deadlock-free page fault handling of threads in a single warp. We extensively evaluate ActivePointers on commodity NVIDIA GPUs using microbenchmarks, and also implement a complex image processing application that constructs a photo collage from a subset of 10 million images stored in a 40GB file. The GPU implementation maps the entire file into GPU memory and accesses it via active pointers. The use of active pointers adds only up to 1% to the application's runtime, while enabling speedups of up to 3.9× over a combined CPU+GPU implementation and 2.6× over a 12-core CPU-only implementation which uses AVX vector instructions.
{"title":"ActivePointers: A Case for Software Address Translation on GPUs","authors":"Sagi Shahar, Shai Bergman, M. Silberstein","doi":"10.1145/3007787.3001200","DOIUrl":"https://doi.org/10.1145/3007787.3001200","url":null,"abstract":"Modern discrete GPUs have been the processors of choice for accelerating compute-intensive applications, but using them in large-scale data processing is extremely challenging. Unfortunately, they do not provide important I/O abstractions long established in the CPU context, such as memory mapped files, which shield programmers from the complexity of buffer and I/O device management. However, implementing these abstractions on GPUs poses a problem: the limited GPU virtual memory system provides no address space management and page fault handling mechanisms to GPU developers, and does not allow modifications to memory mappings for running GPU programs. We implement ActivePointers, a software address translation layer and paging system that introduces native support for page faults and virtual address space management to GPU programs, and enables the implementation of fully functional memory mapped files on commodity GPUs. Files mapped into GPU memory are accessed using active pointers, which behave like regular pointers but access the GPU page cache under the hood, and trigger page faults which are handled on the GPU. We design and evaluate a number of novel mechanisms, including a translation cache in hardware registers and translation aggregation for deadlock-free page fault handling of threads in a single warp. We extensively evaluate ActivePointers on commodity NVIDIA GPUs using microbenchmarks, and also implement a complex image processing application that constructs a photo collage from a subset of 10 million images stored in a 40GB file. The GPU implementation maps the entire file into GPU memory and accesses it via active pointers. 
The use of active pointers adds only up to 1% to the application's runtime, while enabling speedups of up to 3.9× over a combined CPU+GPU implementation and 2.6× over a 12-core CPU-only implementation which uses AVX vector instructions.","PeriodicalId":6634,"journal":{"name":"2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)","volume":"33 1","pages":"596-608"},"PeriodicalIF":0.0,"publicationDate":"2016-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82279535","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 15
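The active-pointer idea can be illustrated with a toy software translation layer. All names here are hypothetical and the real system runs inside GPU warps against a real file, but the control flow is the same: every dereference consults a software page table, and a miss invokes a fault handler that pages data in from backing storage before the access completes.

```python
# Conceptual sketch (hypothetical names, not the authors' GPU code): a
# dereference goes through software address translation; a missing mapping
# triggers a software-handled page fault that loads the page, then retries.

PAGE = 4  # tiny pages for illustration

class MappedFile:
    def __init__(self, data, max_resident=2):
        self.backing = data        # stands in for the on-disk file
        self.page_table = {}       # virtual page -> resident page copy
        self.max_resident = max_resident
        self.faults = 0

    def _fault(self, vpage):
        """Software page-fault handler: evict if full, then page in."""
        self.faults += 1
        if len(self.page_table) >= self.max_resident:
            self.page_table.pop(next(iter(self.page_table)))  # FIFO eviction
        start = vpage * PAGE
        self.page_table[vpage] = list(self.backing[start:start + PAGE])

    def read(self, addr):
        """An 'active pointer' dereference: translate, fault on miss, access."""
        vpage, off = divmod(addr, PAGE)
        if vpage not in self.page_table:
            self._fault(vpage)
        return self.page_table[vpage][off]

f = MappedFile(list(range(32)))
vals = [f.read(i) for i in (0, 1, 9, 17, 2)]
print(vals, f.faults)  # -> [0, 1, 9, 17, 2] 4
```

The paper's engineering lies in making exactly this path cheap on a GPU — caching translations in registers and aggregating faults across a warp — which is how the overhead stays within about 1% of runtime.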
Evaluation of an Analog Accelerator for Linear Algebra
Yipeng Huang, Ning Guo, Mingoo Seok, Y. Tsividis, S. Sethumadhavan
Due to the end of supply voltage scaling and the increasing percentage of dark silicon in modern integrated circuits, researchers are looking for new scalable ways to get useful computation from existing silicon technology. In this paper we present a reconfigurable analog accelerator for solving systems of linear equations. Commonly perceived downsides of analog computing, such as low precision and accuracy, limited problem sizes, and difficulty in programming are all compensated for using methods we discuss. Based on a prototyped analog accelerator chip we compare the performance and energy consumption of the analog solver against an efficient digital algorithm running on a CPU, and find that the analog accelerator approach may be an order of magnitude faster and provide one third energy savings, depending on the accelerator design. Due to the speed and efficiency of linear algebra algorithms running on digital computers, an analog accelerator that matches digital performance needs a large silicon footprint. Finally, we conclude that problem classes outside of systems of linear equations may hold more promise for analog acceleration.
{"title":"Evaluation of an Analog Accelerator for Linear Algebra","authors":"Yipeng Huang, Ning Guo, Mingoo Seok, Y. Tsividis, S. Sethumadhavan","doi":"10.1145/3007787.3001197","DOIUrl":"https://doi.org/10.1145/3007787.3001197","url":null,"abstract":"Due to the end of supply voltage scaling and the increasing percentage of dark silicon in modern integrated circuits, researchers are looking for new scalable ways to get useful computation from existing silicon technology. In this paper we present a reconfigurable analog accelerator for solving systems of linear equations. Commonly perceived downsides of analog computing, such as low precision and accuracy, limited problem sizes, and difficulty in programming are all compensated for using methods we discuss. Based on a prototyped analog accelerator chip we compare the performance and energy consumption of the analog solver against an efficient digital algorithm running on a CPU, and find that the analog accelerator approach may be an order of magnitude faster and provide one third energy savings, depending on the accelerator design. Due to the speed and efficiency of linear algebra algorithms running on digital computers, an analog accelerator that matches digital performance needs a large silicon footprint. 
Finally, we conclude that problem classes outside of systems of linear equations may hold more promise for analog acceleration.","PeriodicalId":6634,"journal":{"name":"2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)","volume":"1 1","pages":"570-582"},"PeriodicalIF":0.0,"publicationDate":"2016-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78076867","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 21
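One standard way to compensate for limited analog precision is digital iterative refinement; the abstract says such compensation methods exist without detailing them, so the sketch below is illustrative rather than the chip's actual flow. The inexact solver (here simulated by rounding an exact 2×2 solve to two significant digits) only ever has to solve against a shrinking residual, so the combined answer converges to near digital precision:

```python
# Hedged sketch of precision compensation via iterative refinement -- not taken
# from the paper's chip. An inexact "analog" solve of A d = r is wrapped in a
# digital loop that accumulates corrections in full precision.

def matvec(A, x):
    return [sum(a * v for a, v in zip(row, x)) for row in A]

def lossy_solve(A, b, digits=2):
    """Stand-in for the analog array: an exact 2x2 solve rounded to a few
    significant digits, mimicking limited analog precision."""
    (a, c), (d, e) = A
    det = a * e - c * d
    x = [(e * b[0] - c * b[1]) / det, (a * b[1] - d * b[0]) / det]
    return [float(f"%.{digits}g" % v) for v in x]

def refine(A, b, iters=5):
    x = lossy_solve(A, b)                  # rough analog-quality solution
    for _ in range(iters):
        r = [bi - yi for bi, yi in zip(b, matvec(A, x))]  # exact residual
        d = lossy_solve(A, r)              # cheap inexact correction solve
        x = [xi + di for xi, di in zip(x, d)]
    return x

A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
print(refine(A, b))  # converges toward the true solution [1/11, 7/11]
```

Each refinement pass multiplies the error by roughly the solver's relative error times the matrix condition number, which is why a low-precision front end can still deliver a high-precision result when the problem is well conditioned.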
Journal
2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)