2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS): Latest Publications

Improving the Integration of Task Nesting and Dependencies in OpenMP
Josep M. Pérez, Vicenç Beltran, Jesús Labarta, E. Ayguadé
The tasking model of OpenMP 4.0 supports both nesting and the definition of dependences between sibling tasks. A natural way to parallelize many codes with tasks is to first taskify the high-level functions and then to further refine these tasks with additional subtasks. However, this top-down approach has some drawbacks since combining nesting with dependencies usually requires additional measures to enforce the correct coordination of dependencies across nesting levels. For instance, most non-leaf tasks need to include a taskwait at the end of their code. While these measures enforce the correct order of execution, as a side effect, they also limit the discovery of parallelism. In this paper we extend the OpenMP tasking model to improve the integration of nesting and dependencies. Our proposal builds on both formulas, nesting and dependencies, and benefits from their individual strengths. On one hand, it encourages a top-down approach to parallelizing codes that also enables the parallel instantiation of tasks. On the other hand, it allows the runtime to control dependencies at a fine grain that until now was only possible using a single domain of dependencies. Our proposal is realized through additions to the OpenMP task directive that ensure backward compatibility with current codes. We have implemented a new runtime with these extensions and used it to evaluate the impact on several benchmarks. Our initial findings show that our extensions improve performance in three areas. First, they expose more parallelism. Second, they uncover dependencies across nesting levels, which allows the runtime to make better scheduling decisions. And third, they allow the parallel instantiation of tasks with dependencies between them.
Pub Date : 2017-05-01 DOI: 10.1109/IPDPS.2017.69
Citations: 34
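The coordination issue described in the abstract above can be illustrated with standard OpenMP 4.0 directives only. The following is a minimal sketch, not the extension the paper proposes: the names `scale_block` and `scale_then_sum` and the size `N` are hypothetical. It shows a non-leaf task that must end with a `taskwait` so that its dependence on `a` is not released while its subtasks are still running.

```c
#include <assert.h>
#include <stddef.h>

#define N 8          /* hypothetical array length */
#define HALF (N / 2) /* each subtask handles one half */

/* Leaf work: double every element of a block. */
static void scale_block(double *block, size_t len) {
    for (size_t i = 0; i < len; i++) block[i] *= 2.0;
}

/* Doubles a[0..N-1] in a nested task, then sums it in a sibling task. */
static double scale_then_sum(double *a) {
    double sum = 0.0;
    assert(a != NULL);
    #pragma omp parallel
    #pragma omp single
    {
        /* Non-leaf task: refines its work into two subtasks. */
        #pragma omp task depend(inout: a[0:N]) shared(a)
        {
            #pragma omp task shared(a)
            scale_block(a, HALF);
            #pragma omp task shared(a)
            scale_block(a + HALF, HALF);
            /* Without this taskwait the outer task could complete, and
               release its dependence on a, before the subtasks finish. */
            #pragma omp taskwait
        }
        /* Sibling task: ordered after the outer task by the dependence. */
        #pragma omp task depend(in: a[0:N]) shared(a, sum)
        {
            for (size_t i = 0; i < N; i++) sum += a[i];
        }
        #pragma omp taskwait
    }
    return sum;
}
```

The `taskwait` inside the outer task is the measure the abstract refers to: it keeps the result correct, but the summing task cannot start until both halves are done, even if it only needed one of them. Compiled without OpenMP support, the pragmas are ignored and the code runs sequentially with the same result.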
Model-Driven Sparse CP Decomposition for Higher-Order Tensors
Jiajia Li, Jee W. Choi, Ioakeim Perros, Jimeng Sun, R. Vuduc
Given an input tensor, its CANDECOMP/PARAFAC decomposition (or CPD) is a low-rank representation. CPDs are of particular interest in data analysis and mining, especially when the data tensor is sparse and of higher order (dimension). This paper focuses on the central bottleneck of a CPD algorithm, which is evaluating a sequence of matricized tensor times Khatri-Rao products (MTTKRPs). To speed up the MTTKRP sequence, we propose a novel, adaptive tensor memoization algorithm, AdaTM. Besides removing redundant computations within the MTTKRP sequence, which potentially reduces its overall asymptotic complexity, our technique also allows a user to make a space-time tradeoff by automatically tuning algorithmic and machine parameters using a model-driven framework. Our method improves as the tensor order grows, making its performance more scalable for higher-order data problems. We show speedups of up to 8× and 820× on real sparse data tensors with orders as high as 85 over the SPLATT package and Tensor Toolbox library respectively; and on a full CPD algorithm (CP-ALS), AdaTM can be up to 8× faster than the state-of-the-art method implemented in SPLATT.
Pub Date : 2017-05-01 DOI: 10.1109/IPDPS.2017.80
Citations: 53
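For readers unfamiliar with the MTTKRP kernel named in the abstract above, a minimal sketch for a third-order sparse tensor in coordinate (COO) form is given here. It is the plain baseline kernel, not AdaTM; the `Nonzero` struct, the function name `mttkrp_mode1`, and the row-major layout of the factor matrices are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* One nonzero of a third-order sparse tensor in coordinate (COO) form. */
typedef struct {
    size_t i, j, k;
    double val;
} Nonzero;

/* Mode-1 MTTKRP: M(i, r) += X(i, j, k) * B(j, r) * C(k, r).
   B, C and M are dense row-major matrices with `rank` columns;
   M must be zero-initialised by the caller. */
static void mttkrp_mode1(const Nonzero *nz, size_t nnz,
                         const double *B, const double *C,
                         double *M, size_t rank) {
    assert(rank > 0);
    for (size_t n = 0; n < nnz; n++) {
        for (size_t r = 0; r < rank; r++) {
            M[nz[n].i * rank + r] +=
                nz[n].val * B[nz[n].j * rank + r] * C[nz[n].k * rank + r];
        }
    }
}
```

A CPD iteration evaluates one such product per mode. For an order-N tensor each nonzero multiplies N-1 factor rows, and consecutive modes share many of those partial products, which is presumably the redundancy that memoization targets.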
Parallel Construction of Suffix Trees and the All-Nearest-Smaller-Values Problem
P. Flick, S. Aluru
A suffix tree is a fundamental and versatile string data structure that is frequently used in important application areas such as text processing, information retrieval, and computational biology. Sequentially, the construction of suffix trees takes linear time, and optimal parallel algorithms exist only for the PRAM model. Recent works mostly target low core-count shared-memory implementations but achieve suboptimal complexity, and prior distributed-memory parallel algorithms have quadratic worst-case complexity. Suffix trees can be constructed from suffix and longest common prefix (LCP) arrays by solving the All-Nearest-Smaller-Values (ANSV) problem. In this paper, we formulate a more generalized version of the ANSV problem, and present a distributed-memory parallel algorithm for solving it in O(n/p + p) time. Our algorithm minimizes the overall and per-node communication volume. Building on this, we present a parallel algorithm for constructing a distributed representation of suffix trees, yielding both superior theoretical complexity and better practical performance compared to previous distributed-memory algorithms. We demonstrate the construction of the suffix tree for the human genome given its suffix and LCP arrays in under 2 seconds on 1024 Intel Xeon cores.
Pub Date : 2017-05-01 DOI: 10.1109/IPDPS.2017.62
Citations: 5
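The ANSV problem named in the abstract above asks, for every array element, for the nearest strictly smaller element on each side. The following is a minimal sequential sketch of the classic stack-based O(n) solution, given only to make the problem statement concrete. It is not the paper's generalized, distributed-memory algorithm; the function name `ansv`, the `NONE` sentinel, and the caller-provided scratch stack are assumptions.

```c
#include <assert.h>
#include <stddef.h>

#define NONE ((size_t)-1) /* marks "no smaller value on this side" */

/* For each a[i], store in left[i] / right[i] the index of the nearest
   strictly smaller element to the left / right, or NONE if there is none.
   `stack` is caller-provided scratch space for n indices. */
static void ansv(const int *a, size_t n,
                 size_t *left, size_t *right, size_t *stack) {
    size_t top = 0;
    assert(a != NULL || n == 0);

    /* Left-to-right pass: the stack holds indices of increasing values. */
    for (size_t i = 0; i < n; i++) {
        while (top > 0 && a[stack[top - 1]] >= a[i]) top--;
        left[i] = (top > 0) ? stack[top - 1] : NONE;
        stack[top++] = i;
    }

    /* Right-to-left pass: same idea, mirrored. */
    top = 0;
    for (size_t i = n; i-- > 0;) {
        while (top > 0 && a[stack[top - 1]] >= a[i]) top--;
        right[i] = (top > 0) ? stack[top - 1] : NONE;
        stack[top++] = i;
    }
}
```

Each index is pushed and popped at most once per pass, which gives the linear sequential bound. In the suffix tree setting the input would be the LCP array.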
DEFT-Cache: A Cost-Effective and Highly Reliable SSD Cache for RAID Storage
Ji-guang Wan, Wei Wu, Ling Zhan, Q. Yang, Xiaoyang Qu, C. Xie
This paper proposes a new SSD cache architecture, DEFT-Cache, Delayed Erasing and Fast Taping, that maximizes I/O performance and reliability of RAID storage. First of all, DEFT-Cache exploits the inherent physical properties of flash memory SSD by making use of old data that have been overwritten but still exist in the SSD to minimize the small-write penalty of RAID5/6. As data pages are overwritten in the SSD, old data pages are invalidated and become candidates for erasure and garbage collection. Our idea is to selectively delay the erasure of the pages and let these otherwise useless old data in the SSD contribute to I/O performance for parity computations upon write I/Os. Secondly, DEFT-Cache provides inexpensive redundancy to the SSD cache by having one physical SSD and one virtual SSD as a mirror cache. The virtual SSD is implemented on HDD but uses a log-structured data layout, i.e., write data are quickly logged to HDD using sequential writes. The dual and redundant caches provide a cost-effective and highly reliable write-back SSD cache. We have implemented DEFT-Cache on a Linux system. Extensive experiments have been carried out to evaluate the potential benefits of our new techniques. Experimental results on SPC and Microsoft traces have shown that DEFT-Cache improves I/O performance by 26.81% to 56.26% in terms of average user response time. The virtual SSD mirror cache can absorb write I/Os as fast as a physical SSD, providing the same reliability as two physical SSD caches without noticeable performance loss.
Pub Date : 2017-05-01 DOI: 10.1109/IPDPS.2017.54
Citations: 19
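The role of old data in the small-write penalty mentioned in the abstract above follows from the standard RAID5 read-modify-write identity: new parity = old parity XOR old data XOR new data. The sketch below shows only that identity. The function name `update_parity` is hypothetical, and how DEFT-Cache actually locates and retains the old pages is not specified here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* RAID5 read-modify-write parity update for one overwritten block:
   new_parity = old_parity XOR old_data XOR new_data.
   `parity` holds the old parity on entry and the new parity on return. */
static void update_parity(uint8_t *parity, const uint8_t *old_data,
                          const uint8_t *new_data, size_t len) {
    assert(parity != NULL && old_data != NULL && new_data != NULL);
    for (size_t i = 0; i < len; i++) {
        parity[i] ^= (uint8_t)(old_data[i] ^ new_data[i]);
    }
}
```

The update needs the old contents of the overwritten block. If those are still readable from the SSD because their erasure was delayed, the read of old data from the slower RAID disks can be avoided.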
PaPar: A Parallel Data Partitioning Framework for Big Data Applications
Pub Date : 2017-05-01 DOI: 10.1109/IPDPS.2017.119
Hao Wang, Jing Zhang, Da Zhang, S. Pumma, Wu-chun Feng
Today, big data applications can generate large-scale data sets at an unprecedented rate; and scientists have turned to parallel and distributed systems for data analysis. Although many big data processing systems provide advanced mechanisms to partition data and tackle the computational skew, it is difficult to efficiently implement skew-resistant mechanisms, because the runtime of different partitions depends not only on input data size but also on the algorithms that will be applied to the data. As a result, many research efforts have been undertaken to explore user-defined partitioning methods for different types of applications and algorithms. However, manually writing application-specific partitioning methods requires significant coding effort, and finding the optimal data partitioning strategy is particularly challenging even for developers that have mastered sufficient application knowledge. In this paper, we propose PaPar, a Parallel data Partitioning framework for big data applications, to simplify the implementations of data partitioning algorithms. PaPar provides a set of computational operators and distribution strategies for programmers to describe desired data partitioning methods. Taking an input data configuration file and a workflow configuration file as the input, PaPar can automatically generate the parallel partitioning codes by formalizing the user-defined workflow as a sequence of key-value operations and matrix-vector multiplications, and efficiently mapping to the parallel implementations with MPI and MapReduce. We apply our approach on two applications: muBLAST, an MPI implementation of BLAST algorithms for biological sequence search; and PowerLyra, a computation and partitioning method for skewed graphs. The experimental results show that compared to the partitioning methods of applications, the codes generated by PaPar can produce the same data partitions with comparable or less partitioning time.
Citations: 4
SimProf: A Sampling Framework for Data Analytic Workloads
Pub Date : 2017-05-01 DOI: 10.1109/IPDPS.2017.118
Jen-Cheng Huang, Lifeng Nai, Pranith Kumar, Hyojong Kim, Hyesoon Kim
Today, there is a steep rise in the amount of data being collected from diverse applications. Consequently, data analytic workloads are gaining popularity to gain insight that can benefit the application, e.g., financial trading, social media analysis. To study the architectural behavior of the workloads, architectural simulation is one of the most common approaches. However, because of the long-running nature of the workloads, it is not trivial to identify which parts of the analysis to simulate. In the current work, we introduce SimProf, a sampling framework for data analytic workloads. Using this tool, we are able to select representative simulation points based on the phase behavior of the analysis at a method level granularity. This provides a better understanding of the simulation point and also reduces the simulation time for different input sets. We present the framework for Apache Hadoop and Apache Spark frameworks, which can be easily extended to other data analytic workloads.
Citations: 1
Relaxations for High-Performance Message Passing on Massively Parallel SIMT Processors
Benjamin Klenk, H. Fröning, H. Eberle, Larry R. Dennison
Accelerators, such as GPUs, have proven to be highly successful in reducing execution time and power consumption of compute-intensive applications. Even though they are already used pervasively, they are typically supervised by general-purpose CPUs, which results in frequent control flow switches and data transfers as CPUs are handling all communication tasks. However, we observe that accelerators are recently being augmented with peer-to-peer communication capabilities that allow for autonomous traffic sourcing and sinking. While appropriate hardware support is becoming available, it seems that the right communication semantics are yet to be identified. Maintaining the semantics of existing communication models, such as the Message Passing Interface (MPI), seems problematic as they have been designed for the CPU’s execution model, which inherently differs from such specialized processors. In this paper, we analyze the compatibility of traditional message passing with massively parallel Single Instruction Multiple Thread (SIMT) architectures, as represented by GPUs, and focus on the message matching problem. We begin with a fully MPI-compliant set of guarantees, including tag and source wildcards and message ordering. Based on an analysis of exascale proxy applications, we start relaxing these guarantees to adapt message passing to the GPU’s execution model. We present suitable algorithms for message matching on GPUs that can yield matching rates of 60M and 500M matches/s, depending on the constraints that are being relaxed. We discuss our experiments and create an understanding of the mismatch of current message passing protocols and the architecture and execution model of SIMT processors.
Pub Date : 2017-05-01 DOI: 10.1109/IPDPS.2017.94
Citations: 21
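The message matching problem discussed in the abstract above can be stated compactly in code. The following sequential sketch shows matching with tag and source wildcards and in-order semantics. It is not one of the paper's GPU algorithms; the `PostedRecv` struct, the function `match_message`, and the constants `ANY_SOURCE` and `ANY_TAG` (stand-ins for `MPI_ANY_SOURCE` and `MPI_ANY_TAG`) are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for MPI_ANY_SOURCE and MPI_ANY_TAG. */
#define ANY_SOURCE (-1)
#define ANY_TAG    (-1)

/* A receive request waiting for a message. */
typedef struct {
    int source;  /* expected sender rank, or ANY_SOURCE */
    int tag;     /* expected tag, or ANY_TAG */
    int matched; /* nonzero once a message has been assigned */
} PostedRecv;

/* Return the index of the earliest posted receive that matches the incoming
   message (source, tag), honouring wildcards, and mark it as matched.
   Returns -1 if no posted receive matches. */
static int match_message(PostedRecv *queue, size_t n, int source, int tag) {
    assert(queue != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (queue[i].matched) continue;
        int src_ok = queue[i].source == ANY_SOURCE || queue[i].source == source;
        int tag_ok = queue[i].tag == ANY_TAG || queue[i].tag == tag;
        if (src_ok && tag_ok) {
            queue[i].matched = 1;
            return (int)i;
        }
    }
    return -1;
}
```

The scan in posting order is what the ordering guarantee demands, and the wildcards prevent a simple lookup keyed on (source, tag). These are the kinds of guarantees the abstract proposes to relax.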
MOCHA: Morphable Locality and Compression Aware Architecture for Convolutional Neural Networks
Syed M. A. H. Jafri, A. Hemani, K. Paul, Naeem Abbas
Today, machine learning based on neural networks has become mainstream in many application domains. A small subset of machine learning algorithms, called Convolutional Neural Networks (CNNs), is considered state-of-the-art for many applications (e.g. video/audio classification). The main challenge in implementing CNNs in embedded systems is their large computation, memory, and bandwidth requirements. To meet these demands, dedicated hardware accelerators have been proposed. Since memory is the major cost in CNNs, recent accelerators focus on reducing the memory accesses. In particular, they exploit data locality using either tiling, layer merging or intra/inter feature map parallelism to reduce the memory footprint. However, they lack the flexibility to interleave or cascade these optimizations. Moreover, most of the existing accelerators do not exploit compression that can simultaneously reduce memory requirements, increase the throughput, and enhance the energy efficiency. To tackle these limitations, we present a flexible accelerator called MOCHA. MOCHA has three features that differentiate it from the state-of-the-art: (i) the ability to compress input/kernels, (ii) the flexibility to interleave various optimizations, and (iii) intelligence to automatically interleave and cascade the optimizations, depending on the dimension of a specific CNN layer and available resources. Post-layout synthesis results reveal that MOCHA provides up to 63% higher energy efficiency, up to 42% higher throughput, and up to 30% less storage, compared to the next best accelerator, at the cost of 26-35% additional area.
DOI: 10.1109/IPDPS.2017.59 (https://doi.org/10.1109/IPDPS.2017.59); published 2017-05-01
Citations: 15
FFQ: A Fast Single-Producer/Multiple-Consumer Concurrent FIFO Queue
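The MOCHA abstract names tiling as one of the data-locality optimizations that accelerators use. As a rough illustration only, the sketch below shows tiling in software: the output of a 2D convolution is computed tile by tile, and each tile reads only the small input patch it needs (the tile plus a halo of kernel size minus one). The function names, the `tile` parameter, and the pure-Python list representation are assumptions made for this example; this is not MOCHA's hardware dataflow.

```python
def conv2d(image, kernel):
    """Direct 'valid' 2D convolution (cross-correlation form, as used in CNNs)."""
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(ow)] for r in range(oh)]


def conv2d_tiled(image, kernel, tile=2):
    """Same result as conv2d, computed one output tile at a time.

    `tile` is a hypothetical tuning knob: the side length of an output tile.
    """
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = len(image) - kh + 1, len(image[0]) - kw + 1
    out = [[0] * ow for _ in range(oh)]
    for r0 in range(0, oh, tile):
        for c0 in range(0, ow, tile):
            r1, c1 = min(r0 + tile, oh), min(c0 + tile, ow)
            # Input patch needed by this output tile, including the halo.
            patch = [row[c0:c1 + kw - 1] for row in image[r0:r1 + kh - 1]]
            block = conv2d(patch, kernel)
            # Copy the finished tile into its place in the full output.
            for r in range(r1 - r0):
                out[r0 + r][c0:c1] = block[r]
    return out
```

In a real accelerator the patch would live in a small on-chip buffer, so a smaller `tile` trades a smaller buffer for more halo data being fetched repeatedly.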
Sergei Arnautov, P. Felber, C. Fetzer, Bohdan Trach
With the spread of multi-core architectures, operating systems and applications are becoming increasingly concurrent, and their scalability is often limited by the primitives used to synchronize the different hardware threads. In this paper, we address the problem of how to optimize the throughput of a system with multiple producer and consumer threads. Such applications typically synchronize their threads via multi-producer/multi-consumer FIFO queues, but existing solutions scale poorly, as we observed when designing a secure application framework that requires high-throughput communication between many concurrent threads. In our target system, however, the items enqueued by different producers do not necessarily need to be FIFO ordered. Hence, we propose a fast FIFO queue, FFQ, that aims at maximizing throughput by specializing the algorithm for single-producer/multiple-consumer settings: each producer has its own queue from which multiple consumers can concurrently dequeue. Furthermore, while we provide a wait-free interface for producers, we limit ourselves to lock-free consumers to eliminate the need for helping. We also propose a multi-producer variant to show which synchronization operations we were able to remove by focusing on a single-producer variant. Our evaluation analyses the performance using micro-benchmarks and compares our results with other state-of-the-art solutions: FFQ exhibits excellent performance and scalability.
DOI: 10.1109/IPDPS.2017.41 (https://doi.org/10.1109/IPDPS.2017.41); published 2017-05-01
Citations: 12
Fly-Over: A Light-Weight Distributed Power-Gating Mechanism for Energy-Efficient Networks-on-Chip
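The FFQ abstract describes a layout in which each producer has its own queue and ordering is only required per producer, not across producers. The sketch below illustrates that layout and nothing more: it is not the FFQ algorithm, and it makes no wait-free or lock-free claims. The class name `PerProducerQueues`, the round-robin scan, and the use of `collections.deque` are assumptions chosen for this example.

```python
from collections import deque
from itertools import count


class PerProducerQueues:
    """One FIFO per producer; a consumer may take from any of them."""

    def __init__(self, num_producers):
        self._queues = [deque() for _ in range(num_producers)]
        self._cursor = count()  # rotating start index (hypothetical policy)

    def enqueue(self, producer_id, item):
        # Only producer `producer_id` ever appends to its own queue.
        self._queues[producer_id].append(item)

    def dequeue(self):
        # Scan every queue once, starting at a rotating offset.
        n = len(self._queues)
        start = next(self._cursor) % n
        for i in range(n):
            try:
                return self._queues[(start + i) % n].popleft()
            except IndexError:
                continue  # this producer's queue is empty, try the next
        return None  # all queues were empty
```

Items from one producer always come out in the order that producer enqueued them, while items from different producers may interleave arbitrarily, which matches the relaxed ordering requirement stated in the abstract.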
R. Boyapati, Jiayi Huang, Ningyuan Wang, Kyung Hoon Kim, K. H. Yum, Eun Jung Kim
Scalable Networks-on-Chip (NoCs) have become the de facto interconnection mechanism in large-scale Chip Multiprocessors. Not only do NoCs devour a large fraction of the on-chip power budget, but static NoC power consumption is also becoming the dominant component as technology scales down. Hence, reducing static NoC power consumption is critical for energy-efficient computing. Previous research has proposed power-gating routers attached to inactive cores to save static power, but this requires centralized control and global network knowledge. In this paper, we propose Fly-Over (FLOV), a light-weight distributed mechanism for power-gating routers, which encompasses the FLOV router architecture, handshake protocols, and a partition-based dynamic routing algorithm to maintain network functionality. With simple modifications to the baseline router architecture, FLOV can facilitate FLOV links over power-gated routers. We then present two handshake protocols for FLOV routers: restricted FLOV, which can power-gate routers under restricted conditions, and generalized FLOV, which has more power-saving capability. The proposed routing algorithm provides best-effort minimal path routing without the need for global network information. We evaluate our schemes using synthetic workloads as well as real workloads from the PARSEC 2.1 benchmark suite. Our full-system evaluations show that FLOV reduces the total and static energy consumption by 18% and 22% respectively, on average across several benchmarks, compared to a state-of-the-art NoC power-gating mechanism, while keeping the performance degradation minimal.
DOI: 10.1109/IPDPS.2017.77 (https://doi.org/10.1109/IPDPS.2017.77); published 2017-05-01
Citations: 7
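The Fly-Over abstract says that FLOV links let traffic pass over power-gated routers. The toy model below illustrates only that one idea for a single row of routers: a packet travelling along the row is processed by the powered-on routers and skips the gated ones. The function name, the 1D topology, and the assumption that source and destination routers are powered on are all hypothetical; the handshake protocols and the partition-based routing algorithm from the paper are not modeled.

```python
def routers_traversed(src, dst, gated):
    """Routers in a 1D row that actively process a packet from src to dst.

    `gated` is the set of power-gated router indices; the packet is assumed
    to fly over them. Assumes src and dst themselves are powered on.
    """
    step = 1 if dst >= src else -1
    path = range(src, dst + step, step)
    return [r for r in path if r not in gated]
```

For example, with routers 1 and 3 gated, a packet from router 0 to router 4 is handled by routers 0, 2, and 4 only.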