
Latest Publications in ACM Transactions on Parallel Computing

A High-Quality and Fast Maximal Independent Set Implementation for GPUs
IF 1.6 Q2 Computer Science Pub Date : 2019-01-23 DOI: 10.1145/3291525
Martin Burtscher, Sindhu Devale, S. Azimi, J. Jaiganesh, Evan Powers
Computing a maximal independent set is an important step in many parallel graph algorithms. This article introduces ECL-MIS, a maximal independent set implementation that works well on GPUs. It includes key optimizations to speed up computation, reduce the memory footprint, and increase the set size. Its CUDA implementation requires fewer than 30 kernel statements, runs asynchronously, and produces a deterministic result. It outperforms the maximal independent set implementations of Pannotia, CUSP, and IrGL on each of the 16 tested graphs of various types and sizes. On a Titan X GPU, ECL-MIS is between 3.9 and 100 times faster (11.5 times, on average). ECL-MIS running on the GPU is also faster than the parallel CPU codes Ligra, Ligra+, and PBBS running on 20 Xeon cores, which it outperforms by 4.1 times, on average. At the same time, ECL-MIS produces maximal independent sets that are up to 52% larger (over 10%, on average) compared to these preexisting CPU and GPU implementations. Whereas these codes produce maximal independent sets that are, on average, about 15% smaller than the largest possible such sets, ECL-MIS sets are less than 6% smaller than the maximum independent sets.
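To make the setting concrete, the sketch below shows the generic priority-based (Luby-style) iteration that GPU maximal-independent-set codes in this family build on. It is a plain serial illustration of the problem being solved, not the ECL-MIS kernels themselves; the function name and graph encoding are invented for the example.

```python
import random

def luby_style_mis(adj, seed=0):
    """Priority-based maximal independent set (Luby-style) sketch.

    adj: dict mapping each vertex to a set of neighbours.
    Returns a set of vertices that is independent and maximal.
    Illustrative serial version of the general scheme that GPU codes
    such as ECL-MIS parallelise; not the authors' implementation.
    """
    rng = random.Random(seed)
    undecided = set(adj)
    prio = {v: rng.random() for v in adj}   # random priorities break ties
    mis = set()
    while undecided:
        # A vertex joins the set if it beats all still-undecided neighbours.
        winners = {v for v in undecided
                   if all(prio[v] > prio[u] for u in adj[v] if u in undecided)}
        mis |= winners
        # Winners are in the set; their neighbours are permanently excluded.
        undecided -= winners
        undecided -= {u for v in winners for u in adj[v]}
    return mis

# Tiny example: the path 0-1-2-3.
graph = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
print(sorted(luby_style_mis(graph)))
```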
Citations: 9
BARAN: Bimodal Adaptive Reconfigurable-Allocator Network-on-Chip
IF 1.6 Q2 Computer Science Pub Date : 2019-01-23 DOI: 10.1145/3294049
Amirhossein Mirhosseini, Mohammad Sadrosadati, F. Aghamohammadi, M. Modarressi, H. Sarbazi-Azad
Virtual channels are employed to improve the throughput under high traffic loads in Networks-on-Chips (NoCs). However, they can impose non-negligible overheads on performance by prolonging clock cycle time, especially under low traffic loads where the impact of virtual channels on performance is trivial. In this article, we propose a novel architecture, called BARAN, that can either improve on-chip network performance or reduce its power consumption (depending on the specific implementation chosen), not both at the same time, when virtual channels are underutilized; that is, the average number of virtual channel allocation requests per cycle is lower than the number of total virtual channels. We also introduce a reconfigurable arbitration logic within the BARAN architecture that can be configured to have multiple latencies and, hence, multiple slack times. The increased slack times are then used to reduce the supply voltage of the routers or increase their clock frequency in order to reduce power consumption or improve the performance of the whole NoC system. The power-centric design of BARAN reduces NoC power consumption by 43.4% and 40.6% under CMP and GPU workloads, on average, respectively, compared to a baseline architecture while imposing negligible area and performance overheads. The performance-centric design of BARAN reduces the average packet latency by 45.4% and 42.1%, on average, under CMP and GPU workloads, respectively, compared to the baseline architecture while increasing power consumption by 39.7% and 43.7%, on average. Moreover, the performance-centric BARAN postpones the network saturation rate by 11.5% under uniform random traffic compared to the baseline architecture.
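The mode decision described in the abstract—reconfigure only when virtual channels are underutilized—can be paraphrased in a few lines. The sketch below is a toy restatement of that condition (average VC allocation requests per cycle versus total VCs); the actual BARAN allocator is reconfigurable hardware logic and is not modelled here.

```python
def choose_mode(vc_requests_per_cycle_window, total_vcs):
    """Toy restatement of BARAN's underutilization test.

    The article defines underutilization as the average number of VC
    allocation requests per cycle falling below the total number of
    virtual channels; everything else here is illustrative.
    """
    avg_requests = sum(vc_requests_per_cycle_window) / len(vc_requests_per_cycle_window)
    if avg_requests < total_vcs:
        # VCs underutilized: arbitration slack can be traded for lower
        # supply voltage (power-centric) or a faster clock (performance-centric).
        return "reconfigured"
    return "baseline"

print(choose_mode([3, 2, 4, 1], total_vcs=8))   # -> "reconfigured"
```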
Citations: 10
Lock Contention Management in Multithreaded MPI
IF 1.6 Q2 Computer Science Pub Date : 2019-01-23 DOI: 10.1145/3275443
A. Amer, Huiwei Lu, P. Balaji, Milind Chabbi, Yanjie Wei, J. Hammond, S. Matsuoka
In this article, we investigate contention management in lock-based thread-safe MPI libraries. Specifically, we make two assumptions: (1) locks are the only form of synchronization when protecting communication paths; and (2) contention occurs, and thus serialization is unavoidable. Our work distinguishes between lock acquisitions with respect to work being performed inside a critical section; productive vs. unproductive. Waiting for message reception without doing anything else inside a critical section is an example of unproductive lock acquisition. We show that the high-throughput nature of modern scalable locking protocols translates into better communication progress for throughput-intensive MPI communication but negatively impacts latency-sensitive communication because of overzealous unproductive lock acquisition. To reduce unproductive lock acquisitions, we devised a method that promotes threads with productive work using a generic two-level priority locking protocol. Our results show that using a high-throughput protocol for productive work and a fair protocol for less productive code paths ensures the best tradeoff for fine-grained communication, whereas a fair protocol is sufficient for more coarse-grained communication. Although these efforts have been rewarding, scalability degradation remains significant. We discuss techniques that diverge from the pure locking model and offer the potential to further improve scalability.
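The sketch below illustrates the general idea of a two-level priority lock in which waiters doing productive work are served before unproductive ones (for example, threads merely polling for message arrival). It is a minimal user-level illustration using Python's threading module, not the MPI-internal protocol the authors implemented; the class and function names are invented.

```python
import threading
from contextlib import contextmanager

class TwoLevelPriorityLock:
    """Sketch of a two-level priority lock: waiters that declare
    productive work are served before unproductive ones."""

    def __init__(self):
        self._cond = threading.Condition()
        self._held = False
        self._waiting_productive = 0

    @contextmanager
    def acquire(self, productive):
        with self._cond:
            if productive:
                self._waiting_productive += 1
                while self._held:
                    self._cond.wait()
                self._waiting_productive -= 1
            else:
                # Unproductive waiters yield to any productive waiter.
                while self._held or self._waiting_productive > 0:
                    self._cond.wait()
            self._held = True
        try:
            yield
        finally:
            with self._cond:
                self._held = False
                self._cond.notify_all()

lock = TwoLevelPriorityLock()

def worker(tag, productive):
    with lock.acquire(productive):
        print(f"{tag} in critical section (productive={productive})")

threads = [threading.Thread(target=worker, args=(f"t{i}", i % 2 == 0)) for i in range(4)]
for t in threads: t.start()
for t in threads: t.join()
```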
Citations: 4
PowerLyra: Differentiated Graph Computation and Partitioning on Skewed Graphs
IF 1.6 Q2 Computer Science Pub Date : 2019-01-23 DOI: 10.1145/3298989
Rong Chen, Jiaxin Shi, Yanzhe Chen, B. Zang, Haibing Guan, Haibo Chen
Natural graphs with skewed distributions raise unique challenges to distributed graph computation and partitioning. Existing graph-parallel systems usually use a “one-size-fits-all” design that uniformly processes all vertices, which either suffer from notable load imbalance and high contention for high-degree vertices (e.g., Pregel and GraphLab) or incur high communication cost and memory consumption even for low-degree vertices (e.g., PowerGraph and GraphX). In this article, we argue that skewed distributions in natural graphs also necessitate differentiated processing on high-degree and low-degree vertices. We then introduce PowerLyra, a new distributed graph processing system that embraces the best of both worlds of existing graph-parallel systems. Specifically, PowerLyra uses centralized computation for low-degree vertices to avoid frequent communications and distributes the computation for high-degree vertices to balance workloads. PowerLyra further provides an efficient hybrid graph partitioning algorithm (i.e., hybrid-cut) that combines edge-cut (for low-degree vertices) and vertex-cut (for high-degree vertices) with heuristics. To improve cache locality of inter-node graph accesses, PowerLyra further provides a locality-conscious data layout optimization. PowerLyra is implemented based on the latest GraphLab and can seamlessly support various graph algorithms running in both synchronous and asynchronous execution modes. A detailed evaluation on three clusters using various graph-analytics and MLDM (Machine Learning and Data Mining) applications shows that PowerLyra outperforms PowerGraph by up to 5.53X (from 1.24X) and 3.26X (from 1.49X) for real-world and synthetic graphs, respectively, and is much faster than other systems like GraphX and Giraph, yet with much less memory consumption. A porting of hybrid-cut to GraphX further confirms the efficiency and generality of PowerLyra.
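A simplified restatement of the hybrid-cut idea—edge-cut placement for low-degree vertices, vertex-cut placement for high-degree ones—is sketched below. The threshold, the hashing scheme, and the omission of replica/mirror construction are simplifications for illustration; this is not PowerLyra's actual partitioner.

```python
from collections import defaultdict

def hybrid_cut(edges, num_parts, degree_threshold):
    """Simplified sketch of the hybrid-cut idea.

    edges: (src, dst) pairs of a directed graph.
    In-edges of low in-degree vertices stay together on one partition
    (edge-cut style, hashed by destination); in-edges of high in-degree
    vertices are spread across partitions (vertex-cut style, hashed by source).
    """
    edges = list(edges)
    in_degree = defaultdict(int)
    for _, dst in edges:
        in_degree[dst] += 1

    parts = defaultdict(list)
    for src, dst in edges:
        if in_degree[dst] <= degree_threshold:
            parts[hash(dst) % num_parts].append((src, dst))   # edge-cut style
        else:
            parts[hash(src) % num_parts].append((src, dst))   # vertex-cut style
    return dict(parts)

edges = [(0, 1), (2, 1), (3, 1), (4, 1), (0, 2), (3, 4)]
print(hybrid_cut(edges, num_parts=2, degree_threshold=2))
```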
Citations: 323
An Autotuning Protocol to Rapidly Build Autotuners
IF 1.6 Q2 Computer Science Pub Date : 2019-01-23 DOI: 10.1145/3291527
Junhong Liu, Guangming Tan, Yulong Luo, Jiajia Li, Z. Mo, Ninghui Sun
Automatic performance tuning (Autotuning) is an increasingly critical tuning technique for the high portable performance of Exascale applications. However, constructing an autotuner from scratch remains a challenge, even for domain experts. In this work, we propose a performance tuning and knowledge management suite (PAK) to help rapidly build autotuners. In order to accommodate existing autotuning techniques, we present an autotuning protocol that is composed of an extractor, producer, optimizer, evaluator, and learner. To achieve modularity and reusability, we also define programming interfaces for each protocol component as the fundamental infrastructure, which provides a customizable mechanism to deploy knowledge mining in the performance database. PAK’s usability is demonstrated by studying two important computational kernels: stencil computation and sparse matrix-vector multiplication (SpMV). Our proposed autotuner based on PAK shows comparable performance and higher productivity than traditional autotuners by writing just a few tens of code using our autotuning protocol.
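The protocol roles named in the abstract (extractor, producer, optimizer, evaluator, learner) can be pictured as a small tuning loop. The sketch below uses those names only as comments; the function bodies, toy kernels, and exhaustive search strategy are placeholders and do not reflect PAK's actual programming interfaces.

```python
import itertools, time

def autotune(kernel_variants, problem, budget=10):
    """Minimal sketch of an extractor/producer/optimizer/evaluator/learner loop."""
    # Extractor: pull tunable features from the problem instance.
    features = {"size": len(problem)}

    # Producer: enumerate candidate configurations (variant x unroll factor).
    candidates = list(itertools.product(kernel_variants, [1, 2, 4, 8]))

    history = []                       # Learner: accumulated (features, config, time) knowledge
    best_cfg, best_time = None, float("inf")

    for cfg in candidates[:budget]:    # Optimizer: here, plain exhaustive search
        variant, unroll = cfg
        start = time.perf_counter()    # Evaluator: run and time the candidate
        variant(problem, unroll)
        elapsed = time.perf_counter() - start
        history.append((features, cfg, elapsed))
        if elapsed < best_time:
            best_cfg, best_time = cfg, elapsed
    return best_cfg, best_time, history

# Toy "kernels": two ways to sum a list, with an unused unroll parameter.
def sum_builtin(data, unroll): return sum(data)
def sum_loop(data, unroll):
    total = 0
    for x in data: total += x
    return total

cfg, t, _ = autotune([sum_builtin, sum_loop], list(range(100000)))
print(cfg[0].__name__, t)
```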
Citations: 4
Scheduling Dynamic Parallel Workload of Mobile Devices with Access Guarantees
IF 1.6 Q2 Computer Science Pub Date : 2018-12-08 DOI: 10.1145/3291529
Antonio Fernández, D. Kowalski, Miguel A. Mosteiro, Prudence W. H. Wong
We study a dynamic resource-allocation problem that arises in various parallel computing scenarios, such as mobile cloud computing, cloud computing systems, Internet of Things systems, and others. Generically, we model the architecture as client mobile devices and static base stations. Each client “arrives” to the system to upload data to base stations by radio transmissions and then “leaves.” The problem, called Station Assignment, is to assign clients to stations so that every client uploads their data under some restrictions, including a target subset of stations, a maximum delay between transmissions, a volume of data to upload, and a maximum bandwidth for each station. We study the solvability of Station Assignment under an adversary that controls the arrival and departure of clients, limited to maximum rate and burstiness of such arrivals. We show upper and lower bounds on the rate and burstiness for various client arrival schedules and protocol classes. To the best of our knowledge, this is the first time that Station Assignment is studied under adversarial arrivals and departures.
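To fix intuition for the constraints in the problem statement—allowed station subsets, per-client demand, per-station bandwidth—here is a toy greedy assignment. It only illustrates the model; the article's contribution is the adversarial online analysis and the rate/burstiness bounds, not this heuristic.

```python
def assign_clients(clients, station_bandwidth):
    """Toy greedy sketch of the Station Assignment setting.

    Each client is (client_id, allowed_stations, demand); a client is
    placed on the first allowed station with enough residual bandwidth.
    """
    residual = dict(station_bandwidth)
    assignment = {}
    for cid, allowed, demand in clients:
        for s in allowed:
            if residual.get(s, 0) >= demand:
                residual[s] -= demand
                assignment[cid] = s
                break
        else:
            assignment[cid] = None     # rejected: no allowed station has capacity
    return assignment

clients = [("c1", ["s1", "s2"], 3), ("c2", ["s1"], 4), ("c3", ["s2"], 2)]
print(assign_clients(clients, {"s1": 5, "s2": 4}))
```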
Citations: 6
New High Performance GPGPU Code Transformation Framework Applied to Large Production Weather Prediction Code
IF 1.6 Q2 Computer Science Pub Date : 2018-02-16 DOI: 10.1145/3291523
Michel Müller, T. Aoki
We introduce “Hybrid Fortran,” a new approach that allows a high-performance GPGPU port for structured grid Fortran codes. This technique only requires minimal changes for a CPU targeted codebase, which is a significant advancement in terms of productivity. It has been successfully applied to both dynamical core and physical processes of ASUCA, a Japanese mesoscale weather prediction model with more than 150k lines of code. By means of a minimal weather application that resembles ASUCA’s code structure, Hybrid Fortran is compared to both a performance model as well as today’s commonly used method, OpenACC. As a result, the Hybrid Fortran implementation is shown to deliver the same or better performance than OpenACC, and its performance agrees with the model both on CPU and GPU. In a full-scale production run, using an ASUCA grid with 1581 × 1301 × 58 cells and real-world weather data in 2km resolution, 24 NVIDIA Tesla P100 running the Hybrid Fortran–based GPU port are shown to replace more than fifty 18-core Intel Xeon Broadwell E5-2695 v4 running the reference implementation—an achievement comparable to more invasive GPGPU rewrites of other weather models.
Citations: 7
Adaptive Optimization Modeling of Preconditioned Conjugate Gradient on Multi-GPUs
IF 1.6 Q2 Computer Science Pub Date : 2016-12-26 DOI: 10.1145/2990849
Jiaquan Gao, Yu Wang, Jun Wang, Ronghua Liang
The preconditioned conjugate gradient (PCG) algorithm is a well-known iterative method for solving sparse linear systems in scientific computations. GPU-accelerated PCG algorithms for large-sized problems have attracted considerable attention recently. However, on a specific multi-GPU platform, producing a highly parallel PCG implementation for any large-sized problem requires significant time because several manual steps are involved in adjusting the related parameters and selecting an appropriate storage format for the matrix block that is assigned to each GPU. This motivates us to propose adaptive optimization modeling of PCG on multi-GPUs, which mainly involves the following parts: (1) an optimization multi-GPU parallel framework of PCG and (2) the profile-based optimization modeling for each one of the main components of the PCG algorithm, including vector operation, inner product, and sparse matrix-vector multiplication (SpMV). Our model does not construct a new storage format or kernel but automatically and rapidly generates an optimal parallel PCG algorithm for any problem on a specific multi-GPU platform by integrating existing storage formats and kernels. We take a vector operation kernel, an inner-product kernel, and five popular SpMV kernels for an example to present the idea of constructing the model. Given that our model is general, independent of the problems, and dependent on the resources of devices, this model is constructed only once for each type of GPU. The experiments validate the high efficiency of our proposed model.
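For reference, the algorithm whose multi-GPU tuning is being modelled is the textbook preconditioned conjugate gradient loop, sketched below in plain NumPy with a Jacobi preconditioner. It is a single-device reference version, not the authors' auto-generated multi-GPU code.

```python
import numpy as np

def pcg(A, b, M_inv, tol=1e-8, max_iter=1000):
    """Textbook preconditioned conjugate gradient.

    A: symmetric positive-definite matrix (dense here for simplicity).
    M_inv: function applying the preconditioner inverse, z = M^{-1} r.
    """
    x = np.zeros_like(b)
    r = b - A @ x                 # residual
    z = M_inv(r)                  # preconditioned residual
    p = z.copy()                  # search direction
    rz = r @ z
    for _ in range(max_iter):
        Ap = A @ p                # the (Sp)MV step the article singles out
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = M_inv(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

# Example with a Jacobi (diagonal) preconditioner.
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
jacobi = lambda r: r / np.diag(A)
print(pcg(A, b, jacobi))          # ~ [0.0909, 0.6364]
```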
Citations: 5
Damaris: Addressing Performance Variability in Data Management for Post-Petascale Simulations
IF 1.6 Q2 Computer Science Pub Date : 2016-12-26 DOI: 10.1145/2987371
Matthieu Dorier, Gabriel Antoniu, F. Cappello, M. Snir, R. Sisneros, Orcun Yildiz, Shadi Ibrahim, T. Peterka, Leigh Orf
With exascale computing on the horizon, reducing performance variability in data management tasks (storage, visualization, analysis, etc.) is becoming a key challenge in sustaining high performance. This variability significantly impacts the overall application performance at scale and its predictability over time. In this article, we present Damaris, a system that leverages dedicated cores in multicore nodes to offload data management tasks, including I/O, data compression, scheduling of data movements, in situ analysis, and visualization. We evaluate Damaris with the CM1 atmospheric simulation and the Nek5000 computational fluid dynamic simulation on four platforms, including NICS’s Kraken and NCSA’s Blue Waters. Our results show that (1) Damaris fully hides the I/O variability as well as all I/O-related costs, thus making simulation performance predictable; (2) it increases the sustained write throughput by a factor of up to 15 compared with standard I/O approaches; (3) it allows almost perfect scalability of the simulation up to over 9,000 cores, as opposed to state-of-the-art approaches that fail to scale; and (4) it enables a seamless connection to the VisIt visualization software to perform in situ analysis and visualization in a way that impacts neither the performance of the simulation nor its variability. In addition, we extended our implementation of Damaris to also support the use of dedicated nodes and conducted a thorough comparison of the two approaches—dedicated cores and dedicated nodes—for I/O tasks with the aforementioned applications.
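The dedicated-core idea—handing data off to a core that performs I/O asynchronously so compute processes never block on storage—can be sketched with a producer/consumer pair. The example below uses Python's multiprocessing purely for illustration; the file name and message format are invented, and Damaris's real interface (configuration files, compression, in situ visualization hooks) is not modelled.

```python
import multiprocessing as mp
import json, time

def io_core(queue, path):
    """Dedicated 'I/O core': drains a queue and writes to disk so the
    simulation processes never block on storage."""
    with open(path, "w") as f:
        while True:
            item = queue.get()
            if item is None:          # sentinel: shut down
                break
            f.write(json.dumps(item) + "\n")

if __name__ == "__main__":
    q = mp.Queue()
    writer = mp.Process(target=io_core, args=(q, "snapshots.jsonl"))
    writer.start()

    for step in range(5):             # the "simulation" loop
        field = [step * 0.1] * 4      # pretend this is a computed field
        q.put({"step": step, "field": field})   # hand off, return immediately
        time.sleep(0.01)              # keep computing instead of waiting on I/O

    q.put(None)
    writer.join()
```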
Citations: 44
Transparently Space Sharing a Multicore Among Multiple Processes
IF 1.6 Q2 Computer Science Pub Date : 2016-12-26 DOI: 10.1145/3001910
T. Creech, R. Barua
As hardware becomes increasingly parallel and the availability of scalable parallel software improves, the problem of managing multiple multithreaded applications (processes) becomes important. Malleable processes, which can vary the number of threads used as they run, enable sophisticated and flexible resource management. Although many existing applications parallelized for SMPs with parallel runtimes are in fact already malleable, deployed runtime environments provide no interface nor any strategy for intelligently allocating hardware threads or even preventing oversubscription. Prior research methods either depend on profiling applications ahead of time to make good decisions about allocations or do not account for process efficiency at all, leading to poor performance. None of these prior methods have been adapted widely in practice. This article presents the Scheduling and Allocation with Feedback (SCAF) system: a drop-in runtime solution that supports existing malleable applications in making intelligent allocation decisions based on observed efficiency without any changes to semantics, program modification, offline profiling, or even recompilation. Our existing implementation can control most unmodified OpenMP applications. Other malleable threading libraries can also easily be supported with small modifications without requiring application modification or recompilation. In this work, we present the SCAF daemon and a SCAF-aware port of the GNU OpenMP runtime. We present a new technique for estimating process efficiency purely at runtime using available hardware counters and demonstrate its effectiveness in aiding allocation decisions. We evaluated SCAF using NAS NPB parallel benchmarks on five commodity parallel platforms, enumerating architectural features and their effects on our scheme. We measured the benefit of SCAF in terms of sum of speedups improvement (a common metric for multiprogrammed environments) when running all benchmark pairs concurrently compared to equipartitioning—the best existing competing scheme in the literature. We found that SCAF improves on equipartitioning on four out of five machines, showing a mean improvement factor in sum of speedups of 1.04 to 1.11x for benchmark pairs, depending on the machine, and 1.09x on average. Since we are not aware of any widely available tool for equipartitioning, we also compare SCAF against multiprogramming using unmodified OpenMP, which is the only environment available to end users today. SCAF improves on the unmodified OpenMP runtimes for all five machines, with a mean improvement of 1.08 to 2.07x, depending on the machine, and 1.59x on average.
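The allocation policy can be pictured as: measure each process's recent efficiency (speedup per thread) and hand out hardware threads in proportion to it. The sketch below captures only that proportional step with invented numbers; SCAF's hardware-counter-based efficiency estimation and its daemon/runtime protocol are not modelled.

```python
def reallocate(measured, total_threads):
    """Sketch of efficiency-driven thread allocation in the spirit of SCAF.

    measured: {process_id: (threads_used, observed_speedup)} from the last
    interval; efficiency = speedup / threads.
    """
    efficiency = {p: spd / thr for p, (thr, spd) in measured.items()}
    total_eff = sum(efficiency.values())
    alloc = {p: max(1, round(total_threads * e / total_eff))
             for p, e in efficiency.items()}
    # Trim if rounding oversubscribed the machine.
    while sum(alloc.values()) > total_threads:
        shrinkable = [p for p in alloc if alloc[p] > 1]
        if not shrinkable:
            break
        worst = min(shrinkable, key=lambda p: efficiency[p])
        alloc[worst] -= 1
    return alloc

# Two malleable processes: one scales well, one poorly.
print(reallocate({"lu": (8, 6.5), "bt": (8, 3.0)}, total_threads=16))
```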
Citations: 2