
Latest publications from the 2018 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)

A Scalable Unified Model for Dynamic Data Structures in Message Passing (Clusters) and Shared Memory (multicore CPUs) Computing environments
G. Laccetti, M. Lapegna, R. Montella
Concurrent data structures are widely used at many levels of the software stack, ranging from high-level parallel scientific applications to low-level operating systems. The key issue with these objects is their concurrent use by several computing units (threads or processes), which makes their design much more difficult than that of their sequential counterparts: their extremely dynamic nature requires protocols to ensure data consistency, at a significant cost overhead. In this regard, several studies emphasize a tension between the sequential correctness of concurrent data structures and the scalability of the algorithms, and in many cases the data structure design must be rethought, using approaches based on randomization and/or redistribution techniques, in order to fully exploit the computational power of recent computing environments. The problem has grown in importance with the new generation of High Performance Computing systems aimed at extreme performance. Such systems are based on heterogeneous architectures that integrate several independent nodes in the form of clusters or MPP systems, where each node is composed of powerful computing elements (CPU cores, GPUs or other acceleration devices) sharing the resources of a single node. These systems therefore make massive use of communication libraries to exchange data among the nodes, as well as other tools for managing the shared resources inside a single node. For this reason, developing algorithms and scientific software for dynamic data structures on these heterogeneous systems requires a suitable combination of methodologies and tools to deal with the different kinds of parallelism offered by each specific device, so as to be aware of the underlying platform. The present work introduces a scalable model to manage a special class of dynamic data structure, the heap-based priority queue (or simply heap), on these heterogeneous architectures. A heap is generally used when an application needs a set of data that does not require a complete ordering, but only access to some items tagged with high priority. To balance correct access to high-priority items by the several computing units against a low communication and synchronization overhead, a suitable reorganization of the heap is needed. More precisely, we introduce a unified scalable model that can be used, with no modifications, to redeploy the items of a heap both in message passing environments (such as clusters or MPP multicomputers with several nodes) and in shared memory environments (such as CPUs and multiprocessors with several cores), with an overhead independent of the number of computing units. Computational results obtained by applying the proposed strategy to some numerical case studies are presented for different types of computing environments.
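A minimal sketch of the redistribution idea, under our own assumptions (a fixed exchange size k and round-robin dealing; this is an illustration, not the authors' algorithm): each computing unit keeps a local heap, and a periodic redistribution step pools the k smallest items of every local heap and deals them back across the units, so that each unit can serve high-priority items locally.

```python
import heapq

def redistribute(local_heaps, k=4):
    """Illustrative redistribution step for a partitioned min-heap.

    Each computing unit (process or thread) owns one local heap. The k
    smallest items of every local heap are pooled, sorted and dealt back
    round-robin, so the globally smallest items end up spread over all
    units; the per-unit exchange volume is bounded by k items.
    """
    pool = []
    for h in local_heaps:
        for _ in range(min(k, len(h))):
            pool.append(heapq.heappop(h))
    pool.sort()
    for i, item in enumerate(pool):
        heapq.heappush(local_heaps[i % len(local_heaps)], item)
    return local_heaps

# Toy usage: 4 "computing units", each with its own local heap.
units = [list(range(u, 40, 4)) for u in range(4)]
for h in units:
    heapq.heapify(h)
redistribute(units)
print([h[0] for h in units])  # every unit now holds one of the smallest items
```

In a message passing setting the same step maps onto a collective exchange of the k local minima among the nodes, while in shared memory the local heaps simply live in different regions of the same address space, which is what makes a single unified formulation plausible.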
{"title":"A Scalable Unified Model for Dynamic Data Structures in Message Passing (Clusters) and Shared Memory (multicore CPUs) Computing environments","authors":"G. Laccetti, M. Lapegna, R. Montella","doi":"10.1109/CCGRID.2018.00007","DOIUrl":"https://doi.org/10.1109/CCGRID.2018.00007","url":null,"abstract":"Concurrent data structures are widely used in many software stack levels, ranging from high level parallel scientific applications to low level operating systems. The key issue of these objects is their concurrent use by several computing units (threads or process) so that the design of these structures is much more difficult compared to their sequential counterpart, because of their extremely dynamic nature requiring protocols to ensure data consistency, with a significant cost overhead. At this regard, several studies emphasize a tension between the needs of sequential correctness of the concurrent data structures and scalability of the algorithms, and in many cases it is evident the need to rethink the data structure design, using approaches based on randomization and/or redistribution techniques in order to fully exploit the computational power of the recent computing environments. The problem is grown in importance with the new generation High Performance Computing systems aimed to achieve extreme performance. It is easy to observe that such systems are based on heterogeneous architectures integrating several independent nodes in the form of clusters or MPP systems, where each node is composed by powerful computing elements (CPU core, GPUs or other acceleration devices) sharing resources in a single node. These systems therefore make massive use of communication libraries to exchange data among the nodes, as well as other tools for the management of the shared resources inside a single node. For such a reason, the development of algorithms and scientific software for dynamic data structures on these heterogeneous systems implies a suitable combination of several methodologies and tools to deal with the different kinds of parallelism corresponding to each specific device, so that to be aware of the underlying platform. The present work is aimed to introduce a scalable model to manage a special class of dynamic data structure known as heap based priority queue (or simply heap) on these heterogeneous architectures. A heap is generally used when the applications needs set of data not requiring a complete ordering, but only the access to some items tagged with high priority. In order to ensure a tradeoff between the correct access to high priority items by the several computing units with a low communication and synchronization overhead, a suitable reorganization of the heap is needed. More precisely we introduce a unified scalable model that can be used, with no modifications, to redeploy the items of a heap both in message passing environments (such as clusters and or MMP multicomputers with several nodes) as well as in shared memory environments (such as CPUs and multiprocessors with several cores) with an overhead independent of the number of computing units. 
Computational results related to the application of the proposed strategy on some numerical case studies are presented for different types of computing environments.","PeriodicalId":321027,"journal":{"name":"2018 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)","volume":"378 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126972784","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Adaptive Communication for Distributed Deep Learning on Commodity GPU Cluster
Li-Yung Ho, Jan-Jan Wu, Pangfeng Liu
Deep learning is now the most promising approach to developing human-intelligent computer systems. To speed up the development of neural networks, researchers have designed many distributed learning algorithms to facilitate the training process. In these algorithms, a constant is used to set the communication period for model/gradient exchange. We find that this type of communication pattern can incur unnecessary and inefficient data transmission for some training methods, e.g., elastic SGD and gossiping SGD. In this paper, we propose an adaptive communication method to improve the performance of gossiping SGD. Instead of using a fixed period for model exchange, we exchange the model with other machines according to the change in the local model. This makes the communication more efficient and thus improves performance. Experimental results show that our method reduces communication traffic by 92%, which yields a 52% reduction in training time while preserving prediction accuracy compared with gossiping SGD.
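A minimal sketch of the adaptive trigger, under our own assumptions (an L2-norm drift criterion with threshold tau; the paper's exact exchange rule may differ, and send_fn is a placeholder for whatever gossip transport the cluster uses):

```python
import numpy as np

class AdaptiveExchanger:
    """Push the local model only when it has drifted enough since the last push."""

    def __init__(self, model, send_fn, tau=0.05):
        self.last_sent = model.copy()
        self.send_fn = send_fn  # placeholder: e.g. a gossip push to a random peer
        self.tau = tau

    def maybe_exchange(self, model):
        # Relative change of the model since the last exchange.
        drift = np.linalg.norm(model - self.last_sent) / (np.linalg.norm(self.last_sent) + 1e-12)
        if drift > self.tau:      # the model changed enough to be worth the traffic
            self.send_fn(model)
            self.last_sent = model.copy()
            return True
        return False              # skip the exchange and save bandwidth

# Toy usage: exchanges fire only after the weights have moved noticeably.
w = np.zeros(10)
ex = AdaptiveExchanger(w, send_fn=lambda m: print("exchanged"), tau=0.05)
for step in range(100):
    w = w + 0.001 * np.random.randn(10)  # stand-in for a local SGD update
    ex.maybe_exchange(w)
```

Compared with a fixed exchange period, the number of transfers now tracks how fast the local model actually changes, which is the source of the reported traffic reduction.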
{"title":"Adaptive Communication for Distributed Deep Learning on Commodity GPU Cluster","authors":"Li-Yung Ho, Jan-Jan Wu, Pangfeng Liu","doi":"10.1109/CCGRID.2018.00043","DOIUrl":"https://doi.org/10.1109/CCGRID.2018.00043","url":null,"abstract":"Deep learning is now the most promising approach to develop human-intelligent computer systems. To speedup the development of neural networks, researchers have designed many distributed learning algorithms to facilitate the training process. In these algorithms, people use a constant to indicate the communication period for model/gradient exchange. We find that this type of communication pattern could incur unnecessary and inefficient data transmission for some training methods e.g., elastic SGD and gossiping SGD. In this paper, we propose an adaptive communication method to improve the performance of gossiping SGD. Instead of using a fixed period for model exchange, we exchange the models with other machines according to the change of the local model. This makes the communication more efficient and thus improves the performance. The experiment results show that our method reduces the communication traffic by 92%, which results in 52% reduction in training time while preserving the prediction accuracy compared with gossiping SGD.","PeriodicalId":321027,"journal":{"name":"2018 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121177898","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
Experimental Study on the Performance and Resource Utilization of Data Streaming Frameworks
Subarna Chatterjee, C. Morin
With the advent of the Internet of Things (IoT), data stream processing has gained increased attention due to the ever-increasing need to process heterogeneous and voluminous data streams. This work addresses the problem of selecting the right stream processing framework for a given application to be executed within a specific physical infrastructure. For this purpose, we focus on a thorough comparative analysis of three data stream processing platforms, Apache Flink, Apache Storm, and Twitter Heron (the enhanced version of Apache Storm), chosen for their potential to process both streams and batches in real time. The goal of the work is to give cloud clients and cloud providers the knowledge needed to choose a resource-efficient, requirement-adaptive streaming platform for a given application, so that they can plan the allocation or assignment of Virtual Machines for application execution. For the comparative performance analysis of the chosen platforms, we experimented with 8-node clusters on the Grid5000 testbed and selected a wide variety of applications, ranging from a conventional benchmark to a sensor-based IoT application and a statistical batch processing application. In addition to various performance metrics related to the elasticity and resource usage of the platforms, this work presents a comparative study of the "green-ness" of the streaming platforms by analyzing their power consumption, one of the first attempts of its kind. The results are thoroughly analyzed to illustrate the functional behavior of these platforms under different computing scenarios.
{"title":"Experimental Study on the Performance and Resource Utilization of Data Streaming Frameworks","authors":"Subarna Chatterjee, C. Morin","doi":"10.1109/CCGRID.2018.00029","DOIUrl":"https://doi.org/10.1109/CCGRID.2018.00029","url":null,"abstract":"With the advent of the Internet of Things (IoT), data stream processing have gained increased attention due to the ever-increasing need to process heterogeneous and voluminous data streams. This work addresses the problem of selecting a correct stream processing framework for a given application to be executed within a specific physical infrastructure. For this purpose, we focus on a thorough comparative analysis of three data stream processing platforms – Apache Flink, Apache Storm, and Twitter Heron (the enhanced version of Apache Storm), that are chosen based on their potential to process both streams and batches in real-time. The goal of the work is to enlighten the cloud-clients and the cloud-providers with the knowledge of the choice of the resource-efficient and requirement-adaptive streaming platform for a given application so that they can plan during allocation or assignment of Virtual Machines for application execution. For the comparative performance analysis of the chosen platforms, we have experimented using 8-node clusters on Grid5000 experimentation testbed and have selected a wide variety of applications ranging from a conventional benchmark to sensor-based IoT application and statistical batch processing application. In addition to the various performance metrics related to the elasticity and resource usage of the platforms, this work presents a comparative study of the “green-ness” of the streaming platforms by analyzing their power consumption – one of the first attempts of its kind. The obtained results are thoroughly analyzed to illustrate the functional behavior of these platforms under different computing scenarios.","PeriodicalId":321027,"journal":{"name":"2018 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)","volume":"111 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123834208","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 12
Stocator: Providing High Performance and Fault Tolerance for Apache Spark Over Object Storage
G. Vernik, M. Factor, E. K. Kolodner, P. Michiardi, Effi Ofer, Francesco Pace
Until now, object storage has not been a first-class citizen of the Apache Hadoop ecosystem, including Apache Spark. Hadoop connectors to object storage have been based on file semantics, an impedance mismatch that leads to low performance and requires an additional consistent storage system to achieve fault tolerance. In particular, Hadoop depends on its underlying storage system and its associated connector for fault tolerance and for allowing speculative execution. However, these characteristics are obtained through file operations that are not native to object storage and are both costly and non-atomic. As a result, these connectors are not efficient and, more importantly, they cannot help with fault tolerance for object storage. We introduce Stocator, whose novel algorithm achieves both high performance and fault tolerance by taking advantage of object storage semantics. This greatly decreases the number of operations on object storage and enables a much simpler approach to dealing with the eventually consistent semantics typical of object storage. We have implemented Stocator and shared it in open source. Performance testing with Apache Spark shows that it can be 18 times faster for write-intensive workloads and can perform 30 times fewer operations on object storage than the legacy Hadoop connectors, reducing costs both for the client and for the object storage service provider.
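The core idea, avoiding the temporary-name-plus-rename protocol that file-oriented committers rely on (a rename on an object store is really a copy plus a delete), can be sketched as follows. Both commit strategies below are our own simplified illustration; DictStore and put_object are stand-ins, not Stocator's or Hadoop's actual APIs.

```python
class DictStore:
    """Tiny in-memory stand-in for an object store."""
    def __init__(self):
        self.objects = {}
    def put_object(self, name, data):
        self.objects[name] = data
    def get_object(self, name):
        return self.objects[name]
    def delete_object(self, name):
        del self.objects[name]

def commit_with_rename(store, task_id, final_name, data):
    """File-style committer: write under a temporary name, then 'rename'.

    On an object store the rename is emulated as copy + delete, so the
    commit costs three object operations and is not atomic.
    """
    tmp_name = f"_temporary/{task_id}/{final_name}"
    store.put_object(tmp_name, data)
    store.put_object(final_name, store.get_object(tmp_name))  # copy
    store.delete_object(tmp_name)                             # delete

def commit_direct(store, task_id, attempt, final_name, data):
    """Object-native idea: a single PUT under a name that encodes the task
    attempt, so a failed attempt just leaves an object readers ignore."""
    store.put_object(f"{final_name}-attempt_{attempt}-task_{task_id}", data)

store = DictStore()
commit_with_rename(store, task_id=7, final_name="part-00007", data=b"rows")
commit_direct(store, task_id=7, attempt=0, final_name="part-00007", data=b"rows")
print(sorted(store.objects))
```

The difference in the number of object operations per task output illustrates where the reported reduction in object-store requests can come from.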
{"title":"Stocator: Providing High Performance and Fault Tolerance for Apache Spark Over Object Storage","authors":"G. Vernik, M. Factor, E. K. Kolodner, P. Michiardi, Effi Ofer, Francesco Pace","doi":"10.1109/CCGRID.2018.00073","DOIUrl":"https://doi.org/10.1109/CCGRID.2018.00073","url":null,"abstract":"Until now object storage has not been a first-class citizen of the Apache Hadoop ecosystem including Apache Spark. Hadoop connectors to object storage have been based on file semantics, an impedance mismatch, which leads to low performance and the need for an additional consistent storage system to achieve fault tolerance. In particular, Hadoop depends on its underlying storage system and its associated connector for fault tolerance and allowing speculative execution. However, these characteristics are obtained through file operations that are not native for object storage, and are both costly and not atomic. As a result these connectors are not efficient and more importantly they cannot help with fault tolerance for object storage. We introduce Stocator, whose novel algorithm achieves both high performance and fault tolerance by taking advantage of object storage semantics. This greatly decreases the number of operations on object storage as well as enabling a much simpler approach to dealing with the eventually consistent semantics typical of object storage. We have implemented Stocator and shared it in open source. Performance testing with Apache Spark shows that it can be 18 times faster for write intensive workloads and can perform 30 times fewer operations on object storage than the legacy Hadoop connectors, reducing costs both for the client and the object storage service provider.","PeriodicalId":321027,"journal":{"name":"2018 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)","volume":"197 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121107609","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 7
ApproxG: Fast Approximate Parallel Graphlet Counting Through Accuracy Control
Daniel Mawhirter, Bo Wu, D. Mehta, Chao Ai
Graphlet counting is a methodology for detecting local structural properties of large graphs that has been in use for over a decade. Despite tremendous effort in optimizing its performance, even 3- and 4-node graphlet counting routines may run for hours or days on highly optimized systems. In this paper, we describe how a synergistic combination of approximate computing with parallel computing can result in multiplicative performance improvements in graphlet counting runtimes with minimal and controllable loss of accuracy. Specifically, we describe two novel techniques, multi-phased sampling for statistical accuracy guarantees and cost-aware sampling to further improve performance on multi-machine runs, which reduce the query time on large graphs from tens of hours to several minutes or seconds with only <1% relative error.
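As one concrete, much simplified instance of sampling-based graphlet counting with accuracy control, the sketch below estimates the triangle count of a graph by sampling edges in phases and stopping once a normal-approximation confidence interval is tight enough. The phase size, the 95% z-value and the stopping rule are our assumptions for illustration, not the ApproxG algorithm itself.

```python
import math
import random
from itertools import combinations

def estimate_triangles(adj, edges, rel_err=0.02, phase=1000, z=1.96, max_samples=200000):
    """Estimate the number of triangles by sampling edges in phases.

    For a uniformly sampled edge (u, v) we count its common neighbours.
    Summing that count over all edges counts every triangle exactly 3
    times, so mean * |E| / 3 is an unbiased estimate. Sampling stops when
    the 95% confidence half-width drops below rel_err of the estimate.
    """
    samples = []
    while len(samples) < max_samples:
        for _ in range(phase):
            u, v = random.choice(edges)
            samples.append(len(adj[u] & adj[v]))  # common neighbours of the edge
        n = len(samples)
        mean = sum(samples) / n
        var = sum((x - mean) ** 2 for x in samples) / max(n - 1, 1)
        if mean > 0 and z * math.sqrt(var / n) <= rel_err * mean:
            break
    return mean * len(edges) / 3

# Toy usage on a small Erdos-Renyi-style graph.
random.seed(0)
n = 300
edges = [(u, v) for u, v in combinations(range(n), 2) if random.random() < 0.05]
adj = {u: set() for u in range(n)}
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)
print("approx triangles:", estimate_triangles(adj, edges))
```

The same pattern, sampling in phases and checking a statistical error bound before drawing more samples, is what allows accuracy to be traded for query time in a controlled way.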
{"title":"ApproxG: Fast Approximate Parallel Graphlet Counting Through Accuracy Control","authors":"Daniel Mawhirter, Bo Wu, D. Mehta, Chao Ai","doi":"10.1109/CCGRID.2018.00080","DOIUrl":"https://doi.org/10.1109/CCGRID.2018.00080","url":null,"abstract":"Graphlet counting is a methodology for detecting local structural properties of large graphs that has been in use for over a decade. Despite tremendous effort in optimizing its performance, even 3- and 4-node graphlet counting routines may run for hours or days on highly optimized systems. In this paper, we describe how a synergistic combination of approximate computing with parallel computing can result in multiplicative performance improvements in graphlet counting runtimes with minimal and controllable loss of accuracy. Specifically, we describe two novel techniques, multi-phased sampling for statistical accuracy guarantees and cost-aware sampling to further improve performance on multi-machine runs, which reduce the query time on large graphs from tens of hours to several minutes or seconds with only <1% relative error.","PeriodicalId":321027,"journal":{"name":"2018 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)","volume":"325 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123213988","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 14
Enhancing Efficiency of Hybrid Transactional Memory Via Dynamic Data Partitioning Schemes
Pedro Raminhas, S. Issa, P. Romano
Transactional Memory (TM) is an emerging paradigm that promises to significantly ease the development of parallel programs. Hybrid TM (HyTM) is probably the most promising implementation of the TM abstraction, as it seeks to combine the high efficiency of hardware implementations (HTM) with the robustness and flexibility of software-based ones (STM). Unfortunately, though, existing Hybrid TM systems are known to suffer from high overheads to guarantee correct synchronization between concurrent transactions executing in hardware and software. This article introduces DMP-TM (Dynamic Memory Partitioning-TM), a novel HyTM algorithm that exploits, to the best of our knowledge for the first time in the literature, the idea of leveraging operating system-level memory protection mechanisms to detect conflicts between HTM and STM transactions. This innovative design allows for employing highly scalable STM implementations, while avoiding instrumentation on the HTM path. This allows DMP-TM to achieve up to ~20× speedups compared to state-of-the-art Hybrid TM solutions in uncontended workloads. Further, thanks to the use of simple and lightweight self-tuning mechanisms, DMP-TM achieves robust performance even in unfavourable workloads that exhibit high contention between the STM and HTM paths.
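The OS-level protection mechanism itself is hard to reproduce in a few lines, but the dynamic partitioning idea behind it can be sketched: regions of the data are dynamically owned by either the HTM side or the STM side, a transaction runs without cross-instrumentation as long as it stays inside regions owned by its own side, and an access to a foreign region plays the role of the memory-protection fault that triggers a migration and a retry. Everything below is our own simplified illustration in plain Python, not the DMP-TM algorithm or a real HTM/STM API.

```python
class PartitionDirectory:
    """Coarse-grained ownership map: each partition is owned by 'HTM' or 'STM'."""

    def __init__(self, num_partitions=16):
        self.owner = ["HTM"] * num_partitions
        self.num_partitions = num_partitions

    def partition_of(self, key):
        return hash(key) % self.num_partitions

    def access(self, side, key):
        """Return True if the access stays in a partition owned by `side`.

        A foreign access stands in for the protection fault: ownership of
        the partition migrates to the requesting side and the caller is
        expected to abort and retry its transaction.
        """
        p = self.partition_of(key)
        if self.owner[p] == side:
            return True
        self.owner[p] = side   # migrate ownership (protect/unprotect in DMP-TM terms)
        return False           # caller aborts and retries

# Toy usage: an STM transaction touching an HTM-owned partition "faults" once,
# then succeeds on retry because ownership has migrated.
d = PartitionDirectory()
print(d.access("STM", "account:42"))  # False: fault + migration
print(d.access("STM", "account:42"))  # True: partition is now STM-owned
```

The point of the design is that the common case, transactions staying inside partitions of their own side, pays no instrumentation cost on the HTM path.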
{"title":"Enhancing Efficiency of Hybrid Transactional Memory Via Dynamic Data Partitioning Schemes","authors":"Pedro Raminhas, S. Issa, P. Romano","doi":"10.1109/CCGRID.2018.00020","DOIUrl":"https://doi.org/10.1109/CCGRID.2018.00020","url":null,"abstract":"Transactional Memory (TM) is an emerging paradigm that promises to significantly ease the development of parallel programs. Hybrid TM (HyTM) is probably the most promising implementation of the TM abstraction, which seeks to combine the high efficiency of hardware implementations (HTM) with the robustness and flexibility of software-based ones (STM). Unfortunately, though, existing Hybrid TM systems are known to suffer from high overheads to guarantee correct synchronization between concurrent transactions executing in hardware and software. This article introduces DMP-TM (Dynamic Memory Partitioning-TM), a novel HyTM algorithm that exploits, to the best of our knowledge for the first time in the literature, the idea of leveraging operating system-level memory protection mechanisms to detect conflicts between HTM and STM transactions. This innovative design allows for employing highly scalable STM implementations, while avoiding instrumentation on the HTM path. This allows DMP-TM to achieve up to ~ 20× speedups compared to state of the art Hybrid TM solutions in uncontended workloads. Further, thanks to the use of simple and lightweight self-tuning mechanisms, DMP-TM achieves robust performance even in unfavourable workload that exhibits high contention between the STM and HTM path.","PeriodicalId":321027,"journal":{"name":"2018 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128486893","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
Real-Time Graph Partition and Embedding of Large Network
Wenqi Liu, Hongxiang Li, Bin Xie
Recently, large-scale networks have attracted significant attention as a means to analyze and extract the hidden information in big data. Toward this end, graph embedding is a method to embed a high-dimensional graph into a much lower-dimensional vector space while maximally preserving the structural information of the original network. However, effective graph embedding is particularly challenging when massive graph data are generated and processed for real-time applications. In this paper, we address this challenge and propose a new real-time and distributed graph embedding algorithm (RTDGE) that is capable of distributively embedding a large-scale graph in a streaming fashion. Specifically, our RTDGE consists of the following components: (1) a graph partition scheme that divides all edges into distinct subgraphs, where vertices are associated with edges and may belong to several subgraphs; (2) a dynamic negative sampling (DNS) method that updates the embedded vectors in real time; and (3) an unsupervised global aggregation scheme that combines all locally embedded vectors into a global vector space. Furthermore, we also build a real-time distributed graph embedding platform based on Apache Kafka and Apache Storm. Extensive experimental results show that RTDGE outperforms existing solutions in terms of graph embedding efficiency and accuracy.
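A hedged sketch of component (1), the edge-based partition: edges are hashed into k subgraphs, and a vertex belongs to every subgraph that holds one of its edges, so vertices may be replicated across subgraphs. The hash-based assignment is our own choice for illustration; the paper's streaming partitioner may use a different rule.

```python
def partition_edges(edge_stream, k):
    """Assign each edge of a stream to one of k subgraphs.

    Vertices are associated with edges, so a vertex can appear in several
    subgraphs; each subgraph can then be embedded independently and the
    per-vertex vectors merged by a global aggregation step later on.
    """
    subgraph_edges = [[] for _ in range(k)]
    subgraph_vertices = [set() for _ in range(k)]
    for u, v in edge_stream:
        p = hash((min(u, v), max(u, v))) % k  # deterministic per-edge assignment
        subgraph_edges[p].append((u, v))
        subgraph_vertices[p].update((u, v))
    return subgraph_edges, subgraph_vertices

# Toy usage on a 5-edge graph; vertices whose edges land in different parts
# appear in both subgraphs.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (1, 3)]
parts, verts = partition_edges(edges, k=2)
print([len(p) for p in parts], [sorted(v) for v in verts])
```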
{"title":"Real-Time Graph Partition and Embedding of Large Network","authors":"Wenqi Liu, Hongxiang Li, Bin Xie","doi":"10.1109/CCGRID.2018.00070","DOIUrl":"https://doi.org/10.1109/CCGRID.2018.00070","url":null,"abstract":"Recently, large-scale networks attract significant attention to analyze and extract the hidden information of big data. Toward this end, graph embedding is a method to embed a high dimensional graph into a much lower dimensional vector space while maximally preserving the structural information of the original network. However, effective graph embedding is particularly challenging when massive graph data are generated and processed for real-time applications. In this paper, we address this challenge and propose a new real-time and distributed graph embedding algorithm (RTDGE) that is capable of distributively embedding a large-scale graph in a streaming fashion. Specifically, our RTDGE consists of the following components: (1) a graph partition scheme that divides all edges into distinct subgraphs, where vertices are associated with edges and may belong to several subgraphs; (2) a dynamic negative sampling (DNS) method that updates the embedded vectors in real-time; and (3) an unsupervised global aggregation scheme that combines all locally embedded vectors into a global vector space. Furthermore, we also build a real-time distributed graph embedding platform based on Apache Kafka and Apache Storm. Extensive experimental results show that RTDGE outperforms existing solutions in terms of graph embedding efficiency and accuracy.","PeriodicalId":321027,"journal":{"name":"2018 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125326117","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
Optimizing Data Transfers for Improved Performance on Shared GPUs Using Reinforcement Learning
R. Luley, Qinru Qiu
Optimizing resource utilization is a critical issue in cloud and cluster-based computing systems. In such systems, computing resources often consist of one or more GPU devices, and much research has already been conducted on maximizing compute resources through shared execution strategies. However, one of the most severe resource constraints in these scenarios is the data transfer channel between the host (i.e., CPU) and the device (i.e., GPU). Data transfer contention has been shown to have a significant impact on performance, yet methods for mitigating such contention have not been thoroughly studied. The techniques that have been examined make certain assumptions that limit their effectiveness in the general case. In this paper, we introduce a heuristic that selectively aggregates transfers in order to maximize system performance by optimizing the transfer channel bandwidth. We compare this heuristic to the traditional first-come-first-served approach, and apply Monte Carlo reinforcement learning to find an optimal policy for message aggregation. Finally, we evaluate the performance of Monte Carlo reinforcement learning with an arbitrarily initialized policy. We demonstrate its effectiveness in learning an optimal data transfer policy without detailed system characterization, which will enable a generally adaptable solution for resource management of future systems.
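A minimal sketch of the Monte Carlo approach, under an invented cost model: the state is the number of buffered messages, the actions are 'wait' (keep aggregating) and 'flush' (issue one host-to-device transfer), and the reward penalizes both per-transfer latency and the delay of buffered messages. None of this is the paper's actual system model; the arbitrarily-initialized policy mirrors the evaluation setting mentioned in the abstract.

```python
import random
from collections import defaultdict

ACTIONS = ["wait", "flush"]
LATENCY, PER_BYTE, DELAY_COST = 5.0, 0.1, 0.2  # invented cost model

def run_episode(policy, eps=0.1, steps=50, msg_size=8):
    """Simulate one stream of messages; return the (state, action, reward) trace."""
    trace, buffered = [], 0
    for _ in range(steps):
        buffered += 1                                  # a new message arrives
        state = min(buffered, 20)
        action = random.choice(ACTIONS) if random.random() < eps else policy[state]
        if action == "flush":
            reward = -(LATENCY + PER_BYTE * msg_size * buffered)
            buffered = 0
        else:
            reward = -DELAY_COST * buffered            # waiting delays buffered messages
        trace.append((state, action, reward))
    return trace

def mc_control(episodes=3000, gamma=1.0):
    """First-visit Monte Carlo control with an epsilon-greedy behaviour policy."""
    q, counts = defaultdict(float), defaultdict(int)
    policy = defaultdict(lambda: "flush")              # arbitrarily-initialized policy
    for _ in range(episodes):
        trace = run_episode(policy)
        returns, g = [], 0.0
        for _, _, reward in reversed(trace):           # return from each step onward
            g = reward + gamma * g
            returns.append(g)
        returns.reverse()
        visited = set()
        for t, (state, action, _) in enumerate(trace):
            if (state, action) in visited:             # first-visit update only
                continue
            visited.add((state, action))
            counts[(state, action)] += 1
            q[(state, action)] += (returns[t] - q[(state, action)]) / counts[(state, action)]
        for s in range(1, 21):                         # greedy policy improvement
            policy[s] = max(ACTIONS, key=lambda a: q[(s, a)])
    return policy

policy = mc_control()
print({s: policy[s] for s in range(1, 11)})            # learned wait-or-flush rule
```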
{"title":"Optimizing Data Transfers for Improved Performance on Shared GPUs Using Reinforcement Learning","authors":"R. Luley, Qinru Qiu","doi":"10.1109/CCGRID.2018.00061","DOIUrl":"https://doi.org/10.1109/CCGRID.2018.00061","url":null,"abstract":"Optimizing resource utilization is a critical issue in cloud and cluster-based computing systems. In such systems, computing resources often consist of one or more GPU devices, and much research has already been conducted on means for maximizing compute resources through shared execution strategies. However, one of the most severe resource constraints in these scenarios is the data transfer channel between the host (i.e., CPU) and the device (i.e., GPU). Data transfer contention has been shown to have a significant impact on performance, yet methods for optimizing such contention have not been thoroughly studied. Techniques that have been examined make certain assumptions which limit effectiveness in the general case. In this paper, we introduce a heuristic which selectively aggregates transfers in order to maximize system performance by optimizing the transfer channel bandwidth. We compare this heuristic to traditional first-come-first-served approach, and apply Monte Carlo reinforcement learning to find an optimal policy for message aggregation. Finally, we evaluate the performance of Monte Carlo reinforcement learning with an arbitrarily-initialized policy. We demonstrate its effectiveness in learning optimal data transfer policy without detailed system characterization, which will enable a general adaptable solution for resource management of future systems.","PeriodicalId":321027,"journal":{"name":"2018 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)","volume":"137 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116722027","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
Data Analysis of a Google Data Center
P. Minet, É. Renault, I. Khoufi, S. Boumerdassi
Data collected from an operational Google data center over 29 days represent a very rich and useful source of information for understanding the main features of a data center. In this paper, we highlight the strong heterogeneity of jobs. The distribution of job execution durations shows a high disparity, as does the distribution of job waiting times before scheduling. The resource requests in terms of CPU and memory are also analyzed. Knowledge of all these features is needed to design models of jobs, machines and resource requests that are representative of a real data center.
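For readers who want to reproduce this kind of profiling, a hedged pandas sketch follows; the file name and column names (job_events.csv, submit_time, schedule_time, finish_time, cpu_request, memory_request) are illustrative placeholders and do not match the actual schema of the publicly released Google cluster trace.

```python
import pandas as pd

# Illustrative schema; the real trace uses different file layouts and field names.
jobs = pd.read_csv("job_events.csv")

jobs["wait_time"] = jobs["schedule_time"] - jobs["submit_time"]
jobs["duration"] = jobs["finish_time"] - jobs["schedule_time"]

# Disparity of execution durations and waiting times: look at the spread
# across quantiles rather than at the mean alone.
print(jobs[["duration", "wait_time"]].describe(percentiles=[0.5, 0.9, 0.99]))

# Heterogeneity of resource requests (CPU and memory).
print(jobs[["cpu_request", "memory_request"]].quantile([0.1, 0.5, 0.9, 0.99]))
```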
{"title":"Data Analysis of a Google Data Center","authors":"P. Minet, É. Renault, I. Khoufi, S. Boumerdassi","doi":"10.1109/CCGRID.2018.00049","DOIUrl":"https://doi.org/10.1109/CCGRID.2018.00049","url":null,"abstract":"Data collected from an operational Google data center during 29 days represent a very rich and very useful source of information for understanding the main features of a data center. In this paper, we highlight the strong heterogeneity of jobs. The distribution of job execution duration shows a high disparity, as well as the job waiting time before being scheduled. The resource requests in terms of CPU and memory are also analyzed. The knowledge of all these features is needed to design models of jobs, machines and resource requests that are representative of a real data center.","PeriodicalId":321027,"journal":{"name":"2018 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)","volume":"14 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116822815","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
Parallel Low Discrepancy Parameter Sweep for Public Health Policy
Sudheer Chunduri, Meysam Ghaffari, M. S. Lahijani, A. Srinivasan, S. Namilae
Numerical simulations are used to analyze the effectiveness of alternate public policy choices in limiting the spread of infections. In practice, it is usually not feasible to predict their precise impacts due to inherent uncertainties, especially at the early stages of an epidemic. One option is to parameterize the sources of uncertainty and carry out a parameter sweep to identify their robustness under a variety of possible scenarios. The Self Propelled Entity Dynamics (SPED) model has used this approach successfully to analyze the robustness of different airline boarding and deplaning procedures. However, the time taken by this approach is too large to answer questions raised during the course of a decision meeting. In this paper, we use a modified approach that pre-computes simulations of passenger movement, performing only the disease-specific analysis in real time. A novel contribution of this paper lies in using a low discrepancy sequence (LDS) in the parameter sweep, and demonstrating that it can lead to a reduction in analysis time by one to three orders of magnitude over the conventional lattice-based parameter sweep. However, its parallelization suffers from greater load imbalance than the conventional approach. We examine this and relate it to number-theoretic properties of the LDS. We then propose solutions to this problem. Our approach and analysis are applicable to other parameter sweep problems too. The primary contributions of this paper lie in the new approach of low discrepancy parameter sweep and in exploring solutions to challenges in its parallelization, evaluated in the context of an important public health application.
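To make the contrast concrete, the sketch below generates sweep points for a two-parameter space either on a regular lattice or from a Halton low discrepancy sequence. The Halton construction is a standard LDS used here purely as an illustration of the approach; it is not necessarily the sequence the authors used, and the parameter ranges are invented.

```python
def halton(index, base):
    """Radical-inverse (van der Corput) value of `index` in the given base."""
    result, f = 0.0, 1.0 / base
    while index > 0:
        result += f * (index % base)
        index //= base
        f /= base
    return result

def lds_sweep(n):
    """n low-discrepancy points in the unit square (Halton bases 2 and 3)."""
    return [(halton(i, 2), halton(i, 3)) for i in range(1, n + 1)]

def lattice_sweep(m):
    """m x m regular lattice of points in the unit square."""
    return [((i + 0.5) / m, (j + 0.5) / m) for i in range(m) for j in range(m)]

def to_params(points, lo=(0.5, 0.1), hi=(1.5, 1.0)):
    """Map unit-square points to physical parameter ranges (ranges invented)."""
    return [(lo[0] + x * (hi[0] - lo[0]), lo[1] + y * (hi[1] - lo[1])) for x, y in points]

# An LDS covers the parameter space more evenly than a lattice with the same
# number of points, which is the property exploited to cut the number of
# simulations needed for a given accuracy.
print(to_params(lds_sweep(64))[:3])
print(to_params(lattice_sweep(8))[:3])
```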
{"title":"Parallel Low Discrepancy Parameter Sweep for Public Health Policy","authors":"Sudheer Chunduri, Meysam Ghaffari, M. S. Lahijani, A. Srinivasan, S. Namilae","doi":"10.1109/CCGRID.2018.00044","DOIUrl":"https://doi.org/10.1109/CCGRID.2018.00044","url":null,"abstract":"Numerical simulations are used to analyze the effectiveness of alternate public policy choices in limiting the spread of infections. In practice, it is usually not feasible to predict their precise impacts due to inherent uncertainties, especially at the early stages of an epidemic. One option is to parameterize the sources of uncertainty and carry out a parameter sweep to identify their robustness under a variety of possible scenarios. The Self Propelled Entity Dynamics (SPED) model has used this approach successfully to analyze the robustness of different airline boarding and deplaning procedures. However, the time taken by this approach is too large to answer questions raised during the course of a decision meeting. In this paper, we use a modified approach that pre-computes simulations of passenger movement, performing only the disease-specific analysis in real time. A novel contribution of this paper lies in using a low discrepancy sequence (LDS) in the parameter sweep, and demonstrating that it can lead to a reduction in analysis time by one to three orders of magnitude over the conventional lattice-based parameter sweep. However, its parallelization suffers from greater load imbalance than the conventional approach. We examine this and relate it to number-theoretic properties of the LDS. We then propose solutions to this problem. Our approach and analysis are applicable to other parameter sweep problems too. The primary contributions of this paper lie in the new approach of low discrepancy parameter sweep and in exploring solutions to challenges in its parallelization, evaluated in the context of an important public health application.","PeriodicalId":321027,"journal":{"name":"2018 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)","volume":"53 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133561074","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 11