Proceedings of the 48th International Conference on Parallel Processing: Latest Publications

Modeling the Performance of Atomic Primitives on Modern Architectures
Pub Date : 2019-08-05 DOI: 10.1145/3337821.3337901
F. Hoseini, A. Atalar, P. Tsigas
Utilizing the atomic primitives of a processor to access a memory location atomically is key to the correctness and feasibility of parallel software systems. The performance of atomics plays a significant role in the scalability and overall performance of parallel software systems. In this work, we study the performance (in terms of latency, throughput, fairness, and energy consumption) of atomic primitives in the context of the two common software execution settings that result in high- and low-contention access to shared memory. We perform and present an exhaustive study of the performance of atomics in these two application contexts and propose a performance model that captures their behavior. We consider two state-of-the-art architectures: Intel Xeon E5 and Xeon Phi (KNL). The proposed model is centered around the bouncing of cache lines between threads that execute atomic primitives on shared cache lines. The model is simple to use in practice, captures the behavior of atomics accurately under these execution scenarios, and facilitates algorithmic design decisions in multi-threaded programming.
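The cache-line-bouncing intuition behind the model can be captured in a toy form: under contention, each atomic operation must first acquire the line from its previous owner, so aggregate throughput is roughly flat in the thread count, while uncontended operations scale. A minimal sketch (the function name and latency constants below are invented for illustration, not the paper's actual model):

```python
# Toy throughput model for atomic primitives (all names/constants hypothetical).
def atomic_throughput(threads, t_local_ns, t_transfer_ns, contended):
    """Estimated aggregate atomic operations per second.

    Under contention, every operation must first pull the cache line from
    the previous owner, so operations serialize at roughly one completion
    per line hand-off; without contention, each thread proceeds on its own
    cache line and throughput scales with the thread count.
    """
    if contended:
        # Serialized: one cache-line transfer per completed operation.
        return 1e9 / (t_local_ns + t_transfer_ns)
    # Uncontended: threads operate independently on private lines.
    return threads * 1e9 / t_local_ns
```

Under this sketch, doubling the thread count leaves contended throughput unchanged but doubles uncontended throughput, which is the qualitative behavior the model is meant to capture.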
Citations: 5
Adaptive Routing Reconfigurations to Minimize Flow Cost in SDN-Based Data Center Networks
Pub Date : 2019-08-05 DOI: 10.1145/3337821.3337861
Akbar Majidi, Xiaofeng Gao, S. Zhu, Nazila Jahanbakhsh, Guihai Chen
Data center networks have become heavily reliant on software-defined networking (SDN) to orchestrate data transmission. To maintain optimal network configurations, a controller needs to solve the multi-commodity flow problem and globally update the network under tight time constraints. In this paper, we aim to minimize flow cost (intuitively, the average transmission delay) under reconfiguration budget constraints in data centers. We formulate this optimization problem as a constrained Markov Decision Process and propose a set of algorithms to solve it in a scalable manner. We first develop a propagation algorithm to identify the flows that are most affected in terms of latency and should be reconfigured in the next network update. We then bound the range of each update, improving adaptability and scalability by reconfiguring fewer flows at a time, which also keeps each operation fast. Further, based on the drift-plus-penalty method from Lyapunov theory, we propose a heuristic policy that requires no prior information about flow demand yet carries a performance guarantee, minimizing the additive optimality gap. To the best of our knowledge, this is the first paper to study the range and frequency of flow reconfigurations, which has both theoretical and practical significance in this area. Extensive emulations and numerical simulations, whose results are much better than the estimated theoretical bound, show that our proposed policy outperforms state-of-the-art algorithms in terms of latency by over 45% while also improving adaptability and scalability.
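The budgeted-update idea (reconfigure only the most latency-affected flows each round) can be sketched with a simple greedy selection; the function and its delay metric are illustrative stand-ins, not the paper's propagation algorithm:

```python
# Illustrative sketch: pick at most `budget` flows to reconfigure this round,
# ranked by how much their current routing inflates delay (values hypothetical).
def pick_flows_to_update(flow_delays, budget):
    """flow_delays: mapping flow-id -> estimated excess delay on its current path.
    Returns the `budget` flows with the largest excess delay, most affected first."""
    ranked = sorted(flow_delays.items(), key=lambda kv: kv[1], reverse=True)
    return [flow for flow, _ in ranked[:budget]]
```

For example, with excess delays `{"a": 5, "b": 9, "c": 1}` and a budget of 2, only `b` and `a` would be reconfigured in this round; `c` waits for a later update.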
Citations: 9
Unleashing the Scalability Potential of Power-Constrained Data Center in the Microservice Era
Pub Date : 2019-08-05 DOI: 10.1145/3337821.3337857
Xiaofeng Hou, Jiacheng Liu, Chao Li, M. Guo
Recent scale-out cloud services have undergone a shift from monolithic applications to microservices by putting each functionality into lightweight software containers. Although traditional data center power optimization frameworks excel at per-server or per-rack management, they can hardly make informed decisions when facing microservices that have different QoS requirements on a per-service basis. In a power-constrained data center, blindly budgeting power usage can lead to a power imbalance: microservices on the critical path may not receive an adequate power budget, which unavoidably hinders the growth of cloud productivity. To unleash the performance potential of the cloud in the microservice era, this paper investigates microservice-aware data center resource management. We model microservices using a bipartite graph and propose a metric called the microservice criticality factor (MCF) to measure the overall impact of performance scaling on a microservice from the whole application's perspective. We further devise ServiceFridge, a novel system framework that leverages MCF to jointly orchestrate software containers and control hardware power demand. Our detailed case study on a practical microservice application demonstrates that ServiceFridge allows a data center to reduce its dynamic power by 25% with only a slight performance loss. It improves the mean response time by 25.2% and the 90th-percentile tail latency by 18.0% compared with existing schemes.
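The abstract does not spell out the MCF formula, but the intent (score a service by its impact on the whole application) can be illustrated with a crude proxy: the fraction of end-to-end request paths in the call graph that traverse the service. Everything below, including the graph shape and the metric itself, is an invented illustration, not the paper's bipartite-graph definition:

```python
# Hypothetical criticality proxy over a call graph (adjacency dict).
def criticality(call_graph, service, root="root"):
    """Fraction of root-to-leaf request paths that pass through `service`.
    A service on every path scores 1.0; an off-path service scores lower."""
    def paths(node):
        children = call_graph.get(node, [])
        if not children:
            return [[node]]  # leaf: one path ending here
        return [[node] + p for c in children for p in paths(c)]
    all_paths = paths(root)
    return sum(1 for p in all_paths if service in p) / len(all_paths)

# Example topology (hypothetical): root fans out to auth and search; search calls db.
calls = {"root": ["auth", "search"], "search": ["db"]}
```

Under this proxy, `db` lies on one of the two request paths and scores 0.5, while `root` lies on all paths and scores 1.0; a power budgeter would prioritize high scorers.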
Citations: 12
Cartesian Collective Communication
Pub Date : 2019-08-05 DOI: 10.1145/3337821.3337848
J. Träff, S. Hunold
We introduce Cartesian Collective Communication as sparse, collective communication defined on processes (processors) organized into d-dimensional tori or meshes. Processes specify local neighborhoods, e.g., stencil patterns, by lists of relative Cartesian coordinate offsets. The Cartesian collective operations perform data exchanges (and reductions) over the set of all neighborhoods such that each process communicates with the processes in its local neighborhood. The key requirement is that local neighborhoods must be structurally identical (isomorphic). This makes it possible for processes to compute correct, deadlock-free, efficient communication schedules for the collective operations locally, without any interaction with other processes. Cartesian Collective Communication substantially extends the collective neighborhood communication on Cartesian communicators defined by the MPI standard, and is a restricted form of neighborhood collective communication on general, distributed graph topologies. We show that the restriction to isomorphic neighborhoods permits communication improvements beyond what is possible for unrestricted graph topologies by presenting non-trivial message-combining algorithms that reduce communication latency for Cartesian alltoall and allgather collective operations. For both types of communication, the required communication schedules can be computed in time linear in the size of the input neighborhood. Our benchmarks show that, for small data block sizes, we can substantially outperform general MPI neighborhood collectives implementing the same communication pattern. We discuss different possibilities for supporting Cartesian Collective Communication in MPI. Our library is implemented on top of MPI and uses the same signatures for the collective communication operations as the MPI (neighborhood) collectives. Our implementation requires essentially only a single new communicator-creation function, and even this might not be needed if implemented inside an MPI library.
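Because every process holds the same list of relative offsets, each one can derive its neighborhood locally with no communication. A minimal sketch of that derivation on a torus (the stencil and dimensions are illustrative):

```python
# Each process computes its neighbors from shared relative offsets,
# with periodic (torus) wrap-around in every dimension.
def neighbors(coord, offsets, dims):
    """coord: this process's Cartesian coordinate.
    offsets: relative offsets defining the (isomorphic) neighborhood.
    dims: extent of each torus dimension."""
    return [tuple((c + o) % d for c, o, d in zip(coord, off, dims))
            for off in offsets]

# Example: 4-point stencil (self excluded) on a 4x4 torus.
stencil = [(-1, 0), (1, 0), (0, -1), (0, 1)]
```

Since all processes apply the same offsets, the induced communication pattern is symmetric, which is what lets each process build a deadlock-free schedule without coordination.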
Citations: 8
Speculative Scheduling for Stochastic HPC Applications
Pub Date : 2019-08-05 DOI: 10.1145/3337821.3337890
Ana Gainaru, Guillaume Pallez, Hongyang Sun, P. Raghavan
New emerging fields are developing a growing number of large-scale applications with heterogeneous, dynamic, and data-intensive requirements that put a high emphasis on productivity and thus are not tuned to run efficiently on today's high-performance computing (HPC) systems. Some of these applications, such as neuroscience workloads and those that use adaptive numerical algorithms, produce modeling and simulation workflows with stochastic execution times and unpredictable resource requirements. When they are deployed on current HPC systems using existing resource management solutions, the result can be a loss of efficiency for users and a decrease in effective system utilization for platform providers. In this paper, we consider the current HPC scheduling model and describe the challenge it poses for stochastic applications due to the strict requirements of its job deployment policies. To address this challenge, we present speculative scheduling techniques that adapt the resource requirements of a stochastic application on the fly, based on its past execution behavior rather than on estimates given by the user. We focus on improving overall system utilization and application response time without disrupting the current HPC scheduling model or the application development process. Our solution can operate alongside existing HPC batch schedulers without interfering with their usage modes. We show that speculative scheduling can improve system utilization and average application response time by 25-30% compared to the classical HPC approach.
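One common way to schedule a job with stochastic walltime from its history is to request an optimistic reservation first and escalate on failure; this is only a sketch of that general idea under invented quantile choices, not the paper's actual policy:

```python
# Hypothetical sketch: derive a sequence of walltime requests from past runs.
def speculative_requests(history, quantiles=(0.5, 0.9, 1.0)):
    """history: observed walltimes of past executions.
    Returns escalating reservation requests: try the median first, and if
    the job is killed for exceeding it, resubmit with a larger quantile."""
    s = sorted(history)
    def q(p):
        # Simple (conservative) empirical quantile via index rounding.
        return s[min(len(s) - 1, int(p * len(s)))]
    return [q(p) for p in quantiles]
```

With observed walltimes `[10, 20, 30, 40]`, the sketch requests 30 first, then 40 on each escalation; short runs finish cheaply under the small reservation, and only outliers pay the resubmission cost.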
Citations: 10
Faster parallel collision detection at high resolution for CNC milling applications
Pub Date : 2019-08-05 DOI: 10.1145/3337821.3337838
Xin Chen, Dmytro Konobrytskyi, Thomas M. Tucker, T. Kurfess, R. Vuduc
This paper presents a new and more work-efficient parallel method to speed up a class of three-dimensional collision detection (CD) problems, which arise, for instance, in computer numerical control (CNC) milling. Given two objects, one enclosed by a bounding volume and the other represented by a voxel model, we wish to determine all possible orientations of the bounded object around a given point that do not cause collisions. Underlying most CD methods are three types of geometrical operations that become bottlenecks: decompositions, rotations, and projections. Our proposed approach, which we call the aggressive inaccessible cone angle (AICA) method, simplifies these operations and, empirically, can prune as much as 99% of the intersection tests that would otherwise be required while improving load balance. We validate our techniques by implementing a parallel version of AICA in SculptPrint, a state-of-the-art computer-aided manufacturing (CAM) application used for CNC milling, on GPU platforms. Experimental results using 4 CAM benchmarks show that AICA can be over 23× faster than a baseline method that does not prune projections, and can check collisions for 4096 angle orientations in an object represented by 27 million voxels in less than 18 milliseconds on a GPU.
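The cone-based pruning idea can be illustrated independently of the paper's exact formulation: candidate orientations falling inside an inaccessible cone around a blocked direction can be rejected without running their expensive intersection tests. The function below is a hedged geometric sketch, not the AICA algorithm itself:

```python
import math

# Illustrative cone pruning over unit-vector orientations (not the paper's AICA).
def prune_orientations(candidates, blocked_dirs, cone_half_angle_deg):
    """Discard candidate orientations within `cone_half_angle_deg` of any
    blocked direction; survivors still need an exact intersection test."""
    cos_t = math.cos(math.radians(cone_half_angle_deg))
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    # A candidate is inside a cone when its dot product with the cone axis
    # exceeds cos(half-angle); keep only candidates outside every cone.
    return [c for c in candidates
            if all(dot(c, b) < cos_t for b in blocked_dirs)]
```

With a single blocked direction along +z and a 45-degree half-angle, an orientation pointing straight up is pruned while one along +x survives for exact testing.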
Citations: 1
DeepHash
Pub Date : 2019-08-05 DOI: 10.1145/3337821.3337924
Yuanning Gao, Xiaofeng Gao, Guihai Chen
In distributed file systems, distributed metadata management can be viewed as a mapping problem: how to effectively map the metadata namespace tree to multiple metadata servers (MDS's). Traditional distributed metadata management schemes presume a rigid mapping function and thus fail to adaptively meet the requirements of different applications. To better exploit the current distribution of the metadata, in this exploratory paper we present the first machine-learning-based model, called DeepHash, which leverages a deep neural network to learn a locality-preserving hashing (LPH) mapping. To help learn a good positional relationship among metadata nodes in the namespace tree, we first present a metadata representation strategy. Because training labels, i.e., the hash values of metadata nodes, are absent, we design two loss functions with distinct characteristics to train DeepHash, a pair loss and a triplet loss, and introduce sampling strategies for both approaches. We conduct extensive experiments on the Amazon EC2 platform to compare the performance of DeepHash with traditional and state-of-the-art schemes. The results demonstrate that DeepHash preserves metadata locality well while maintaining high load balance, confirming its effectiveness and efficiency.
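The triplet objective used for locality-preserving training has a standard form: for an anchor node, a nearby namespace node (positive), and a distant one (negative), the positive's hash distance should undercut the negative's by a margin. A minimal sketch of that generic loss (the margin and squared-distance choice are illustrative, not necessarily DeepHash's exact configuration):

```python
# Generic triplet loss over hash embeddings (illustrative configuration).
def triplet_loss(anchor, positive, negative, margin=1.0):
    """Penalizes embeddings where the anchor is not at least `margin`
    closer (in squared distance) to its positive than to its negative."""
    def d(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return max(0.0, d(anchor, positive) - d(anchor, negative) + margin)
```

The loss is zero once the margin is satisfied, so training pressure applies only to triplets whose hash layout still violates namespace locality.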
Citations: 2
SAFE: Service Availability via Failure Elimination Through VNF Scaling
Pub Date : 2019-08-05 DOI: 10.1145/3337821.3337832
Rui Xia, Haipeng Dai, Jiaqi Zheng, Rong Gu, Xiaoyu Wang, Guihai Chen
Virtualized network functions (VNFs) enable software applications to replace traditional middleboxes, making network service provision more flexible and scalable. This paper focuses on ensuring Service Availability via Failure Elimination (SAFE) through VNF scaling: given the resource requirements of VNF instances, find an optimal and robust instance consolidation strategy that can recover quickly from a single instance failure. To address this problem, we present a framework based on rounding and dynamic programming. First, we discretize the range of resource requirements for VNF instance deployment into several sub-ranges, so that the number of instance types becomes a constant. Second, we further reduce the number of instance types by gathering several small instances into a bigger one. Third, we propose an algorithm based on dynamic programming to solve the instance consolidation problem with a limited number of instance types. We set up a testbed to profile the relationship between resources and throughput for different types of VNF instances, and run simulations driven by these profiling results to validate our theoretical findings. The simulation results show that our algorithm outperforms the standby deployment model by 27.33% on average in terms of the number of servers required. Furthermore, SAFE incurs marginal overhead, around 7.22%, compared to an instance consolidation strategy that does not consider VNF backup.
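The first step of the framework, rounding resource requirements into a constant number of sub-ranges, can be sketched directly; the bucket boundaries below are hypothetical, chosen only to illustrate the discretization:

```python
# Sketch of SAFE's rounding step: map each VNF instance's requirement up to
# the nearest bucket boundary so only a constant number of types remain.
def discretize(requirements, buckets):
    """requirements: per-instance resource demands (e.g., fraction of a server).
    buckets: sorted boundary values; each demand rounds up to the first
    boundary that covers it, slightly over-provisioning in exchange for a
    tractable dynamic program over few instance types."""
    return [min(b for b in sorted(buckets) if b >= r) for r in requirements]
```

Rounding up never under-provisions an instance, so any consolidation the dynamic program finds for the rounded types remains feasible for the original demands.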
Citations: 9
A Read-leveling Data Distribution Scheme for Promoting Read Performance in SSDs with Deduplication
Pub Date : 2019-08-05 DOI: 10.1145/3337821.3337884
Mengting Lu, F. Wang, D. Feng, Yuchong Hu
Deduplication, as a space-saving technology, is widely deployed in flash-based storage systems to address the capacity and endurance limitations of flash devices. In this paper, we find that deduplication changes the physical data layout, which increases the likelihood of an uneven read distribution. This uneven read distribution not only increases access contention but also reduces read parallelism, leading to read performance degradation. To solve this issue, we propose an efficient read-leveling data distribution scheme (RLDDS), which scatters highly duplicated data across different parallel units, to improve the read performance of deduplicated SSDs under access-intensive workloads. RLDDS writes data into the parallel unit with the lowest potential read-hotness to balance the read distribution among all parallel units. Extensive experimental results show that RLDDS improves read performance by up to 21.61% compared to deduplication with the conventional dynamic data allocation scheme. Additional benefits of RLDDS include improved write performance (up to 23.69%) in access-intensive workloads and an overall system performance improvement (up to 18.22%) with the same write traffic reduction.
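The placement rule the abstract describes can be sketched as follows. This is a hedged illustration, not the paper's implementation: `choose_unit`, the per-unit hotness counters, and the use of the deduplication reference count as a read-hotness proxy are assumptions for the sketch.

```python
def choose_unit(hotness, ref_count):
    """Pick the parallel unit with the lowest accumulated read-hotness
    and charge it the expected future reads of the new page. Pages with
    a higher deduplication reference count are expected to be read more
    often, so they raise a unit's hotness more."""
    unit = min(range(len(hotness)), key=lambda i: hotness[i])
    hotness[unit] += ref_count
    return unit

# Four parallel units (e.g. channels/dies); write a mix of unique
# pages (ref_count == 1) and highly duplicated ones (ref_count > 1).
hotness = [0.0] * 4
placements = [choose_unit(hotness, rc) for rc in [8, 1, 1, 1, 5, 1, 1, 2]]
print(placements, hotness)
```

Note how the two hot pages (reference counts 8 and 5) land on different units, keeping the accumulated read-hotness roughly level instead of piling duplicated data onto one channel.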
Citations: 6
FuncyTuner
Pub Date : 2019-08-05 DOI: 10.1145/3337821.3337842
Tao Wang, Nikhil Jain, D. Beckingsale, David Boehme, F. Mueller, T. Gamblin
The de facto compilation model for production software compiles all modules of a target program with a single set of compilation flags, typically -O2 or -O3. Such a per-program compilation strategy may yield sub-optimal executables, since programs often have multiple hot loops with diverse code structures and may be better optimized with a per-region compilation model that assembles an optimized executable by combining the best per-region code variants. In this paper, we demonstrate that a naïve greedy approach to per-region compilation often degrades performance in comparison to the -O3 baseline. To overcome this problem, we contribute a novel per-loop compilation framework, FuncyTuner, which employs lightweight profiling to collect per-loop timing information and then utilizes a space-focusing technique to construct a performant executable. Experimental results show that FuncyTuner reliably improves the performance of modern scientific applications on several multi-core architectures by 9.2% to 12.3% and 4.5% to 10.7% (geometric mean, up to 22% on certain programs) in comparison to the -O3 baseline and prior work, respectively.
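The selection step behind a per-loop compilation model can be illustrated with a small sketch. This is an assumption-laden toy, not FuncyTuner's API: the flag sets, loop names, and timings are made up, and `best_variants` only shows the idea of picking the fastest profiled variant per region.

```python
def best_variants(timings):
    """Given profiled per-loop timings under each flag set
    (flag -> {loop: seconds}), pick the fastest flag per loop --
    the selection step behind assembling a per-region executable."""
    loops = next(iter(timings.values())).keys()
    return {
        loop: min(timings, key=lambda flag: timings[flag][loop])
        for loop in loops
    }

# Hypothetical timings for two hot loops under three flag sets.
timings = {
    "-O2":          {"loopA": 1.90, "loopB": 0.80},
    "-O3":          {"loopA": 1.40, "loopB": 0.95},
    "-O3 -funroll": {"loopA": 1.55, "loopB": 0.70},
}
print(best_variants(timings))  # loopA -> -O3, loopB -> -O3 -funroll
```

The greedy per-loop choice shown here is exactly what the paper finds insufficient on its own (variants interact through code layout and inlining), which motivates FuncyTuner's space-focusing search over combinations.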
Citations: 5
Journal: Proceedings of the 48th International Conference on Parallel Processing