Proceedings of the 48th International Conference on Parallel Processing: Latest Publications

Modeling the Performance of Atomic Primitives on Modern Architectures
Pub Date : 2019-08-05 DOI: 10.1145/3337821.3337901
F. Hoseini, A. Atalar, P. Tsigas
Utilizing the atomic primitives of a processor to access a memory location atomically is key to the correctness and feasibility of parallel software systems. The performance of atomics plays a significant role in the scalability and overall performance of parallel software systems. In this work, we study the performance (in terms of latency, throughput, fairness, and energy consumption) of atomic primitives in the context of the two common software execution settings that result in high- and low-contention access to shared memory. We perform and present an exhaustive study of the performance of atomics in these two application contexts and propose a performance model that captures their behavior. We consider two state-of-the-art architectures: Intel Xeon E5 and Xeon Phi (KNL). The proposed model is centered around the bouncing of cache lines between threads that execute atomic primitives on shared cache lines. The model is simple to use in practice, captures the behavior of atomics accurately under these execution scenarios, and facilitates algorithmic design decisions in multi-threaded programming.
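The cache-line-bouncing intuition behind the model can be captured in a toy form: under contention, each atomic operation must first acquire the line from its previous owner, so aggregate throughput is roughly flat in the thread count, while uncontended operations scale. A minimal sketch (the function name and latency constants below are invented for illustration, not the paper's actual model):

```python
# Toy throughput model for atomic primitives (all names/constants hypothetical).
def atomic_throughput(threads, t_local_ns, t_transfer_ns, contended):
    """Estimated aggregate atomic operations per second.

    Under contention, every operation must first pull the cache line from
    the previous owner, so operations serialize at roughly one completion
    per line hand-off; without contention, each thread proceeds on its own
    cache line and throughput scales with the thread count.
    """
    if contended:
        # Serialized: one cache-line transfer per completed operation.
        return 1e9 / (t_local_ns + t_transfer_ns)
    # Uncontended: threads operate independently on private lines.
    return threads * 1e9 / t_local_ns
```

Under this sketch, doubling the thread count leaves contended throughput unchanged but doubles uncontended throughput, which is the qualitative behavior the model is meant to capture.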
Citations: 5
Adaptive Routing Reconfigurations to Minimize Flow Cost in SDN-Based Data Center Networks
Pub Date : 2019-08-05 DOI: 10.1145/3337821.3337861
Akbar Majidi, Xiaofeng Gao, S. Zhu, Nazila Jahanbakhsh, Guihai Chen
Data center networks have become heavily reliant on software-defined networking (SDN) to orchestrate data transmission. To maintain optimal network configurations, a controller needs to solve the multi-commodity flow problem and globally update the network under tight time constraints. In this paper, we aim to minimize flow cost (intuitively, the average transmission delay) under reconfiguration budget constraints in data centers. We formulate this optimization problem as a constrained Markov Decision Process and propose a set of algorithms to solve it in a scalable manner. We first develop a propagation algorithm to identify the flows that are most affected in terms of latency and should be reconfigured in the next network update. We then bound the range of each update, improving adaptability and scalability by reconfiguring fewer flows at a time, which also keeps each operation fast. Further, based on the drift-plus-penalty method from Lyapunov theory, we propose a heuristic policy that requires no prior information about flow demand yet carries a performance guarantee, minimizing the additive optimality gap. To the best of our knowledge, this is the first paper to study the range and frequency of flow reconfigurations, which has both theoretical and practical significance in this area. Extensive emulations and numerical simulations, whose results are much better than the estimated theoretical bound, show that our proposed policy outperforms state-of-the-art algorithms in terms of latency by over 45% while also improving adaptability and scalability.
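The budgeted-update idea (reconfigure only the most latency-affected flows each round) can be sketched with a simple greedy selection; the function and its delay metric are illustrative stand-ins, not the paper's propagation algorithm:

```python
# Illustrative sketch: pick at most `budget` flows to reconfigure this round,
# ranked by how much their current routing inflates delay (values hypothetical).
def pick_flows_to_update(flow_delays, budget):
    """flow_delays: mapping flow-id -> estimated excess delay on its current path.
    Returns the `budget` flows with the largest excess delay, most affected first."""
    ranked = sorted(flow_delays.items(), key=lambda kv: kv[1], reverse=True)
    return [flow for flow, _ in ranked[:budget]]
```

For example, with excess delays `{"a": 5, "b": 9, "c": 1}` and a budget of 2, only `b` and `a` would be reconfigured in this round; `c` waits for a later update.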
Citations: 9
Unleashing the Scalability Potential of Power-Constrained Data Center in the Microservice Era
Pub Date : 2019-08-05 DOI: 10.1145/3337821.3337857
Xiaofeng Hou, Jiacheng Liu, Chao Li, M. Guo
Recent scale-out cloud services have undergone a shift from monolithic applications to microservices by putting each functionality into lightweight software containers. Although traditional data center power optimization frameworks excel at per-server or per-rack management, they can hardly make informed decisions when facing microservices that have different QoS requirements on a per-service basis. In a power-constrained data center, blindly budgeting power usage can lead to a power imbalance: microservices on the critical path may not receive an adequate power budget, which unavoidably hinders the growth of cloud productivity. To unleash the performance potential of the cloud in the microservice era, this paper investigates microservice-aware data center resource management. We model microservices using a bipartite graph and propose a metric called the microservice criticality factor (MCF) to measure the overall impact of performance scaling on a microservice from the whole application's perspective. We further devise ServiceFridge, a novel system framework that leverages MCF to jointly orchestrate software containers and control hardware power demand. Our detailed case study on a practical microservice application demonstrates that ServiceFridge allows a data center to reduce its dynamic power by 25% with only a slight performance loss. It improves the mean response time by 25.2% and the 90th-percentile tail latency by 18.0% compared with existing schemes.
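The abstract does not spell out the MCF formula, but the intent (score a service by its impact on the whole application) can be illustrated with a crude proxy: the fraction of end-to-end request paths in the call graph that traverse the service. Everything below, including the graph shape and the metric itself, is an invented illustration, not the paper's bipartite-graph definition:

```python
# Hypothetical criticality proxy over a call graph (adjacency dict).
def criticality(call_graph, service, root="root"):
    """Fraction of root-to-leaf request paths that pass through `service`.
    A service on every path scores 1.0; an off-path service scores lower."""
    def paths(node):
        children = call_graph.get(node, [])
        if not children:
            return [[node]]  # leaf: one path ending here
        return [[node] + p for c in children for p in paths(c)]
    all_paths = paths(root)
    return sum(1 for p in all_paths if service in p) / len(all_paths)

# Example topology (hypothetical): root fans out to auth and search; search calls db.
calls = {"root": ["auth", "search"], "search": ["db"]}
```

Under this proxy, `db` lies on one of the two request paths and scores 0.5, while `root` lies on all paths and scores 1.0; a power budgeter would prioritize high scorers.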
Citations: 12
Cartesian Collective Communication
Pub Date : 2019-08-05 DOI: 10.1145/3337821.3337848
J. Träff, S. Hunold
We introduce Cartesian Collective Communication as sparse, collective communication defined on processes (processors) organized into d-dimensional tori or meshes. Processes specify local neighborhoods, e.g., stencil patterns, by lists of relative Cartesian coordinate offsets. The Cartesian collective operations perform data exchanges (and reductions) over the set of all neighborhoods such that each process communicates with the processes in its local neighborhood. The key requirement is that local neighborhoods must be structurally identical (isomorphic). This makes it possible for processes to compute correct, deadlock-free, efficient communication schedules for the collective operations locally, without any interaction with other processes. Cartesian Collective Communication substantially extends the collective neighborhood communication on Cartesian communicators defined by the MPI standard, and is a restricted form of neighborhood collective communication on general, distributed graph topologies. We show that the restriction to isomorphic neighborhoods permits communication improvements beyond what is possible for unrestricted graph topologies by presenting non-trivial message-combining algorithms that reduce communication latency for Cartesian alltoall and allgather collective operations. For both types of communication, the required communication schedules can be computed in time linear in the size of the input neighborhood. Our benchmarks show that, for small data block sizes, we can substantially outperform general MPI neighborhood collectives implementing the same communication pattern. We discuss different possibilities for supporting Cartesian Collective Communication in MPI. Our library is implemented on top of MPI and uses the same signatures for the collective communication operations as the MPI (neighborhood) collectives. Our implementation requires essentially only a single new communicator-creation function, and even this might not be needed if implemented inside an MPI library.
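Because every process holds the same list of relative offsets, each one can derive its neighborhood locally with no communication. A minimal sketch of that derivation on a torus (the stencil and dimensions are illustrative):

```python
# Each process computes its neighbors from shared relative offsets,
# with periodic (torus) wrap-around in every dimension.
def neighbors(coord, offsets, dims):
    """coord: this process's Cartesian coordinate.
    offsets: relative offsets defining the (isomorphic) neighborhood.
    dims: extent of each torus dimension."""
    return [tuple((c + o) % d for c, o, d in zip(coord, off, dims))
            for off in offsets]

# Example: 4-point stencil (self excluded) on a 4x4 torus.
stencil = [(-1, 0), (1, 0), (0, -1), (0, 1)]
```

Since all processes apply the same offsets, the induced communication pattern is symmetric, which is what lets each process build a deadlock-free schedule without coordination.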
Citations: 8
Speculative Scheduling for Stochastic HPC Applications
Pub Date : 2019-08-05 DOI: 10.1145/3337821.3337890
Ana Gainaru, Guillaume Pallez, Hongyang Sun, P. Raghavan
New emerging fields are developing a growing number of large-scale applications with heterogeneous, dynamic, and data-intensive requirements that put a high emphasis on productivity and thus are not tuned to run efficiently on today's high-performance computing (HPC) systems. Some of these applications, such as neuroscience workloads and those that use adaptive numerical algorithms, produce modeling and simulation workflows with stochastic execution times and unpredictable resource requirements. When they are deployed on current HPC systems using existing resource management solutions, the result can be a loss of efficiency for users and a decrease in effective system utilization for platform providers. In this paper, we consider the current HPC scheduling model and describe the challenge it poses for stochastic applications due to the strict requirements of its job deployment policies. To address this challenge, we present speculative scheduling techniques that adapt the resource requirements of a stochastic application on the fly, based on its past execution behavior rather than on estimates given by the user. We focus on improving overall system utilization and application response time without disrupting the current HPC scheduling model or the application development process. Our solution can operate alongside existing HPC batch schedulers without interfering with their usage modes. We show that speculative scheduling can improve system utilization and average application response time by 25-30% compared to the classical HPC approach.
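One common way to schedule a job with stochastic walltime from its history is to request an optimistic reservation first and escalate on failure; this is only a sketch of that general idea under invented quantile choices, not the paper's actual policy:

```python
# Hypothetical sketch: derive a sequence of walltime requests from past runs.
def speculative_requests(history, quantiles=(0.5, 0.9, 1.0)):
    """history: observed walltimes of past executions.
    Returns escalating reservation requests: try the median first, and if
    the job is killed for exceeding it, resubmit with a larger quantile."""
    s = sorted(history)
    def q(p):
        # Simple (conservative) empirical quantile via index rounding.
        return s[min(len(s) - 1, int(p * len(s)))]
    return [q(p) for p in quantiles]
```

With observed walltimes `[10, 20, 30, 40]`, the sketch requests 30 first, then 40 on each escalation; short runs finish cheaply under the small reservation, and only outliers pay the resubmission cost.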
Citations: 10
Faster parallel collision detection at high resolution for CNC milling applications
Pub Date : 2019-08-05 DOI: 10.1145/3337821.3337838
Xin Chen, Dmytro Konobrytskyi, Thomas M. Tucker, T. Kurfess, R. Vuduc
This paper presents a new and more work-efficient parallel method to speed up a class of three-dimensional collision detection (CD) problems, which arise, for instance, in computer numerical control (CNC) milling. Given two objects, one enclosed by a bounding volume and the other represented by a voxel model, we wish to determine all possible orientations of the bounded object around a given point that do not cause collisions. Underlying most CD methods are three types of geometrical operations that become bottlenecks: decompositions, rotations, and projections. Our proposed approach, which we call the aggressive inaccessible cone angle (AICA) method, simplifies these operations and, empirically, can prune as much as 99% of the intersection tests that would otherwise be required while improving load balance. We validate our techniques by implementing a parallel version of AICA in SculptPrint, a state-of-the-art computer-aided manufacturing (CAM) application used for CNC milling, on GPU platforms. Experimental results using 4 CAM benchmarks show that AICA can be over 23× faster than a baseline method that does not prune projections, and can check collisions for 4096 angle orientations in an object represented by 27 million voxels in less than 18 milliseconds on a GPU.
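The cone-based pruning idea can be illustrated independently of the paper's exact formulation: candidate orientations falling inside an inaccessible cone around a blocked direction can be rejected without running their expensive intersection tests. The function below is a hedged geometric sketch, not the AICA algorithm itself:

```python
import math

# Illustrative cone pruning over unit-vector orientations (not the paper's AICA).
def prune_orientations(candidates, blocked_dirs, cone_half_angle_deg):
    """Discard candidate orientations within `cone_half_angle_deg` of any
    blocked direction; survivors still need an exact intersection test."""
    cos_t = math.cos(math.radians(cone_half_angle_deg))
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    # A candidate is inside a cone when its dot product with the cone axis
    # exceeds cos(half-angle); keep only candidates outside every cone.
    return [c for c in candidates
            if all(dot(c, b) < cos_t for b in blocked_dirs)]
```

With a single blocked direction along +z and a 45-degree half-angle, an orientation pointing straight up is pruned while one along +x survives for exact testing.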
Citations: 1
DeepHash
Pub Date : 2019-08-05 DOI: 10.1145/3337821.3337924
Yuanning Gao, Xiaofeng Gao, Guihai Chen
In distributed file systems, distributed metadata management can be viewed as a mapping problem: how to effectively map the metadata namespace tree to multiple metadata servers (MDS's). Traditional distributed metadata management schemes presume a rigid mapping function and thus fail to adaptively meet the requirements of different applications. To better exploit the current distribution of the metadata, in this exploratory paper we present the first machine-learning-based model, called DeepHash, which leverages a deep neural network to learn a locality-preserving hashing (LPH) mapping. To help learn a good positional relationship among metadata nodes in the namespace tree, we first present a metadata representation strategy. Because training labels, i.e., the hash values of metadata nodes, are absent, we design two loss functions with distinct characteristics to train DeepHash, a pair loss and a triplet loss, and introduce sampling strategies for both approaches. We conduct extensive experiments on the Amazon EC2 platform to compare the performance of DeepHash with traditional and state-of-the-art schemes. The results demonstrate that DeepHash preserves metadata locality well while maintaining high load balance, confirming its effectiveness and efficiency.
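The triplet objective used for locality-preserving training has a standard form: for an anchor node, a nearby namespace node (positive), and a distant one (negative), the positive's hash distance should undercut the negative's by a margin. A minimal sketch of that generic loss (the margin and squared-distance choice are illustrative, not necessarily DeepHash's exact configuration):

```python
# Generic triplet loss over hash embeddings (illustrative configuration).
def triplet_loss(anchor, positive, negative, margin=1.0):
    """Penalizes embeddings where the anchor is not at least `margin`
    closer (in squared distance) to its positive than to its negative."""
    def d(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return max(0.0, d(anchor, positive) - d(anchor, negative) + margin)
```

The loss is zero once the margin is satisfied, so training pressure applies only to triplets whose hash layout still violates namespace locality.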
Citations: 2
SAFE: Service Availability via Failure Elimination Through VNF Scaling
Pub Date : 2019-08-05 DOI: 10.1145/3337821.3337832
Rui Xia, Haipeng Dai, Jiaqi Zheng, Rong Gu, Xiaoyu Wang, Guihai Chen
Virtualized network functions (VNFs) enable software applications to replace traditional middleboxes, making network service provision more flexible and scalable. This paper focuses on ensuring Service Availability via Failure Elimination (SAFE) through VNF scaling: given the resource requirements of VNF instances, find an optimal and robust instance consolidation strategy that can recover quickly from a single instance failure. To address this problem, we present a framework based on rounding and dynamic programming. First, we discretize the range of resource requirements for VNF instance deployment into several sub-ranges, so that the number of instance types becomes a constant. Second, we further reduce the number of instance types by gathering several small instances into a bigger one. Third, we propose an algorithm based on dynamic programming to solve the instance consolidation problem with a limited number of instance types. We set up a testbed to profile the relationship between resources and throughput for different types of VNF instances, and run simulations driven by these profiling results to validate our theoretical findings. The simulation results show that our algorithm outperforms the standby deployment model by 27.33% on average in terms of the number of servers required. Furthermore, SAFE incurs marginal overhead, around 7.22%, compared to an instance consolidation strategy that does not consider VNF backup.
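The first step of the framework, rounding resource requirements into a constant number of sub-ranges, can be sketched directly; the bucket boundaries below are hypothetical, chosen only to illustrate the discretization:

```python
# Sketch of SAFE's rounding step: map each VNF instance's requirement up to
# the nearest bucket boundary so only a constant number of types remain.
def discretize(requirements, buckets):
    """requirements: per-instance resource demands (e.g., fraction of a server).
    buckets: sorted boundary values; each demand rounds up to the first
    boundary that covers it, slightly over-provisioning in exchange for a
    tractable dynamic program over few instance types."""
    return [min(b for b in sorted(buckets) if b >= r) for r in requirements]
```

Rounding up never under-provisions an instance, so any consolidation the dynamic program finds for the rounded types remains feasible for the original demands.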
Citations: 9
A Read-leveling Data Distribution Scheme for Promoting Read Performance in SSDs with Deduplication
Pub Date : 2019-08-05 DOI: 10.1145/3337821.3337884
Mengting Lu, F. Wang, D. Feng, Yuchong Hu
Deduplication, as a space-saving technology, is widely deployed in flash-based storage systems to address the capacity and endurance limitations of flash devices. In this paper, we find that deduplication changes the physical data layout, which increases the likelihood of an uneven read distribution. This uneven read distribution not only increases access contention but also reduces read parallelism, leading to read performance degradation. To solve this issue, we propose an efficient read-leveling data distribution scheme (RLDDS), which scatters highly duplicated data across different parallel units, to improve the read performance of deduplicated SSDs under access-intensive workloads. RLDDS writes data into the parallel unit with the lowest potential read-hotness to balance the read distribution among all parallel units. Extensive experimental results show that RLDDS improves read performance by up to 21.61% compared to deduplication with the conventional dynamic data allocation scheme. Additional benefits of RLDDS include improved write performance (up to 23.69%) in access-intensive workloads and an overall system performance improvement (up to 18.22%) with the same write traffic reduction.
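The placement rule the abstract describes can be sketched as follows. This is a hedged illustration, not the paper's implementation: `choose_unit`, the per-unit hotness counters, and the use of the deduplication reference count as a read-hotness proxy are assumptions for the sketch.

```python
def choose_unit(hotness, ref_count):
    """Pick the parallel unit with the lowest accumulated read-hotness
    and charge it the expected future reads of the new page. Pages with
    a higher deduplication reference count are expected to be read more
    often, so they raise a unit's hotness more."""
    unit = min(range(len(hotness)), key=lambda i: hotness[i])
    hotness[unit] += ref_count
    return unit

# Four parallel units (e.g. channels/dies); write a mix of unique
# pages (ref_count == 1) and highly duplicated ones (ref_count > 1).
hotness = [0.0] * 4
placements = [choose_unit(hotness, rc) for rc in [8, 1, 1, 1, 5, 1, 1, 2]]
print(placements, hotness)
```

Note how the two hot pages (reference counts 8 and 5) land on different units, keeping the accumulated read-hotness roughly level instead of piling duplicated data onto one channel.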
Citations: 6
FuncyTuner
Pub Date : 2019-08-05 DOI: 10.1145/3337821.3337842
Tao Wang, Nikhil Jain, D. Beckingsale, David Boehme, F. Mueller, T. Gamblin
The de facto compilation model for production software compiles all modules of a target program with a single set of compilation flags, typically -O2 or -O3. Such a per-program compilation strategy may yield sub-optimal executables, since programs often have multiple hot loops with diverse code structures and may be better optimized with a per-region compilation model that assembles an optimized executable by combining the best per-region code variants. In this paper, we demonstrate that a naïve greedy approach to per-region compilation often degrades performance in comparison to the -O3 baseline. To overcome this problem, we contribute a novel per-loop compilation framework, FuncyTuner, which employs lightweight profiling to collect per-loop timing information and then utilizes a space-focusing technique to construct a performant executable. Experimental results show that FuncyTuner reliably improves the performance of modern scientific applications on several multi-core architectures by 9.2% to 12.3% and 4.5% to 10.7% (geometric mean, up to 22% on certain programs) in comparison to the -O3 baseline and prior work, respectively.
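The selection step behind a per-loop compilation model can be illustrated with a small sketch. This is an assumption-laden toy, not FuncyTuner's API: the flag sets, loop names, and timings are made up, and `best_variants` only shows the idea of picking the fastest profiled variant per region.

```python
def best_variants(timings):
    """Given profiled per-loop timings under each flag set
    (flag -> {loop: seconds}), pick the fastest flag per loop --
    the selection step behind assembling a per-region executable."""
    loops = next(iter(timings.values())).keys()
    return {
        loop: min(timings, key=lambda flag: timings[flag][loop])
        for loop in loops
    }

# Hypothetical timings for two hot loops under three flag sets.
timings = {
    "-O2":          {"loopA": 1.90, "loopB": 0.80},
    "-O3":          {"loopA": 1.40, "loopB": 0.95},
    "-O3 -funroll": {"loopA": 1.55, "loopB": 0.70},
}
print(best_variants(timings))  # loopA -> -O3, loopB -> -O3 -funroll
```

The greedy per-loop choice shown here is exactly what the paper finds insufficient on its own (variants interact through code layout and inlining), which motivates FuncyTuner's space-focusing search over combinations.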
Citations: 5
Journal: Proceedings of the 48th International Conference on Parallel Processing