2012 ACM/IEEE/SCS 26th Workshop on Principles of Advanced and Distributed Simulation: Latest Publications

Partitioning on Dynamic Behavior for Parallel Discrete Event Simulation

Pub Date : 2012-07-15 DOI: 10.1109/PADS.2012.32

Ketan Bahulkar, Jingjing Wang, N. Abu-Ghazaleh, D. Ponomarev

Partitioning plays an important role in PDES performance because of the high communication cost on parallel platforms and the fine granularity of most simulation models. Traditionally, models are partitioned by deriving the static communication graph of objects and applying graph partitioning to reduce the mincut while balancing the number of objects per partition. However, many, if not all, models exhibit great diversity in their dynamic behavior: objects communicate with each other at diverse frequencies that are commonly power-law distributed. Similar diversity exists in the activity of objects and the processing requirements of events. In this paper, we argue that partitioning based on static graphs ignores these effects, leading to poor partitioning. We examine how partitioning based on dynamic information should be approached and explore policies that focus on communication cost, on load balancing, and on both. We show that on multicore clusters, dynamic partitioning achieves up to 4x better performance than static partitioning. On the AMD Magny-Cours, where the communication latency is low, dynamic partitioning results in a 2x performance improvement over static partitioning for some of our models. Our future work considers how to derive the dynamic weights (in this study, we do that through profiling), and how to balance the importance of communication and computation in a way that is informed by the underlying architecture.
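
The difference between a static and a dynamic view of the communication graph can be illustrated with a minimal sketch. It is not the authors' partitioner: the object names, message counts, and the two candidate partitions are invented for illustration, and `cut_cost` is a hypothetical helper. The point is only that two partitions with the same static cut can differ greatly once edges are weighted by profiled message counts.

```python
def cut_cost(edge_weights, partition):
    """Sum the weights of edges whose endpoints lie in different partitions."""
    return sum(w for (u, v), w in edge_weights.items()
               if partition[u] != partition[v])

# Hypothetical profile: messages exchanged between object pairs during a run.
message_counts = {
    ("a", "b"): 1000, ("c", "d"): 900,
    ("b", "c"): 2, ("a", "d"): 3, ("a", "c"): 1,
}
# Static view: every communicating pair is an edge of weight 1.
static_graph = {edge: 1 for edge in message_counts}

# Two candidate two-way partitions with two objects each.
split_1 = {"a": 0, "b": 0, "c": 1, "d": 1}
split_2 = {"a": 0, "d": 0, "b": 1, "c": 1}

static_costs = (cut_cost(static_graph, split_1), cut_cost(static_graph, split_2))
dynamic_costs = (cut_cost(message_counts, split_1), cut_cost(message_counts, split_2))
print(static_costs)   # both partitions cut three edges
print(dynamic_costs)  # but the second one separates the two busiest pairs
```

Both partitions are equally balanced by object count and cut the same number of static edges, so a static partitioner has no reason to prefer one; the weighted cut separates them clearly.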

Citations: 13

Multi-level Parallelism for Time- and Cost-Efficient Parallel Discrete Event Simulation on GPUs

Pub Date : 2012-07-15 DOI: 10.1109/PADS.2012.27

G. Kunz, Daniel Schemmel, J. Gross, Klaus Wehrle

Developing complex technical systems requires a systematic exploration of the given design space in order to identify optimal system configurations. However, studying the effects and interactions of even a small number of system parameters often requires a large number of simulation runs. This in turn results in excessive runtime demands, which severely hamper thorough design space explorations. In this paper, we present a parallel discrete event simulation scheme that enables cost- and time-efficient execution of large-scale parameter studies on GPUs. In order to efficiently accommodate the stream-processing paradigm of GPUs, our parallelization scheme exploits two orthogonal levels of parallelism: external parallelism among the inherently independent simulations of a parameter study, and internal parallelism among independent events within each individual simulation of a parameter study. Specifically, we design an event aggregation strategy based on external parallelism that generates workloads suitable for GPUs. In addition, we define a pipelined event execution mechanism based on internal parallelism to hide the transfer latencies between host and GPU memory. We analyze the performance characteristics of our parallelization scheme by means of a prototype implementation and show a 25-fold performance improvement over purely CPU-based execution.
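
The idea of aggregating events across independent simulations can be sketched on the host side as follows. This is an illustrative assumption about what such aggregation might look like, not the authors' implementation: `aggregate_next_events`, the `(timestamp, kind, payload)` event layout, and the sample queues are all hypothetical. Because the runs of a parameter study do not interact, the earliest event of each run can be taken without any cross-run causality check, and events of the same kind can be grouped into one batch that a single GPU kernel launch would process.

```python
import heapq
from collections import defaultdict

def aggregate_next_events(queues):
    """Pop the earliest event of every independent run and group them by kind.

    Each queue is a heap of (timestamp, kind, payload) tuples belonging to
    one simulation run of the parameter study.
    """
    batches = defaultdict(list)
    for run_id, queue in enumerate(queues):
        if queue:
            timestamp, kind, payload = heapq.heappop(queue)
            batches[kind].append((run_id, timestamp, payload))
    return dict(batches)

# Three hypothetical runs of the same model with different parameters.
queues = [
    [(1.0, "arrive", 10), (2.5, "depart", 10)],
    [(0.7, "arrive", 20)],
    [(1.2, "depart", 30)],
]
for q in queues:
    heapq.heapify(q)

batches = aggregate_next_events(queues)
print(batches)
```

Each value in `batches` holds events that run the same handler code on different data, which is the shape of workload a stream processor handles well.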

Citations: 26

Virtual Time Integration of Emulation and Parallel Simulation

Pub Date : 2012-07-15 DOI: 10.1109/PADS.2012.49

Dong Jin, Yuhao Zheng, Huaiyu Zhu, D. Nicol, Lenhard Winterrowd

A high-fidelity testbed for large-scale system analysis requires emulation to represent the execution of critical software, and simulation to model an extensive ensemble of background computation and communication. We leverage prior work showing that large numbers of virtual environments may be emulated on a single host and that the timestamped interactions between them can be mapped to virtual time, and we leverage existing work on the simulation of large-scale communication networks. The present paper brings these concepts together, marrying the scale emulation framework OpenVZ (modified earlier to operate in virtual time) with the scalable network simulator S3F. Our algorithmic contributions lie in the design and management of virtual time as it transitions from emulation to simulation and back. In particular, inescapable uncertainties in emulation behavior force us to explicitly set and reset timestamps so that neither the emulator nor the simulator has to deal with a packet arriving in its logical past. We provide analytic bounds and empirical evidence that the error introduced in resetting timestamps is small. Finally, we present a case study that uses this capability, involving a cyber-attack and the smart power grid communication infrastructure.
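
One simple way to keep a packet out of the receiver's logical past is to clamp its timestamp to the receiver's current virtual time and record the shift as the introduced error. The sketch below shows that generic rule only; the abstract does not give the exact rule used with OpenVZ and S3F, so `reset_timestamp` and its integer microsecond timestamps are assumptions made for illustration.

```python
def reset_timestamp(packet_ts_us, receiver_now_us):
    """Clamp a packet timestamp so it never precedes the receiver's clock.

    Returns the adjusted timestamp and the error introduced by the reset,
    both in (hypothetical) microseconds of virtual time.
    """
    adjusted = max(packet_ts_us, receiver_now_us)
    return adjusted, adjusted - packet_ts_us

# A packet stamped slightly before the simulator's current virtual time
# is moved forward; a packet stamped in the future is left untouched.
late = reset_timestamp(95, 100)
on_time = reset_timestamp(120, 100)
print(late, on_time)
```

Under this rule the error is bounded by how far the receiver's clock has run ahead of the sender at the hand-off, which is the kind of quantity the analytic bounds mentioned above would constrain.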

Citations: 28

Realizing Large-Scale Interactive Network Simulation via Model Splitting

Pub Date : 2012-07-15 DOI: 10.1109/PADS.2012.35

N. Vorst, Jason Liu

This paper presents the model splitting method for large-scale interactive network simulation, which addresses the separation of concerns between network researchers, who focus on developing complex network models and conducting large-scale network experiments, and simulator developers, who are concerned with developing efficient simulation engines to achieve the best performance on parallel platforms. Model splitting divides the system into an interactive model to support user interaction, and an execution model to facilitate parallel processing. We describe techniques to maintain consistency and real-time synchronization between the two models. We also provide solutions to reduce the memory complexity of large network models and to ensure data persistence and access efficiency for out-of-core processing.

Citations: 9

Characterizing and Understanding PDES Behavior on Tilera Architecture

Pub Date : 2012-07-15 DOI: 10.1109/PADS.2012.10

Deepak Jagtap, Ketan Bahulkar, D. Ponomarev, N. Abu-Ghazaleh

The emergence of many-core architectures with a shifting balance between computation and communication overhead can have a tremendous impact on the performance and scalability of fine-grained parallel applications such as PDES. It may also be necessary to rethink the design philosophy of key PDES subsystems that were traditionally focused on hiding long communication delays. In this paper, we perform an extensive evaluation of PDES on the Tile64Pro, a new 64-core chip from Tilera. For our studies, we use the recently developed multithreaded version of the popular ROSS simulator and show that the performance of this simulator (with many proposed optimizations) scales by a factor of 27x when it is executed on 56 cores of the Tilera chip for the PHOLD benchmark with 20% remote communication. We also evaluate the impact of the performance optimizations that we propose on both the conservative and optimistic versions of the simulator and analyze the sensitivity to various simulation parameters. Finally, we explore the issues of object placement and model partitioning on the Tilera architecture.

Citations: 22

A Radio-Driven Time Synchronization Protocol in Hybrid Simulation Systems

Pub Date : 2012-07-15 DOI: 10.1109/PADS.2012.5

Zhiyu Huang

A cyber-physical system (CPS) is a system featuring tight combination and coordination between its computational and physical resources. As a representative CPS, the Weather Monitoring and Train Traffic Control Simulation System (WMT2CS2) includes two subsystems: the wireless sensor network front end and the train traffic control simulation subsystem. The sensing front end collects real-time weather data (speed and direction of wind and rainfall, etc.) and connects to the simulation subsystem. The purpose of WMT2CS2 is to study the impact of weather on train traffic control, with the aim of enhancing the safety of high-speed rail (HSR) systems. However, the simulation system design faces new challenges such as accurate and fast time synchronization and fast data/command dissemination. In this paper, we propose an accurate and low-latency time synchronization protocol based on constructive interference (CI) for use in the sensing front end of hybrid simulation systems. A recently discovered physical-layer phenomenon, CI allows multiple nodes to transmit and forward an identical packet simultaneously. By leveraging CI, the proposed Radio-Driven Time Synchronization protocol (RDTS) can achieve microsecond-level synchronization accuracy and millisecond-level latency. Moreover, RDTS can directly use the timestamps from the sink node instead of those from intermediate nodes, which avoids the error caused by the unstable clocks of intermediate nodes.

Citations: 0

Dynamically Adjusting Core Frequencies to Accelerate Time Warp Simulations in Many-Core Processors

Pub Date : 2012-07-15 DOI: 10.1109/PADS.2012.15

Ryan Child, P. Wilsey

Time Warp synchronized parallel discrete event simulators are organized to operate asynchronously and aggressively without explicit synchronization between the concurrently executing simulators. In place of an explicit synchronization mechanism, the concurrent simulators maintain a common virtual clock model and implement a rollback/recovery mechanism to restore causal order when out-of-order events are detected. When the critical path of execution of the simulation is balanced across these parallel simulators, this can result in a highly effective, lightweight synchronization mechanism. However, imbalances in the workload across the parallel simulators can result in excessive rollback at some nodes and ultimately result in an overall slowing of the simulation as prematurely computed and transmitted events are processed. On small shared-memory multi-core systems, a lowest-timestamp-first scheduling policy can effectively balance the workload. However, on larger many-core chips, conventional load balancing and workload migration will once again become necessary. Fortunately, emerging many-core chips contain some interesting features that can potentially be exploited to improve the performance of parallel simulations. For example, the Intel Single-chip Cloud Computer (SCC) provides mechanisms that a running application can use to adjust the frequency/voltage of different regions (called islands) of the chip. These islands are network and processing core centric and thus, in a Time Warp simulation, one can increase the frequency of the cores executing threads on the critical path (those experiencing infrequent rollback) and decrease the frequency of the cores executing threads off the critical path (those experiencing excessive rollback). This paper investigates the run-time control and adjustment of core frequency in an AMD Phenom II X6 multi-core processor to explore and demonstrate that the dynamic run-time control of core frequency can sometimes improve the performance of a Time Warp synchronized parallel simulation.
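
The policy described above, raising the frequency of cores that rarely roll back and lowering it for cores that roll back excessively, can be sketched as a small decision function. This is a hedged illustration rather than the authors' controller: `choose_frequencies`, the frequency levels, and the two thresholds are hypothetical, and the sketch only computes a plan; it does not call any real frequency-scaling interface.

```python
def choose_frequencies(stats, levels_mhz=(800, 1600, 3200), low=0.05, high=0.30):
    """Map each core to a frequency based on its rollback ratio.

    stats maps a core id to (events_rolled_back, events_processed).
    Cores with few rollbacks are treated as being on the critical path
    and get the highest level; cores with many rollbacks get the lowest.
    """
    plan = {}
    for core, (rolled_back, processed) in stats.items():
        ratio = rolled_back / processed if processed else 0.0
        if ratio <= low:
            plan[core] = levels_mhz[-1]
        elif ratio >= high:
            plan[core] = levels_mhz[0]
        else:
            plan[core] = levels_mhz[len(levels_mhz) // 2]
    return plan

# Hypothetical per-core counters sampled during a simulation run.
stats = {0: (10, 1000), 1: (400, 1000), 2: (150, 1000)}
plan = choose_frequencies(stats)
print(plan)
```

Slowing a core that runs far ahead wastes less work on events that will be rolled back, while speeding up the lagging cores advances the critical path.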

Citations: 14

Hybrid Simulation of Packet-Level Networks and Functional-Level Routers

Pub Date : 2012-07-15 DOI: 10.1109/PADS.2012.22

Mirko Stoffers, G. Riley

We discuss our approach to federating dissimilar discrete event simulations, leveraging the strengths and design goals of both, to produce a packet-level detailed network model federated with a component-level detailed input-queuing router model. All existing network simulation tools that we are aware of incorporate a very simplistic model for the flow of packets through a router. The simplistic model simply responds to a packet receipt event by performing a route look-up and adding the packet to the output queue of the next-hop output interface. This is often simulated to take place in zero time, or with rudimentary probabilistic models of delay within a router. However, modern high-end routers are designed using a complex input-queuing methodology and a sophisticated scheduling approach to move packets through a crossbar switch from the input queue to the output queue. We used the popular ns-3 network simulator to create realistic packet-level models of network load, and the Manifold computer architecture simulator to create a realistic model of data movement through an input-queued router. We federated the two by means of two alternative approaches. First, two POSIX threads run within a single simulation process and use shared memory for both time synchronization and packet exchange. Second, we used the well-known MPI message passing library for the federation. Our results show that the detailed router models can in fact produce somewhat different packet delay and loss characteristics than the simplistic router models, at the expense of considerable computational complexity.

Citations: 1

A Bug Locating Method for the Debugging of Parallel Discrete Event Simulation

Pub Date : 2012-07-15 DOI: 10.1109/PADS.2012.1

Feng Zhu, Yiping Yao

Debugging is critically important for diagnosing program bugs. In optimistic Parallel Discrete Event Simulation (PDES), a bug may not be reproducible because events can be processed in different orders in different simulation runs, so locating bugs is a major challenge in debugging PDES programs. To solve this problem, we first propose a bug-reproducing method based on a checkpoint/restart mechanism, which avoids restarting the program from scratch when an error emerges. Moreover, our method can change the checkpoint interval dynamically to reduce the overhead of state saving. Then, building on bug reproduction, we propose a bug-locating method that searches for the events likely to have caused the bug by comparing the event processing sequences of a passing test case and the failing test case. This lets us focus on the events directly related to the bug, which reduces the time needed to locate it.
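
Comparing the event processing sequences of a passing and a failing run can be sketched, in its simplest form, as a search for the first position where the two sequences disagree. This is a simplified stand-in for the bug-locating method, not the authors' algorithm: `first_divergence`, the `(timestamp, lp, name)` event layout, and the two sample traces are assumptions made for illustration.

```python
def first_divergence(passing, failing):
    """Return the index of the first differing event, or None if identical.

    Each trace is the ordered list of events as they were processed,
    with events represented as (timestamp, lp, name) tuples.
    """
    for index, (p_event, f_event) in enumerate(zip(passing, failing)):
        if p_event != f_event:
            return index
    if len(passing) != len(failing):
        # One trace is a strict prefix of the other.
        return min(len(passing), len(failing))
    return None

# Hypothetical traces: the failing run processes a receive before its send.
passing = [(1, "A", "init"), (2, "B", "send"), (3, "A", "recv"), (4, "B", "ack")]
failing = [(1, "A", "init"), (3, "A", "recv"), (2, "B", "send"), (4, "B", "ack")]

index = first_divergence(passing, failing)
suspects = (passing[index], failing[index])
print(index, suspects)
```

The events at the divergence point are natural first candidates to inspect, which narrows the search compared with stepping through the whole run.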

Citations: 1

Fair and Efficient Dead Reckoning-Based Update Dissemination for Distributed Virtual Environments

2012 ACM/IEEE/SCS 26th Workshop on Principles of Advanced and Distributed Simulation

Pub Date : 2012-07-15 DOI: 10.1109/PADS.2012.18

Zengxiang Li, Xueyan Tang, Wentong Cai, S. Turner

Due to diverse network latencies, participants in a Distributed Virtual Environment (DVE) may observe different inconsistency levels of the simulated virtual world, which can seriously affect fair competition among them. In this paper, we investigate how to disseminate Dead Reckoning (DR)-based updates with the objectives of achieving fairness among participants and reducing inconsistency as much as possible. We first propose an optimized bandwidth allocation scheme for sending updates to overcome the drawbacks of uniform bandwidth allocation and the local-lag technique. Then, we integrate bandwidth allocation with an indirect relay method and develop algorithms to select relay routes for minimizing inconsistency under various bandwidth allocation schemes. Our proposed scheme and algorithms are evaluated using traces collected from a real car racing game as well as the real Internet latency data. The experimental results show that the proposed optimized bandwidth allocation scheme significantly reduces inconsistency while maintaining fairness among participants and that integrating the optimized scheme with our proposed relay setup algorithm further improves consistency.
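For readers unfamiliar with Dead Reckoning, the generic mechanism behind DR-based updates can be sketched as follows. This shows only the standard threshold-triggered update rule under a linear motion model; it does not reproduce the paper's bandwidth allocation scheme or relay selection algorithms. The `State` class, the function names, the one-dimensional position, and the sample trajectory are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class State:
    t: float    # time of the sample
    pos: float  # position (one-dimensional for simplicity)
    vel: float  # velocity at that time


def predict(last: State, t: float) -> float:
    """Extrapolate position at time t from the last state that was sent."""
    return last.pos + last.vel * (t - last.t)


def updates_to_send(trajectory: List[State], threshold: float) -> List[State]:
    """Return the states a sender would transmit under threshold-based DR.

    An update is sent only when the true position deviates from the
    receiver-side prediction by more than `threshold`.
    """
    last = trajectory[0]
    sent = [last]  # the initial state is always sent
    for s in trajectory[1:]:
        if abs(s.pos - predict(last, s.t)) > threshold:
            sent.append(s)
            last = s  # receivers now extrapolate from this state
    return sent


# Made-up trajectory: constant speed 1, then the entity accelerates to speed 2.
trajectory = [
    State(0.0, 0.0, 1.0),
    State(1.0, 1.0, 1.0),
    State(2.0, 2.0, 1.0),
    State(3.0, 4.0, 2.0),
    State(4.0, 6.0, 2.0),
]
sent = updates_to_send(trajectory, threshold=0.5)
print([s.t for s in sent])
```

In this sketch only the initial state and the state at the speed change are transmitted, which is why the delay with which such updates reach each participant shapes the inconsistency each participant observes.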

Citations: 4