Wireless Sensor Networks (WSNs) have been successfully applied in many application areas. Understanding wireless link performance is very helpful for both protocol designers and network managers. Loss tomography is a popular approach to inferring per-link loss ratios from end-to-end delivery ratios. Previous studies, however, usually target networks with static or slowly changing routing paths. In this work, we propose Dophy, a Dynamic loss tomography approach specifically designed for dynamic WSNs where each node dynamically selects its forwarding nodes towards the sink. The key idea of Dophy is based on the observation that most existing protocols use retransmissions to achieve a high data delivery ratio. Dophy employs arithmetic encoding to compactly encode the number of retransmissions along the paths. Dophy incorporates two mechanisms to optimize its performance. First, Dophy intelligently reduces the size of the symbol set by aggregating the retransmission counts, significantly reducing the encoding overhead. Second, Dophy periodically updates the probability model to minimize the overall transmission overhead. We implement Dophy on the TinyOS platform and evaluate its performance extensively using large-scale simulations. Results show that Dophy achieves both high encoding efficiency and high estimation accuracy. Comparative studies show that Dophy significantly outperforms traditional loss tomography approaches in terms of accuracy.
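To make the encoding idea concrete, here is a minimal sketch (not Dophy's actual encoder) of the two optimizations described above: raw retransmission counts are aggregated into a small, hypothetical four-symbol alphabet, and an adaptive probability model is used to estimate the ideal arithmetic-coding cost of one path report. The bucket boundaries, the smoothing, and the sample counts are all assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical aggregation: raw retransmission counts are mapped onto a
# four-symbol alphabet (0, 1, 2-3, 4+), shrinking the symbol set the
# arithmetic coder has to model.
def aggregate(retx_count):
    if retx_count == 0:
        return "0"
    if retx_count == 1:
        return "1"
    if retx_count <= 3:
        return "2-3"
    return "4+"

def ideal_code_length_bits(symbols, model):
    """Ideal arithmetic-coding cost: -log2(p) bits per symbol under the
    current probability model (a dict: symbol -> probability)."""
    return sum(-math.log2(model[s]) for s in symbols)

# Per-hop retransmission counts collected along one routing path (made up).
path_retx = [0, 0, 1, 0, 5, 0, 2, 0, 0, 1]
symbols = [aggregate(r) for r in path_retx]

# Periodically re-estimated probability model: empirical frequencies with
# add-one smoothing so every symbol keeps a non-zero probability.
alphabet = ["0", "1", "2-3", "4+"]
counts = Counter(symbols)
total = len(symbols) + len(alphabet)
model = {s: (counts[s] + 1) / total for s in alphabet}

print(f"~{ideal_code_length_bits(symbols, model):.1f} bits to report {len(symbols)} hops")
```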
{"title":"Fine-Grained Loss Tomography in Dynamic Sensor Networks","authors":"Chenhong Cao, Yi Gao, Wei Dong, Jiajun Bu","doi":"10.1109/ICPP.2015.87","DOIUrl":"https://doi.org/10.1109/ICPP.2015.87","url":null,"abstract":"Wireless Sensor Networks (WSNs) have been successfully applied in many application areas. Understanding the wireless link performance is very helpful for both protocol designers and network managers. Loss tomography is a popular approach to inferring the per-link loss ratios from end-to-end delivery ratios. Previous studies, however, are usually targeted for networks with static or slowly changing routing paths. In this work, we propose Dophy, a Dynamic loss tomography approach specifically designed for dynamic WSNs where each node dynamically selects the forwarding nodes towards the sink. The key idea of Dophy is based on an observation that most existing protocols use retransmissions to achieve high data delivery ratio. Dophy employs arithmetic encoding to compactly encode the number of retransmissions along the paths. Dophy incorporates two mechanisms to optimize its performance. First, Dophy intelligently reduces the size of symbol set by aggregating the number of retransmissions, reducing the encoding overhead significantly. Second, Dophy periodically updates the probability model to minimize the overall transmission overhead. We implement Dophy on the Tiny OS platform and evaluate its performance extensively using large-scale simulations. Results show that Dophy achieves both high encoding efficiency and high estimation accuracy. Comparative studies show that Dophy significantly outperforms traditional loss tomography approaches in terms of accuracy.","PeriodicalId":423007,"journal":{"name":"2015 44th International Conference on Parallel Processing","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115585113","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Transactional memory (TM) is emerging as an attractive synchronization mechanism for concurrent computing. In this work we aim to fill a relevant gap in the TM literature by investigating the issue of energy efficiency for one crucial building block of TM systems: contention management. Green-CM, the solution proposed in this paper, is the first contention management scheme explicitly designed to jointly optimize both performance and energy consumption. To this end, Green-CM combines three key mechanisms: i) it leverages a novel asymmetric design, which combines different back-off policies in order to take advantage of dynamic frequency and voltage scaling; ii) it introduces an energy-efficient design of the back-off mechanism, which combines spin-based and sleep-based implementations; iii) it makes extensive use of self-tuning mechanisms to pursue optimal efficiency across highly heterogeneous workloads. We evaluate Green-CM from both the energy and performance perspectives, and show that it can enhance efficiency by up to 2.35 times with respect to state-of-the-art contention managers, with an average gain of more than 60% when using 64 threads.
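The asymmetric spin/sleep back-off can be illustrated with a minimal sketch; the threshold, base delay, and randomization below are assumptions rather than Green-CM's tuned policy, and the real system additionally self-tunes these parameters.

```python
import random
import time

SPIN_THRESHOLD_US = 50   # assumed cut-off below which spinning beats sleeping

def backoff(attempt, base_us=4, cap_us=2000):
    """Exponential back-off that spins for short waits (low wake-up latency)
    and sleeps for long waits (frees the core so DVFS/idle states can save
    energy). All constants are illustrative."""
    wait_us = random.uniform(0, min(cap_us, base_us * (2 ** attempt)))
    if wait_us < SPIN_THRESHOLD_US:
        end = time.perf_counter() + wait_us / 1e6
        while time.perf_counter() < end:   # spin-based back-off
            pass
    else:
        time.sleep(wait_us / 1e6)          # sleep-based back-off

# Usage: after the n-th abort of a transaction, call backoff(n) before retrying.
for n in range(8):
    backoff(n)
```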
{"title":"Green-CM: Energy Efficient Contention Management for Transactional Memory","authors":"S. Issa, P. Romano, M. Brorsson","doi":"10.1109/ICPP.2015.64","DOIUrl":"https://doi.org/10.1109/ICPP.2015.64","url":null,"abstract":"Transactional memory (TM) is emerging as an attractive synchronization mechanism for concurrent computing. In this work we aim at filling a relevant gap in the TM literature, by investigating the issue of energy efficiency for one crucial building block of TM systems: contention management. Green-CM, the solution proposed in this paper, is the first contention management scheme explicitly designed to jointly optimize both performance and energy consumption. To this end Green-TM combines three key mechanisms: i) it leverages on a novel asymmetric design, which combines different back-off policies in order to take advantage of dynamic frequency and voltage scaling, ii) it introduces an energy efficient design of the back-off mechanism, which combines spin-based and sleep-based implementations, iii) it makes extensive use of self-tuning mechanisms to pursue optimal efficiency across highly heterogeneous workloads. We evaluate Green-CM from both the energy and performance perspectives, and show that it can achieve enhanced efficiency by up to 2.35 times with respect to state of the art contention managers, with an average gain of more than 60% when using 64 threads.","PeriodicalId":423007,"journal":{"name":"2015 44th International Conference on Parallel Processing","volume":"103 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124776388","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Due to synchronization overhead, it is challenging to scale the parallel simulation techniques of multi-core processors to larger systems. Although the use of a lax synchronization scheme reduces synchronization overhead and balances the load between synchronization points, it introduces timing errors. To improve the accuracy of lax synchronized simulations, we propose an error compensation technique, which leverages prediction methods to compensate for simulated-time deviations caused by timing errors. The rationale of our approach is that, in simulated multi-core processor systems, errors typically propagate via the delays of certain pivotal events that connect subsystem models across different hierarchies. By predicting delays based on the simulation results of the preceding pivotal events, our technique can eliminate errors from the predicted delays before they propagate to the models at higher hierarchies, thereby effectively improving the simulation accuracy. Since the predictions do not impose any constraints on synchronization, our approach largely maintains the scalability of lax synchronization schemes. Furthermore, the proposed mechanism is orthogonal to other parallel simulation techniques and can be used in conjunction with them. Experimental results show that error compensation improves the accuracy of lax synchronized simulations by 60.2% and achieves 98.2% accuracy when combined with an enhanced lax synchronization scheme.
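As a rough illustration of the compensation idea, the sketch below keeps a simple moving-average predictor of the delay of a pivotal event and charges the higher-level model with the prediction built from preceding pivotal events, so a timing error picked up between two lax synchronization points does not propagate upwards. The event type, window size, and numbers are invented, and the paper's actual prediction methods may differ.

```python
from collections import deque

class PivotalDelayPredictor:
    """Moving-average predictor for the delay of a 'pivotal event' (e.g. a
    request crossing from a core model into the interconnect model). The
    window size and the choice of a moving average are assumptions."""
    def __init__(self, window=32, initial_delay=100.0):
        self.history = deque(maxlen=window)
        self.initial_delay = initial_delay

    def observe(self, delay):
        self.history.append(delay)

    def predict(self):
        return (sum(self.history) / len(self.history)
                if self.history else self.initial_delay)

# Compensation step: the delay handed to the higher-level model is the
# prediction derived from preceding pivotal events, not the error-carrying
# value produced under lax synchronization.
predictor = PivotalDelayPredictor()
observed_delays = [100, 104, 98, 160, 101, 99]   # 160 carries a lax-sync error
for observed in observed_delays:
    charged = predictor.predict()
    predictor.observe(observed)
    print(f"observed {observed:4d} cycles -> charged {charged:6.1f} cycles")
```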
{"title":"Leveraging Error Compensation to Minimize Time Deviation in Parallel Multi-core Simulations","authors":"Xiaodong Zhu, Junmin Wu, Tao Li","doi":"10.1109/ICPP.2015.21","DOIUrl":"https://doi.org/10.1109/ICPP.2015.21","url":null,"abstract":"Due to synchronization overhead, it is challenging to apply the parallel simulation techniques of multi-core processors to a larger scale. Although the use of lax synchronization scheme reduces the synchronous overhead and balances the load between synchronous points, it introduces timing errors. To improve the accuracy of lax synchronized simulations, we propose an error compensation technique, which leverages prediction methods to compensate for simulated time deviations due to timing errors. The rationale of our approach is that, in the simulated multi-core processor systems the errors typically propagate via the delays of some pivotal events that connect subsystem models across different hierarchies. By predicting delays based on the simulation results of the preceding pivotal events, our techniques can eliminate errors from the predicted delays before they propagate to the models at higher hierarchies, thereby effectively improving the simulation accuracy. Since the predictions don't have any constraint on synchronizations, our approach largely maintains the scalability of lax synchronization schemes. Furthermore, our proposed mechanism is orthogonal to other parallel simulation techniques and can be used in conjunction with them. Experimental results show error compensation improves the accuracy of lax synchronized simulations by 60.2% and achieves 98.2% accuracy when combined with an enhanced lax synchronization.","PeriodicalId":423007,"journal":{"name":"2015 44th International Conference on Parallel Processing","volume":"52 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114587878","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
On-chip cache is often shared between processes that run concurrently on different cores of the same processor. Resource contention of this type causes performance degradation to the co-running processes. Contention-aware co-scheduling refers to the class of scheduling techniques that reduce this performance degradation. Most existing contention-aware co-schedulers only consider serial jobs. However, computing systems often run a mix of both parallel and serial jobs. In this paper, the problem of co-scheduling a mix of serial and parallel jobs is modelled as an Integer Programming (IP) problem. An existing IP solver can then be used to find the optimal co-scheduling solution that minimizes the performance degradation. However, we find that the IP-based method incurs high time overhead and can only be used to solve small-scale problems. Therefore, a graph-based method is also proposed in this paper to tackle the problem. We construct a co-scheduling graph to represent the co-scheduling problem and model the problem of finding the optimal co-scheduling solution as that of finding the shortest valid path in the co-scheduling graph. A heuristic A*-search algorithm (HA*) is then developed to find near-optimal solutions efficiently. Extensive experiments have been conducted to verify the effectiveness and efficiency of the proposed methods. The experimental results show that, compared with the IP-based method, HA* is able to find near-optimal solutions in much less time.
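The objective that the IP formulation minimizes can be shown on a toy instance. The sketch below exhaustively enumerates pairings of serial jobs onto dual-core processors and picks the one with the least total degradation; the degradation table is invented, and the paper's contribution is solving this problem (extended to parallel jobs) at scale via IP and the heuristic A* search rather than by enumeration.

```python
# Hypothetical pairwise degradation table: degradation[a][b] is the slowdown
# job `a` suffers when co-running with job `b` behind the same shared cache.
degradation = {
    "A": {"B": 0.10, "C": 0.30, "D": 0.05},
    "B": {"A": 0.20, "C": 0.15, "D": 0.25},
    "C": {"A": 0.35, "B": 0.10, "D": 0.40},
    "D": {"A": 0.05, "B": 0.30, "C": 0.20},
}

def pairings(jobs):
    """Enumerate every way of splitting the jobs into pairs (two cores
    sharing one cache)."""
    if not jobs:
        yield []
        return
    first, rest = jobs[0], jobs[1:]
    for partner in rest:
        remaining = [j for j in rest if j != partner]
        for tail in pairings(remaining):
            yield [(first, partner)] + tail

def total_degradation(pairing):
    return sum(degradation[a][b] + degradation[b][a] for a, b in pairing)

best = min(pairings(sorted(degradation)), key=total_degradation)
print(best, round(total_degradation(best), 2))
```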
{"title":"Modelling and Developing Co-scheduling Strategies on Multicore Processors","authors":"Huanzhou Zhu, Ligang He, Bo Gao, Kenli Li, Jianhua Sun, Hao Chen, Kuan-Ching Li","doi":"10.1109/ICPP.2015.31","DOIUrl":"https://doi.org/10.1109/ICPP.2015.31","url":null,"abstract":"On-chip cache is often shared between processes that run concurrently on different cores of the same processor. Resource contention of this type causes performance degradation to the co-running processes. Contention-aware co-scheduling refers to the class of scheduling techniques to reduce the performance degradation. Most existing contention-aware co-schedulers only consider serial jobs. However, there often exist both parallel and serial jobs in computing systems. In this paper, the problem of co-scheduling a mix of serial and parallel jobs is modelled as an Integer Programming (IP) problem. Then the existing IP solver can be used to find the optimal co-scheduling solution that minimizes the performance degradation. However, we find that the IP-based method incurs high time overhead and can only be used to solve small-scale problems. Therefore, a graph-based method is also proposed in this paper to tackle this problem. We construct a co-scheduling graph to represent the co-scheduling problem and model the problem of finding the optimal co-scheduling solution as the problem of finding the shortest valid path in the co-scheduling graph. A heuristic A*-search algorithm (HA*) is then developed to find the near-optimal solutions efficiently. The extensive experiments have been conducted to verify the effectiveness and efficiency of the proposed methods. The experimental results show that compared with the IP-based method, HA* is able to find the near-optimal solutions with much less time.","PeriodicalId":423007,"journal":{"name":"2015 44th International Conference on Parallel Processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129828624","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The OpenACC standard has been developed to simplify parallel programming of heterogeneous systems. Based on a set of high-level compiler directives, it allows application developers to offload code regions from a host CPU to an accelerator without the need for low-level programming with CUDA or OpenCL. Details are implicit in the programming model and managed by OpenACC API-enabled compilers and runtimes. However, the application developer can still explicitly specify several performance-related details for the execution. To tune an OpenACC program and efficiently utilize the available hardware resources, sophisticated performance analysis tools are required. In this paper we present a framework for detailed analysis of OpenACC applications. We describe new analysis capabilities introduced with an OpenACC tools interface and depict the integration of performance analysis for low-level programming models. As a proof of concept, we implemented the approach in the measurement infrastructure Score-P and the trace browser Vampir. This provides the program developer with a clearer understanding of the dynamic runtime behavior of the application and enables systematic identification of potential bottlenecks.
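The kind of analysis such a framework performs can be sketched generically: pair enter/leave events recorded for offloaded regions and aggregate inclusive time per region to expose hot spots. The event tuples and region names below are invented; the real OpenACC tools interface delivers far richer events (kernel launches, data transfers, waits) to Score-P.

```python
from collections import defaultdict

# Hypothetical event stream as a profiling interface might deliver it:
# (timestamp_us, 'enter'|'leave', region). Purely to show the aggregation
# step a trace analyzer performs.
events = [
    (0,    "enter", "compute region @ laplace.c:42"),
    (1800, "leave", "compute region @ laplace.c:42"),
    (1800, "enter", "update host @ laplace.c:60"),
    (2500, "leave", "update host @ laplace.c:60"),
    (2500, "enter", "compute region @ laplace.c:42"),
    (4200, "leave", "compute region @ laplace.c:42"),
]

open_since = {}
inclusive_us = defaultdict(int)
for ts, kind, region in events:
    if kind == "enter":
        open_since[region] = ts
    else:
        inclusive_us[region] += ts - open_since.pop(region)

# Print regions sorted by inclusive time, largest first.
for region, us in sorted(inclusive_us.items(), key=lambda kv: -kv[1]):
    print(f"{region:35s} {us:6d} us")
```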
{"title":"Open ACC Programs Examined: A Performance Analysis Approach","authors":"R. Dietrich, G. Juckeland, M. Wolfe","doi":"10.1109/ICPP.2015.40","DOIUrl":"https://doi.org/10.1109/ICPP.2015.40","url":null,"abstract":"The Open ACC standard has been developed to simplify parallel programming of heterogeneous systems. Based on a set of high-level compiler directives it allows application developers to offload code regions from a host CPU to an accelerator without the need for low-level programming with CUDA or Open CL. Details are implicit in the programming model and managed by Open ACC API-enabled compilers and runtimes. However, it is still possible for the application developer to explicitly specify several performance-related details for the execution. To tune an Open ACC program and efficiently utilize available hardware resources, sophisticated performance analysis tools are required. In this paper we present a framework for detailed analysis of Open ACC applications. We describe new analysis capabilities introduced with an Open ACC tools interface and depict the integration of performance analysis for low-level programming models. As proof of concept we implemented the concept into the measurement infrastructure Score-P and the trace browser Vampir. This provides the program developer with a clearer understanding of the dynamic runtime behavior of the application and for systematic identification of potential bottlenecks.","PeriodicalId":423007,"journal":{"name":"2015 44th International Conference on Parallel Processing","volume":"86 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124723699","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Virtualization is a key technology for cloud data centers to implement Infrastructure as a Service (IaaS) and to provide flexible and cost-effective resource sharing. It introduces an additional layer of abstraction that produces resource utilization overhead. Disregarding this overhead may seriously reduce the monitoring accuracy of cloud providers and may degrade VM performance. However, no previous work comprehensively investigates the virtualization overhead. In this paper, we comprehensively measure and study the relationship between the resource utilizations of virtual machines (VMs) and the resource utilizations of the device driver domain, the hypervisor and the physical machine (PM) under diverse workloads and scenarios in the Xen virtualization environment. We examine data from a real-world virtualized deployment to characterize VM workloads and assess their impact on the resource utilizations in the system. We show that the impact of virtualization overhead depends on the workloads, and that virtualization overhead is an important factor to consider in cloud resource provisioning. Based on the measurements, we build a regression model to estimate the resource utilization overhead of the PM resulting from providing virtualized resources to the VMs and from managing multiple VMs. Finally, our trace-driven real-world experimental results show the high accuracy of our model in predicting PM resource consumption in the cloud data center, and the importance of considering the virtualization overhead in cloud resource provisioning.
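A minimal sketch of the kind of regression model described here: fit physical-machine CPU utilization as a linear function of aggregate VM CPU utilization, aggregate VM I/O rate, and the number of VMs. The feature choice and every data point below are fabricated purely to show the shape of such a model and are not the paper's measurements.

```python
import numpy as np

# Each row: [sum of VM CPU utilization, sum of VM I/O rate (MB/s), number of VMs].
# Target: measured PM CPU utilization, including dom0/hypervisor work.
# All numbers are invented for illustration.
X = np.array([
    [0.20,  5.0, 1],
    [0.40, 10.0, 2],
    [0.35, 40.0, 2],
    [0.60, 20.0, 3],
    [0.80, 60.0, 4],
])
y = np.array([0.26, 0.52, 0.55, 0.78, 1.05])

# Least-squares fit of pm_util ~ b0 + b1*vm_cpu + b2*vm_io + b3*num_vms.
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
b0, b1, b2, b3 = coef

def predict_pm_util(vm_cpu, vm_io, num_vms):
    """Estimate PM CPU utilization, i.e. VM demand plus virtualization overhead."""
    return b0 + b1 * vm_cpu + b2 * vm_io + b3 * num_vms

print(round(predict_pm_util(vm_cpu=0.5, vm_io=30.0, num_vms=3), 3))
```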
{"title":"Profiling and Understanding Virtualization Overhead in Cloud","authors":"Liuhua Chen, Shilkumar Patel, Haiying Shen, Zhongyi Zhou","doi":"10.1109/ICPP.2015.12","DOIUrl":"https://doi.org/10.1109/ICPP.2015.12","url":null,"abstract":"Virtualization is a key technology for cloud data centers to implement infrastructure as a service (IaaS) and to provide flexible and cost-effective resource sharing. It introduces an additional layer of abstraction that produces resource utilization overhead. Disregarding this overhead may cause serious reduction of the monitoring accuracy of the cloud providers and may cause degradation of the VM performance. However, there is no previous work that comprehensively investigates the virtualization overhead. In this paper, we comprehensively measure and study the relationship between the resource utilizations of virtual machines (VMs) and the resource utilizations of the device driver domain, hypervisor and the physical machine (PM) with diverse workloads and scenarios in the Xen virtualization environment. We examine data from the real-world virtualized deployment to characterize VM workloads and assess their impact on the resource utilizations in the system. We show that the impact of virtualization overhead depends on the workloads, and that virtualization overhead is an important factor to consider in cloud resource provisioning. Based on the measurements, we build a regression model to estimate the resource utilization overhead of the PM resulting from providing virtualized resource to the VMs and from managing multiple VMs. Finally, our trace-driven real-world experimental results show the high accuracy of our model in predicting PM resource consumptions in the cloud datacenter, and the importance of considering the virtualization overhead in cloud resource provisioning.","PeriodicalId":423007,"journal":{"name":"2015 44th International Conference on Parallel Processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130119439","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
How to effectively distribute and share increasingly large volumes of data in large-scale network applications is a key challenge for the Internet infrastructure. Although NDN, a promising future Internet architecture that takes a data-oriented transfer approach, aims to address such needs better than IP, it still faces problems such as redundant data transmission and inefficient in-network cache utilization. This paper combines network coding techniques with NDN to improve network throughput and efficiency. The merit of our design is that it avoids duplicate and unproductive data delivery while transferring disjoint data segments along multiple paths, with no excessive modification to NDN fundamentals. To quantify the performance benefits of applying network coding in NDN, we integrate network coding into an NDN streaming media system implemented in the ndnSIM simulator. Using BRITE-generated network topologies in our simulation, the experimental results clearly demonstrate that incorporating network coding in NDN can significantly improve performance, reliability and QoS. More importantly, our approach is well suited to delivering growing Big Data applications, including high-performance and high-density video streaming services.
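The abstract does not spell out the coding scheme, so the sketch below illustrates the general mechanism with random linear network coding over GF(2): each coded packet is an XOR of a random subset of content segments, and a consumer can rebuild the content from any set of linearly independent coded packets, regardless of which paths or caches supplied them. Segment sizes, the field choice, and packet counts are assumptions.

```python
import os
import random

SEG_LEN = 8  # bytes per segment (tiny, for illustration)

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def encode(segments):
    """One coded packet: a random GF(2) combination (XOR) of the original
    segments, tagged with its coefficient vector."""
    coeffs = [random.randint(0, 1) for _ in segments]
    if not any(coeffs):
        coeffs[random.randrange(len(coeffs))] = 1
    payload = bytes(SEG_LEN)
    for c, seg in zip(coeffs, segments):
        if c:
            payload = xor(payload, seg)
    return coeffs, payload

def try_decode(packets, k):
    """Gauss-Jordan elimination over GF(2); returns the k original segments,
    or None if the packets are not yet linearly independent enough."""
    rows = [(coeffs[:], payload) for coeffs, payload in packets]
    for col in range(k):
        pivot = next((i for i in range(col, len(rows)) if rows[i][0][col]), None)
        if pivot is None:
            return None
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for i in range(len(rows)):
            if i != col and rows[i][0][col]:
                rows[i] = ([a ^ b for a, b in zip(rows[i][0], rows[col][0])],
                           xor(rows[i][1], rows[col][1]))
    return [rows[i][1] for i in range(k)]

segments = [os.urandom(SEG_LEN) for _ in range(4)]
decoded = None
while decoded is None:                       # any 4 independent packets suffice
    packets = [encode(segments) for _ in range(6)]
    decoded = try_decode(packets, k=4)
print(decoded == segments)
```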
{"title":"Network Coding for Effective NDN Content Delivery: Models, Experiments, and Applications","authors":"Kai Lei, Fangxing Zhu, Cheng Peng, Kuai Xu","doi":"10.1109/ICPP.2015.19","DOIUrl":"https://doi.org/10.1109/ICPP.2015.19","url":null,"abstract":"How to effectively distribute and share increasingly large volumes of data in large-scale network applications is a key challenge for Internet infrastructure. Although NDN, a promising new future internet architecture which takes data oriented transfer approaches, aims to better solve such needs than IP, it still faces problems like data redundancy transmission and inefficient in-network cache utilization. This paper combines network coding techniques to NDN to improve network throughput and efficiency. The merit of our design is that it is able to avoid duplicate and unproductive data delivery while transferring disjoint data segments along multiple paths and with no excess modification to NDN fundamentals. To quantify performance benefits of applying network coding in NDN, we integrate network coding into an NDN streaming media system implemented in the ndn SIM simulator. Basing on BRITE generated network topologies in our simulation, the experimental results clearly and fairly demonstrate that considering network coding in NDN can significantly improve the performance, reliability and QoS. More importantly, our approach is capable of and well fit for delivering growing Big Data applications including high-performance and high-density video streaming services.","PeriodicalId":423007,"journal":{"name":"2015 44th International Conference on Parallel Processing","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130703875","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mobile videos contain rich information which could be utilized for various applications, such as criminal investigation and scene reconstruction. Today's crowd-sourced mobile video retrieval systems are built on video content comparison, and their wide adoption has been hindered by the onerous computation of computer vision (CV) algorithms and the redundant network traffic of video transmission. In this work, we propose to leverage the Field of View (FoV) as a content-free descriptor to measure video similarity with little accuracy loss. Based on FoV, our system can filter out unmatched videos before any content analysis and video transmission, which dramatically cuts down the computation and communication cost of crowd-sourced mobile video retrieval. Moreover, we design a video segmentation algorithm and an R-tree based indexing structure to further reduce the network traffic for mobile clients and improve the efficiency of the cloud server. We implement a prototype system and evaluate it from different aspects. The results show that FoV descriptors are much smaller and significantly faster to extract and match compared to content descriptors, while the FoV-based similarity measurement achieves search accuracy comparable to the content-based method. Our evaluation also shows that the proposed retrieval scheme scales with data size and can respond in less than 100 ms when the data set has tens of thousands of video segments, while the network traffic between the client and the server is negligible.
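A minimal sketch of a content-free FoV similarity, assuming an FoV descriptor of the form (x, y, heading, view angle, visible distance) in a local metric coordinate system: each FoV is rasterized into grid cells and similarity is the Jaccard overlap of the covered cells. The descriptor layout, grid resolution, and sampling density are assumptions, not the paper's exact definition.

```python
import math

def covered_cells(x, y, heading_deg, view_angle_deg, radius, step=1.0):
    """Approximate the ground area covered by an FoV sector as a set of grid
    cells. (x, y) is the camera position in metres; heading is the compass
    direction of the optical axis (0 degrees = +y)."""
    cells = set()
    half = math.radians(view_angle_deg) / 2
    heading = math.radians(heading_deg)
    for i in range(int(radius / step) + 1):
        r = i * step
        for j in range(-20, 21):               # 41 rays across the sector
            theta = heading + half * j / 20
            cx = x + r * math.sin(theta)
            cy = y + r * math.cos(theta)
            cells.add((round(cx / step), round(cy / step)))
    return cells

def fov_similarity(f1, f2):
    """Jaccard overlap of two FoV descriptors (x, y, heading, angle, radius)."""
    a, b = covered_cells(*f1), covered_cells(*f2)
    return len(a & b) / len(a | b)

# Two cameras a few metres apart looking at roughly the same scene.
print(round(fov_similarity((0, 0, 45, 60, 30), (5, 0, 40, 60, 30)), 2))
```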
{"title":"Scan without a Glance: Towards Content-Free Crowd-Sourced Mobile Video Retrieval System","authors":"Cihang Liu, Lan Zhang, Kebin Liu, Yunhao Liu","doi":"10.1109/ICPP.2015.34","DOIUrl":"https://doi.org/10.1109/ICPP.2015.34","url":null,"abstract":"Mobile videos contain rich information which could be utilized for various applications, like criminal investigation and scene reconstruction. Today's crowd-sourced mobile video retrieval systems are built on video content comparison, and their wide adoption has been hindered by onerous computation of CV algorithms and redundant networking traffic of the video transmission. In this work, we propose to leverage Field of View(FoV) as a content-free descriptor to measure video similarity with little accuracy loss. Based on FoV, our system can filter out unmatched videos before any content analysis and video transmission, which dramatically cuts down the computation and communication cost for crowd-sourced mobile video retrieval. Moreover, we design a video segmentation algorithm and an R-Tree based indexing structure to further reduce the networking traffic for mobile clients and potentiate the efficiency for the cloud server. We implement a prototype system and evaluate it from different aspects. The results show that FoV descriptors are much smaller and significantly faster to extract and match compared to content descriptors, while the FoV based similarity measurement achieves comparable search accuracy with the content-based method. Our evaluation also shows that the proposed retrieval scheme is scalable with data size and can response in less than 100ms when the data set has tens of thousands of video segments, and the networking traffic between the client and the server is negligible.","PeriodicalId":423007,"journal":{"name":"2015 44th International Conference on Parallel Processing","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116996353","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cloud computing is often adopted to process big data for genome analysis due to its elasticity and pay-as-you-go features. In this paper, we present SCAN, a smart application platform to facilitate the parallelization of big genome analysis in clouds. With a knowledge base and an intelligent application scheduler, SCAN enables a better understanding of bio-applications' characteristics and helps orchestrate huge, heterogeneous tasks efficiently and cost-effectively. We conducted a simulation study and found that the SCAN platform is able to improve the performance of genome analysis and reduce its cost in a wide variety of circumstances.
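As a rough sketch of how a knowledge base and scheduler could interact (all tool profiles, instance types, and prices below are invented; SCAN's actual knowledge base and scheduling policy are not detailed in the abstract), a scheduler can match each analysis step's learned resource profile against a catalog of cloud instance types and pick the cheapest feasible one.

```python
# Hypothetical knowledge base: resource profiles of genome-analysis steps,
# learned from previous runs (all values invented for illustration).
KNOWLEDGE_BASE = {
    "align":   {"cores": 8,  "mem_gb": 16, "hours": 3.0},
    "sort":    {"cores": 4,  "mem_gb": 32, "hours": 1.0},
    "variant": {"cores": 16, "mem_gb": 64, "hours": 5.0},
}

# Hypothetical cloud instance catalog: (cores, memory in GB, price per hour).
INSTANCES = {
    "small":  (4,  16, 0.20),
    "medium": (8,  32, 0.40),
    "large":  (16, 64, 0.80),
}

def place(task):
    """Pick the cheapest instance type whose capacity covers the task's
    profile from the knowledge base, and estimate its cost."""
    need = KNOWLEDGE_BASE[task]
    feasible = [(price * need["hours"], name)
                for name, (cores, mem, price) in INSTANCES.items()
                if cores >= need["cores"] and mem >= need["mem_gb"]]
    cost, name = min(feasible)
    return name, cost

for task in KNOWLEDGE_BASE:
    print(task, *place(task))
```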
{"title":"SCAN: A Smart Application Platform for Empowering Parallelizations of Big Genomic Data Analysis in Clouds","authors":"W. Xing, W. Jie, Crispin J. Miller","doi":"10.1109/ICPP.2015.38","DOIUrl":"https://doi.org/10.1109/ICPP.2015.38","url":null,"abstract":"Cloud computing is often adopted to process big data for genome analysis due to its elasticity and pay-as-you-go features. In this paper, we present SCAN, a smart application platform to facilitate parallelization of big genome analysis in clouds. With a knowledge base and an intelligent application scheduler, the SCAN enables better understanding of bio-applications' characteristics, and helps to orchestrate huge, heterogeneous tasks efficiently and cost-effectively. We conducted a simulation study and found that the SCAN platform is able to improve the performance of genome analysis and reduce its cost in a wide variety of circumstances.","PeriodicalId":423007,"journal":{"name":"2015 44th International Conference on Parallel Processing","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123640320","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Warmup is a crucial issue in sampled microarchitectural simulation: it avoids performance bias by constructing accurate states for microarchitectural structures before each sampling unit. Not until very recently have researchers proposed Time-Based Sampling (TBS) for the sampled simulation of multi-threaded applications. However, warmup in TBS is challenging and complicated, because (i) full functional warmup in TBS causes very high overhead, limiting overall simulation speed, (ii) traditional adaptive functional warmup for sampling single-threaded applications cannot be readily applied to TBS, and (iii) checkpointing is inflexible (even invalid) due to the huge storage requirements and the variations across different runs of multi-threaded applications. In this work, we propose Shorter On-Line (SOL) warmup, which employs a two-stage strategy, using 'prime' warmup in the first stage and an extended 'No-State-Loss' (NSL) method in the second stage. SOL is a single-pass, on-line warmup technique that addresses the warmup challenges posed by TBS in parallel simulators. SOL is highly accurate and efficient, providing a good trade-off between simulation accuracy and speed, and is easily deployed to different TBS techniques. For the PARSEC benchmarks on a simulated 8-core system, two state-of-the-art TBS techniques with SOL warmup provide a 7.2× and 37× simulation speedup over detailed simulation, respectively, compared to 3.1× and 4.5× under full warmup. SOL sacrifices only 0.3% in absolute execution time prediction accuracy on average.
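The effect of warming microarchitectural state before a sampling unit can be illustrated with a toy last-level-cache model: instead of functionally replaying everything between samples (full warmup), only a short window just before the sampling unit is replayed to reconstruct cache contents. This mirrors the shape of a shortened, on-line warmup but is not SOL's actual two-stage prime/NSL mechanism; the trace, cache size, and window length are invented.

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, lines=256):
        self.lines, self.tags = lines, OrderedDict()
        self.hits = self.misses = 0

    def access(self, addr, count_stats=True):
        tag = addr // 64                       # 64-byte cache lines
        hit = tag in self.tags
        if hit:
            self.tags.move_to_end(tag)
        else:
            self.tags[tag] = True
            if len(self.tags) > self.lines:
                self.tags.popitem(last=False)  # evict LRU line
        if count_stats:
            self.hits += hit
            self.misses += not hit

# A synthetic address trace and one sampling unit within it.
trace = [(i * 64) % 8192 for i in range(200000)]
sample_start, sample_len, warm_len = 150000, 5000, 20000

cache = LRUCache()
# Shortened functional warmup: replay only a window just before the sampling
# unit (rather than the whole inter-sample gap) to reconstruct cache state.
for addr in trace[sample_start - warm_len:sample_start]:
    cache.access(addr, count_stats=False)
# Detailed measurement inside the sampling unit.
for addr in trace[sample_start:sample_start + sample_len]:
    cache.access(addr)
print(f"miss ratio in sampling unit: {cache.misses / sample_len:.3f}")
```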
{"title":"Shorter On-Line Warmup for Sampled Simulation of Multi-threaded Applications","authors":"Chuntao Jiang, Zhibin Yu, Hai Jin, Xiaofei Liao, L. Eeckhout, Yonggang Zeng, Chengzhong Xu","doi":"10.1109/ICPP.2015.44","DOIUrl":"https://doi.org/10.1109/ICPP.2015.44","url":null,"abstract":"Warm up is a crucial issue in sampled micro architectural simulation to avoid performance bias by constructing accurate states for micro-architectural structures before each sampling unit. Not until very recently have researchers proposed Time-Based Sampling (TBS) for the sampled simulation of multi-threaded applications. However, warm up in TBS is challenging and complicated, because (i) full functional warm up in TBS causes very high overhead, limiting overall simulation speed, (ii) traditional adaptive functional warm up for sampling single-threaded applications cannot be readily applied to TBS, and (iii) check pointing is inflexible (even invalid) due to the huge storage requirements and the variations across different runs for multi-threaded applications. In this work, we propose Shorter On-Line (SOL) warm up, which employs a two-stage strategy, using 'prime' warm up in the first stage, and an extended 'No-State-Loss (NSL)' method in the second stage. SOL is a single-pass, on-line warm up technique that addresses the warm up challenges posed in TBS in parallel simulators. SOL is highly accurate and efficient, providing a good trade-off between simulation accuracy and speed, and is easily deployed to different TBS techniques. For the PARSEC benchmarks on a simulated 8-core system, two state-of-the-art TBS techniques with SOL warm up provide a 7.2× and 37× simulation speedup over detailed simulation, respectively, compared to 3.1× and 4.5× under full warm up. SOL sacrifices only 0.3% in absolute execution time prediction accuracy on average.","PeriodicalId":423007,"journal":{"name":"2015 44th International Conference on Parallel Processing","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132164830","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}