Wireless Sensor Networks (WSNs) have been successfully applied in many application areas. Understanding wireless link performance is very helpful for both protocol designers and network managers. Loss tomography is a popular approach to inferring per-link loss ratios from end-to-end delivery ratios. Previous studies, however, usually target networks with static or slowly changing routing paths. In this work, we propose Dophy, a Dynamic loss tomography approach specifically designed for dynamic WSNs where each node dynamically selects its forwarding nodes towards the sink. The key idea of Dophy is based on the observation that most existing protocols use retransmissions to achieve a high data delivery ratio. Dophy employs arithmetic encoding to compactly encode the number of retransmissions along the paths. Dophy incorporates two mechanisms to optimize its performance. First, Dophy intelligently reduces the size of the symbol set by aggregating the retransmission counts, significantly reducing the encoding overhead. Second, Dophy periodically updates the probability model to minimize the overall transmission overhead. We implement Dophy on the TinyOS platform and evaluate its performance extensively using large-scale simulations. Results show that Dophy achieves both high encoding efficiency and high estimation accuracy. Comparative studies show that Dophy significantly outperforms traditional loss tomography approaches in terms of accuracy.
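To make the encoding idea concrete, here is a minimal sketch (not Dophy's actual encoder) of the two optimizations described above: raw retransmission counts are aggregated into a small, hypothetical four-symbol alphabet, and an adaptive probability model is used to estimate the ideal arithmetic-coding cost of one path report. The bucket boundaries, the smoothing, and the sample counts are all assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical aggregation: raw retransmission counts are mapped onto a
# four-symbol alphabet (0, 1, 2-3, 4+), shrinking the symbol set the
# arithmetic coder has to model.
def aggregate(retx_count):
    if retx_count == 0:
        return "0"
    if retx_count == 1:
        return "1"
    if retx_count <= 3:
        return "2-3"
    return "4+"

def ideal_code_length_bits(symbols, model):
    """Ideal arithmetic-coding cost: -log2(p) bits per symbol under the
    current probability model (a dict: symbol -> probability)."""
    return sum(-math.log2(model[s]) for s in symbols)

# Per-hop retransmission counts collected along one routing path (made up).
path_retx = [0, 0, 1, 0, 5, 0, 2, 0, 0, 1]
symbols = [aggregate(r) for r in path_retx]

# Periodically re-estimated probability model: empirical frequencies with
# add-one smoothing so every symbol keeps a non-zero probability.
alphabet = ["0", "1", "2-3", "4+"]
counts = Counter(symbols)
total = len(symbols) + len(alphabet)
model = {s: (counts[s] + 1) / total for s in alphabet}

print(f"~{ideal_code_length_bits(symbols, model):.1f} bits to report {len(symbols)} hops")
```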
{"title":"Fine-Grained Loss Tomography in Dynamic Sensor Networks","authors":"Chenhong Cao, Yi Gao, Wei Dong, Jiajun Bu","doi":"10.1109/ICPP.2015.87","DOIUrl":"https://doi.org/10.1109/ICPP.2015.87","url":null,"abstract":"Wireless Sensor Networks (WSNs) have been successfully applied in many application areas. Understanding the wireless link performance is very helpful for both protocol designers and network managers. Loss tomography is a popular approach to inferring the per-link loss ratios from end-to-end delivery ratios. Previous studies, however, are usually targeted for networks with static or slowly changing routing paths. In this work, we propose Dophy, a Dynamic loss tomography approach specifically designed for dynamic WSNs where each node dynamically selects the forwarding nodes towards the sink. The key idea of Dophy is based on an observation that most existing protocols use retransmissions to achieve high data delivery ratio. Dophy employs arithmetic encoding to compactly encode the number of retransmissions along the paths. Dophy incorporates two mechanisms to optimize its performance. First, Dophy intelligently reduces the size of symbol set by aggregating the number of retransmissions, reducing the encoding overhead significantly. Second, Dophy periodically updates the probability model to minimize the overall transmission overhead. We implement Dophy on the Tiny OS platform and evaluate its performance extensively using large-scale simulations. Results show that Dophy achieves both high encoding efficiency and high estimation accuracy. Comparative studies show that Dophy significantly outperforms traditional loss tomography approaches in terms of accuracy.","PeriodicalId":423007,"journal":{"name":"2015 44th International Conference on Parallel Processing","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115585113","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Transactional memory (TM) is emerging as an attractive synchronization mechanism for concurrent computing. In this work we aim to fill a relevant gap in the TM literature by investigating the issue of energy efficiency for one crucial building block of TM systems: contention management. Green-CM, the solution proposed in this paper, is the first contention management scheme explicitly designed to jointly optimize both performance and energy consumption. To this end, Green-CM combines three key mechanisms: i) it leverages a novel asymmetric design, which combines different back-off policies in order to take advantage of dynamic frequency and voltage scaling; ii) it introduces an energy-efficient design of the back-off mechanism, which combines spin-based and sleep-based implementations; iii) it makes extensive use of self-tuning mechanisms to pursue optimal efficiency across highly heterogeneous workloads. We evaluate Green-CM from both the energy and performance perspectives, and show that it can enhance efficiency by up to 2.35 times with respect to state-of-the-art contention managers, with an average gain of more than 60% when using 64 threads.
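The asymmetric spin/sleep back-off can be illustrated with a minimal sketch; the threshold, base delay, and randomization below are assumptions rather than Green-CM's tuned policy, and the real system additionally self-tunes these parameters.

```python
import random
import time

SPIN_THRESHOLD_US = 50   # assumed cut-off below which spinning beats sleeping

def backoff(attempt, base_us=4, cap_us=2000):
    """Exponential back-off that spins for short waits (low wake-up latency)
    and sleeps for long waits (frees the core so DVFS/idle states can save
    energy). All constants are illustrative."""
    wait_us = random.uniform(0, min(cap_us, base_us * (2 ** attempt)))
    if wait_us < SPIN_THRESHOLD_US:
        end = time.perf_counter() + wait_us / 1e6
        while time.perf_counter() < end:   # spin-based back-off
            pass
    else:
        time.sleep(wait_us / 1e6)          # sleep-based back-off

# Usage: after the n-th abort of a transaction, call backoff(n) before retrying.
for n in range(8):
    backoff(n)
```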
{"title":"Green-CM: Energy Efficient Contention Management for Transactional Memory","authors":"S. Issa, P. Romano, M. Brorsson","doi":"10.1109/ICPP.2015.64","DOIUrl":"https://doi.org/10.1109/ICPP.2015.64","url":null,"abstract":"Transactional memory (TM) is emerging as an attractive synchronization mechanism for concurrent computing. In this work we aim at filling a relevant gap in the TM literature, by investigating the issue of energy efficiency for one crucial building block of TM systems: contention management. Green-CM, the solution proposed in this paper, is the first contention management scheme explicitly designed to jointly optimize both performance and energy consumption. To this end Green-TM combines three key mechanisms: i) it leverages on a novel asymmetric design, which combines different back-off policies in order to take advantage of dynamic frequency and voltage scaling, ii) it introduces an energy efficient design of the back-off mechanism, which combines spin-based and sleep-based implementations, iii) it makes extensive use of self-tuning mechanisms to pursue optimal efficiency across highly heterogeneous workloads. We evaluate Green-CM from both the energy and performance perspectives, and show that it can achieve enhanced efficiency by up to 2.35 times with respect to state of the art contention managers, with an average gain of more than 60% when using 64 threads.","PeriodicalId":423007,"journal":{"name":"2015 44th International Conference on Parallel Processing","volume":"103 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124776388","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Due to synchronization overhead, it is challenging to scale the parallel simulation techniques of multi-core processors to larger systems. Although the use of a lax synchronization scheme reduces synchronization overhead and balances the load between synchronization points, it introduces timing errors. To improve the accuracy of lax synchronized simulations, we propose an error compensation technique, which leverages prediction methods to compensate for simulated-time deviations caused by timing errors. The rationale of our approach is that, in simulated multi-core processor systems, errors typically propagate via the delays of certain pivotal events that connect subsystem models across different hierarchies. By predicting delays based on the simulation results of the preceding pivotal events, our technique can eliminate errors from the predicted delays before they propagate to the models at higher hierarchies, thereby effectively improving the simulation accuracy. Since the predictions do not impose any constraints on synchronization, our approach largely maintains the scalability of lax synchronization schemes. Furthermore, the proposed mechanism is orthogonal to other parallel simulation techniques and can be used in conjunction with them. Experimental results show that error compensation improves the accuracy of lax synchronized simulations by 60.2% and achieves 98.2% accuracy when combined with an enhanced lax synchronization scheme.
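As a rough illustration of the compensation idea, the sketch below keeps a simple moving-average predictor of the delay of a pivotal event and charges the higher-level model with the prediction built from preceding pivotal events, so a timing error picked up between two lax synchronization points does not propagate upwards. The event type, window size, and numbers are invented, and the paper's actual prediction methods may differ.

```python
from collections import deque

class PivotalDelayPredictor:
    """Moving-average predictor for the delay of a 'pivotal event' (e.g. a
    request crossing from a core model into the interconnect model). The
    window size and the choice of a moving average are assumptions."""
    def __init__(self, window=32, initial_delay=100.0):
        self.history = deque(maxlen=window)
        self.initial_delay = initial_delay

    def observe(self, delay):
        self.history.append(delay)

    def predict(self):
        return (sum(self.history) / len(self.history)
                if self.history else self.initial_delay)

# Compensation step: the delay handed to the higher-level model is the
# prediction derived from preceding pivotal events, not the error-carrying
# value produced under lax synchronization.
predictor = PivotalDelayPredictor()
observed_delays = [100, 104, 98, 160, 101, 99]   # 160 carries a lax-sync error
for observed in observed_delays:
    charged = predictor.predict()
    predictor.observe(observed)
    print(f"observed {observed:4d} cycles -> charged {charged:6.1f} cycles")
```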
{"title":"Leveraging Error Compensation to Minimize Time Deviation in Parallel Multi-core Simulations","authors":"Xiaodong Zhu, Junmin Wu, Tao Li","doi":"10.1109/ICPP.2015.21","DOIUrl":"https://doi.org/10.1109/ICPP.2015.21","url":null,"abstract":"Due to synchronization overhead, it is challenging to apply the parallel simulation techniques of multi-core processors to a larger scale. Although the use of lax synchronization scheme reduces the synchronous overhead and balances the load between synchronous points, it introduces timing errors. To improve the accuracy of lax synchronized simulations, we propose an error compensation technique, which leverages prediction methods to compensate for simulated time deviations due to timing errors. The rationale of our approach is that, in the simulated multi-core processor systems the errors typically propagate via the delays of some pivotal events that connect subsystem models across different hierarchies. By predicting delays based on the simulation results of the preceding pivotal events, our techniques can eliminate errors from the predicted delays before they propagate to the models at higher hierarchies, thereby effectively improving the simulation accuracy. Since the predictions don't have any constraint on synchronizations, our approach largely maintains the scalability of lax synchronization schemes. Furthermore, our proposed mechanism is orthogonal to other parallel simulation techniques and can be used in conjunction with them. Experimental results show error compensation improves the accuracy of lax synchronized simulations by 60.2% and achieves 98.2% accuracy when combined with an enhanced lax synchronization.","PeriodicalId":423007,"journal":{"name":"2015 44th International Conference on Parallel Processing","volume":"52 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114587878","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
On-chip cache is often shared between processes that run concurrently on different cores of the same processor. Resource contention of this type causes performance degradation to the co-running processes. Contention-aware co-scheduling refers to the class of scheduling techniques that reduce this performance degradation. Most existing contention-aware co-schedulers only consider serial jobs. However, computing systems often run a mix of both parallel and serial jobs. In this paper, the problem of co-scheduling a mix of serial and parallel jobs is modelled as an Integer Programming (IP) problem. An existing IP solver can then be used to find the optimal co-scheduling solution that minimizes the performance degradation. However, we find that the IP-based method incurs high time overhead and can only be used to solve small-scale problems. Therefore, a graph-based method is also proposed in this paper to tackle the problem. We construct a co-scheduling graph to represent the co-scheduling problem and model the problem of finding the optimal co-scheduling solution as that of finding the shortest valid path in the co-scheduling graph. A heuristic A*-search algorithm (HA*) is then developed to find near-optimal solutions efficiently. Extensive experiments have been conducted to verify the effectiveness and efficiency of the proposed methods. The experimental results show that, compared with the IP-based method, HA* is able to find near-optimal solutions in much less time.
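The objective that the IP formulation minimizes can be shown on a toy instance. The sketch below exhaustively enumerates pairings of serial jobs onto dual-core processors and picks the one with the least total degradation; the degradation table is invented, and the paper's contribution is solving this problem (extended to parallel jobs) at scale via IP and the heuristic A* search rather than by enumeration.

```python
# Hypothetical pairwise degradation table: degradation[a][b] is the slowdown
# job `a` suffers when co-running with job `b` behind the same shared cache.
degradation = {
    "A": {"B": 0.10, "C": 0.30, "D": 0.05},
    "B": {"A": 0.20, "C": 0.15, "D": 0.25},
    "C": {"A": 0.35, "B": 0.10, "D": 0.40},
    "D": {"A": 0.05, "B": 0.30, "C": 0.20},
}

def pairings(jobs):
    """Enumerate every way of splitting the jobs into pairs (two cores
    sharing one cache)."""
    if not jobs:
        yield []
        return
    first, rest = jobs[0], jobs[1:]
    for partner in rest:
        remaining = [j for j in rest if j != partner]
        for tail in pairings(remaining):
            yield [(first, partner)] + tail

def total_degradation(pairing):
    return sum(degradation[a][b] + degradation[b][a] for a, b in pairing)

best = min(pairings(sorted(degradation)), key=total_degradation)
print(best, round(total_degradation(best), 2))
```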
{"title":"Modelling and Developing Co-scheduling Strategies on Multicore Processors","authors":"Huanzhou Zhu, Ligang He, Bo Gao, Kenli Li, Jianhua Sun, Hao Chen, Kuan-Ching Li","doi":"10.1109/ICPP.2015.31","DOIUrl":"https://doi.org/10.1109/ICPP.2015.31","url":null,"abstract":"On-chip cache is often shared between processes that run concurrently on different cores of the same processor. Resource contention of this type causes performance degradation to the co-running processes. Contention-aware co-scheduling refers to the class of scheduling techniques to reduce the performance degradation. Most existing contention-aware co-schedulers only consider serial jobs. However, there often exist both parallel and serial jobs in computing systems. In this paper, the problem of co-scheduling a mix of serial and parallel jobs is modelled as an Integer Programming (IP) problem. Then the existing IP solver can be used to find the optimal co-scheduling solution that minimizes the performance degradation. However, we find that the IP-based method incurs high time overhead and can only be used to solve small-scale problems. Therefore, a graph-based method is also proposed in this paper to tackle this problem. We construct a co-scheduling graph to represent the co-scheduling problem and model the problem of finding the optimal co-scheduling solution as the problem of finding the shortest valid path in the co-scheduling graph. A heuristic A*-search algorithm (HA*) is then developed to find the near-optimal solutions efficiently. The extensive experiments have been conducted to verify the effectiveness and efficiency of the proposed methods. The experimental results show that compared with the IP-based method, HA* is able to find the near-optimal solutions with much less time.","PeriodicalId":423007,"journal":{"name":"2015 44th International Conference on Parallel Processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129828624","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The OpenACC standard has been developed to simplify parallel programming of heterogeneous systems. Based on a set of high-level compiler directives, it allows application developers to offload code regions from a host CPU to an accelerator without the need for low-level programming with CUDA or OpenCL. Details are implicit in the programming model and managed by OpenACC API-enabled compilers and runtimes. However, the application developer can still explicitly specify several performance-related details for the execution. To tune an OpenACC program and efficiently utilize the available hardware resources, sophisticated performance analysis tools are required. In this paper we present a framework for detailed analysis of OpenACC applications. We describe new analysis capabilities introduced with an OpenACC tools interface and depict the integration of performance analysis for low-level programming models. As a proof of concept, we implemented the approach in the measurement infrastructure Score-P and the trace browser Vampir. This provides the program developer with a clearer understanding of the dynamic runtime behavior of the application and enables systematic identification of potential bottlenecks.
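The kind of analysis such a framework performs can be sketched generically: pair enter/leave events recorded for offloaded regions and aggregate inclusive time per region to expose hot spots. The event tuples and region names below are invented; the real OpenACC tools interface delivers far richer events (kernel launches, data transfers, waits) to Score-P.

```python
from collections import defaultdict

# Hypothetical event stream as a profiling interface might deliver it:
# (timestamp_us, 'enter'|'leave', region). Purely to show the aggregation
# step a trace analyzer performs.
events = [
    (0,    "enter", "compute region @ laplace.c:42"),
    (1800, "leave", "compute region @ laplace.c:42"),
    (1800, "enter", "update host @ laplace.c:60"),
    (2500, "leave", "update host @ laplace.c:60"),
    (2500, "enter", "compute region @ laplace.c:42"),
    (4200, "leave", "compute region @ laplace.c:42"),
]

open_since = {}
inclusive_us = defaultdict(int)
for ts, kind, region in events:
    if kind == "enter":
        open_since[region] = ts
    else:
        inclusive_us[region] += ts - open_since.pop(region)

# Print regions sorted by inclusive time, largest first.
for region, us in sorted(inclusive_us.items(), key=lambda kv: -kv[1]):
    print(f"{region:35s} {us:6d} us")
```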
{"title":"Open ACC Programs Examined: A Performance Analysis Approach","authors":"R. Dietrich, G. Juckeland, M. Wolfe","doi":"10.1109/ICPP.2015.40","DOIUrl":"https://doi.org/10.1109/ICPP.2015.40","url":null,"abstract":"The Open ACC standard has been developed to simplify parallel programming of heterogeneous systems. Based on a set of high-level compiler directives it allows application developers to offload code regions from a host CPU to an accelerator without the need for low-level programming with CUDA or Open CL. Details are implicit in the programming model and managed by Open ACC API-enabled compilers and runtimes. However, it is still possible for the application developer to explicitly specify several performance-related details for the execution. To tune an Open ACC program and efficiently utilize available hardware resources, sophisticated performance analysis tools are required. In this paper we present a framework for detailed analysis of Open ACC applications. We describe new analysis capabilities introduced with an Open ACC tools interface and depict the integration of performance analysis for low-level programming models. As proof of concept we implemented the concept into the measurement infrastructure Score-P and the trace browser Vampir. This provides the program developer with a clearer understanding of the dynamic runtime behavior of the application and for systematic identification of potential bottlenecks.","PeriodicalId":423007,"journal":{"name":"2015 44th International Conference on Parallel Processing","volume":"86 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124723699","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Virtualization is a key technology for cloud data centers to implement Infrastructure as a Service (IaaS) and to provide flexible and cost-effective resource sharing. It introduces an additional layer of abstraction that produces resource utilization overhead. Disregarding this overhead may seriously reduce the monitoring accuracy of cloud providers and may degrade VM performance. However, no previous work comprehensively investigates the virtualization overhead. In this paper, we comprehensively measure and study the relationship between the resource utilizations of virtual machines (VMs) and the resource utilizations of the device driver domain, the hypervisor and the physical machine (PM) under diverse workloads and scenarios in the Xen virtualization environment. We examine data from a real-world virtualized deployment to characterize VM workloads and assess their impact on the resource utilizations in the system. We show that the impact of virtualization overhead depends on the workloads, and that virtualization overhead is an important factor to consider in cloud resource provisioning. Based on the measurements, we build a regression model to estimate the resource utilization overhead of the PM resulting from providing virtualized resources to the VMs and from managing multiple VMs. Finally, our trace-driven real-world experimental results show the high accuracy of our model in predicting PM resource consumption in the cloud data center, and the importance of considering the virtualization overhead in cloud resource provisioning.
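A minimal sketch of the kind of regression model described here: fit physical-machine CPU utilization as a linear function of aggregate VM CPU utilization, aggregate VM I/O rate, and the number of VMs. The feature choice and every data point below are fabricated purely to show the shape of such a model and are not the paper's measurements.

```python
import numpy as np

# Each row: [sum of VM CPU utilization, sum of VM I/O rate (MB/s), number of VMs].
# Target: measured PM CPU utilization, including dom0/hypervisor work.
# All numbers are invented for illustration.
X = np.array([
    [0.20,  5.0, 1],
    [0.40, 10.0, 2],
    [0.35, 40.0, 2],
    [0.60, 20.0, 3],
    [0.80, 60.0, 4],
])
y = np.array([0.26, 0.52, 0.55, 0.78, 1.05])

# Least-squares fit of pm_util ~ b0 + b1*vm_cpu + b2*vm_io + b3*num_vms.
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
b0, b1, b2, b3 = coef

def predict_pm_util(vm_cpu, vm_io, num_vms):
    """Estimate PM CPU utilization, i.e. VM demand plus virtualization overhead."""
    return b0 + b1 * vm_cpu + b2 * vm_io + b3 * num_vms

print(round(predict_pm_util(vm_cpu=0.5, vm_io=30.0, num_vms=3), 3))
```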
{"title":"Profiling and Understanding Virtualization Overhead in Cloud","authors":"Liuhua Chen, Shilkumar Patel, Haiying Shen, Zhongyi Zhou","doi":"10.1109/ICPP.2015.12","DOIUrl":"https://doi.org/10.1109/ICPP.2015.12","url":null,"abstract":"Virtualization is a key technology for cloud data centers to implement infrastructure as a service (IaaS) and to provide flexible and cost-effective resource sharing. It introduces an additional layer of abstraction that produces resource utilization overhead. Disregarding this overhead may cause serious reduction of the monitoring accuracy of the cloud providers and may cause degradation of the VM performance. However, there is no previous work that comprehensively investigates the virtualization overhead. In this paper, we comprehensively measure and study the relationship between the resource utilizations of virtual machines (VMs) and the resource utilizations of the device driver domain, hypervisor and the physical machine (PM) with diverse workloads and scenarios in the Xen virtualization environment. We examine data from the real-world virtualized deployment to characterize VM workloads and assess their impact on the resource utilizations in the system. We show that the impact of virtualization overhead depends on the workloads, and that virtualization overhead is an important factor to consider in cloud resource provisioning. Based on the measurements, we build a regression model to estimate the resource utilization overhead of the PM resulting from providing virtualized resource to the VMs and from managing multiple VMs. Finally, our trace-driven real-world experimental results show the high accuracy of our model in predicting PM resource consumptions in the cloud datacenter, and the importance of considering the virtualization overhead in cloud resource provisioning.","PeriodicalId":423007,"journal":{"name":"2015 44th International Conference on Parallel Processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130119439","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
How to effectively distribute and share increasingly large volumes of data in large-scale network applications is a key challenge for the Internet infrastructure. Although NDN, a promising future Internet architecture that takes a data-oriented transfer approach, aims to address such needs better than IP, it still faces problems such as redundant data transmission and inefficient in-network cache utilization. This paper combines network coding techniques with NDN to improve network throughput and efficiency. The merit of our design is that it avoids duplicate and unproductive data delivery while transferring disjoint data segments along multiple paths, with no excessive modification to NDN fundamentals. To quantify the performance benefits of applying network coding in NDN, we integrate network coding into an NDN streaming media system implemented in the ndnSIM simulator. Using BRITE-generated network topologies in our simulation, the experimental results clearly demonstrate that incorporating network coding in NDN can significantly improve performance, reliability and QoS. More importantly, our approach is well suited to delivering growing Big Data applications, including high-performance and high-density video streaming services.
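The abstract does not spell out the coding scheme, so the sketch below illustrates the general mechanism with random linear network coding over GF(2): each coded packet is an XOR of a random subset of content segments, and a consumer can rebuild the content from any set of linearly independent coded packets, regardless of which paths or caches supplied them. Segment sizes, the field choice, and packet counts are assumptions.

```python
import os
import random

SEG_LEN = 8  # bytes per segment (tiny, for illustration)

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def encode(segments):
    """One coded packet: a random GF(2) combination (XOR) of the original
    segments, tagged with its coefficient vector."""
    coeffs = [random.randint(0, 1) for _ in segments]
    if not any(coeffs):
        coeffs[random.randrange(len(coeffs))] = 1
    payload = bytes(SEG_LEN)
    for c, seg in zip(coeffs, segments):
        if c:
            payload = xor(payload, seg)
    return coeffs, payload

def try_decode(packets, k):
    """Gauss-Jordan elimination over GF(2); returns the k original segments,
    or None if the packets are not yet linearly independent enough."""
    rows = [(coeffs[:], payload) for coeffs, payload in packets]
    for col in range(k):
        pivot = next((i for i in range(col, len(rows)) if rows[i][0][col]), None)
        if pivot is None:
            return None
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for i in range(len(rows)):
            if i != col and rows[i][0][col]:
                rows[i] = ([a ^ b for a, b in zip(rows[i][0], rows[col][0])],
                           xor(rows[i][1], rows[col][1]))
    return [rows[i][1] for i in range(k)]

segments = [os.urandom(SEG_LEN) for _ in range(4)]
decoded = None
while decoded is None:                       # any 4 independent packets suffice
    packets = [encode(segments) for _ in range(6)]
    decoded = try_decode(packets, k=4)
print(decoded == segments)
```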
{"title":"Network Coding for Effective NDN Content Delivery: Models, Experiments, and Applications","authors":"Kai Lei, Fangxing Zhu, Cheng Peng, Kuai Xu","doi":"10.1109/ICPP.2015.19","DOIUrl":"https://doi.org/10.1109/ICPP.2015.19","url":null,"abstract":"How to effectively distribute and share increasingly large volumes of data in large-scale network applications is a key challenge for Internet infrastructure. Although NDN, a promising new future internet architecture which takes data oriented transfer approaches, aims to better solve such needs than IP, it still faces problems like data redundancy transmission and inefficient in-network cache utilization. This paper combines network coding techniques to NDN to improve network throughput and efficiency. The merit of our design is that it is able to avoid duplicate and unproductive data delivery while transferring disjoint data segments along multiple paths and with no excess modification to NDN fundamentals. To quantify performance benefits of applying network coding in NDN, we integrate network coding into an NDN streaming media system implemented in the ndn SIM simulator. Basing on BRITE generated network topologies in our simulation, the experimental results clearly and fairly demonstrate that considering network coding in NDN can significantly improve the performance, reliability and QoS. More importantly, our approach is capable of and well fit for delivering growing Big Data applications including high-performance and high-density video streaming services.","PeriodicalId":423007,"journal":{"name":"2015 44th International Conference on Parallel Processing","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130703875","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mobile videos contain rich information which could be utilized for various applications, such as criminal investigation and scene reconstruction. Today's crowd-sourced mobile video retrieval systems are built on video content comparison, and their wide adoption has been hindered by the onerous computation of computer vision (CV) algorithms and the redundant network traffic of video transmission. In this work, we propose to leverage the Field of View (FoV) as a content-free descriptor to measure video similarity with little accuracy loss. Based on FoV, our system can filter out unmatched videos before any content analysis and video transmission, which dramatically cuts down the computation and communication cost of crowd-sourced mobile video retrieval. Moreover, we design a video segmentation algorithm and an R-tree based indexing structure to further reduce the network traffic for mobile clients and improve the efficiency of the cloud server. We implement a prototype system and evaluate it from different aspects. The results show that FoV descriptors are much smaller and significantly faster to extract and match compared to content descriptors, while the FoV-based similarity measurement achieves search accuracy comparable to the content-based method. Our evaluation also shows that the proposed retrieval scheme scales with data size and can respond in less than 100 ms when the data set has tens of thousands of video segments, while the network traffic between the client and the server is negligible.
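A minimal sketch of a content-free FoV similarity, assuming an FoV descriptor of the form (x, y, heading, view angle, visible distance) in a local metric coordinate system: each FoV is rasterized into grid cells and similarity is the Jaccard overlap of the covered cells. The descriptor layout, grid resolution, and sampling density are assumptions, not the paper's exact definition.

```python
import math

def covered_cells(x, y, heading_deg, view_angle_deg, radius, step=1.0):
    """Approximate the ground area covered by an FoV sector as a set of grid
    cells. (x, y) is the camera position in metres; heading is the compass
    direction of the optical axis (0 degrees = +y)."""
    cells = set()
    half = math.radians(view_angle_deg) / 2
    heading = math.radians(heading_deg)
    for i in range(int(radius / step) + 1):
        r = i * step
        for j in range(-20, 21):               # 41 rays across the sector
            theta = heading + half * j / 20
            cx = x + r * math.sin(theta)
            cy = y + r * math.cos(theta)
            cells.add((round(cx / step), round(cy / step)))
    return cells

def fov_similarity(f1, f2):
    """Jaccard overlap of two FoV descriptors (x, y, heading, angle, radius)."""
    a, b = covered_cells(*f1), covered_cells(*f2)
    return len(a & b) / len(a | b)

# Two cameras a few metres apart looking at roughly the same scene.
print(round(fov_similarity((0, 0, 45, 60, 30), (5, 0, 40, 60, 30)), 2))
```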
{"title":"Scan without a Glance: Towards Content-Free Crowd-Sourced Mobile Video Retrieval System","authors":"Cihang Liu, Lan Zhang, Kebin Liu, Yunhao Liu","doi":"10.1109/ICPP.2015.34","DOIUrl":"https://doi.org/10.1109/ICPP.2015.34","url":null,"abstract":"Mobile videos contain rich information which could be utilized for various applications, like criminal investigation and scene reconstruction. Today's crowd-sourced mobile video retrieval systems are built on video content comparison, and their wide adoption has been hindered by onerous computation of CV algorithms and redundant networking traffic of the video transmission. In this work, we propose to leverage Field of View(FoV) as a content-free descriptor to measure video similarity with little accuracy loss. Based on FoV, our system can filter out unmatched videos before any content analysis and video transmission, which dramatically cuts down the computation and communication cost for crowd-sourced mobile video retrieval. Moreover, we design a video segmentation algorithm and an R-Tree based indexing structure to further reduce the networking traffic for mobile clients and potentiate the efficiency for the cloud server. We implement a prototype system and evaluate it from different aspects. The results show that FoV descriptors are much smaller and significantly faster to extract and match compared to content descriptors, while the FoV based similarity measurement achieves comparable search accuracy with the content-based method. Our evaluation also shows that the proposed retrieval scheme is scalable with data size and can response in less than 100ms when the data set has tens of thousands of video segments, and the networking traffic between the client and the server is negligible.","PeriodicalId":423007,"journal":{"name":"2015 44th International Conference on Parallel Processing","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116996353","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cloud computing is often adopted to process big data for genome analysis due to its elasticity and pay-as-you-go features. In this paper, we present SCAN, a smart application platform to facilitate the parallelization of big genome analysis in clouds. With a knowledge base and an intelligent application scheduler, SCAN enables a better understanding of bio-applications' characteristics and helps orchestrate huge, heterogeneous tasks efficiently and cost-effectively. We conducted a simulation study and found that the SCAN platform is able to improve the performance of genome analysis and reduce its cost in a wide variety of circumstances.
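As a rough sketch of how a knowledge base and scheduler could interact (all tool profiles, instance types, and prices below are invented; SCAN's actual knowledge base and scheduling policy are not detailed in the abstract), a scheduler can match each analysis step's learned resource profile against a catalog of cloud instance types and pick the cheapest feasible one.

```python
# Hypothetical knowledge base: resource profiles of genome-analysis steps,
# learned from previous runs (all values invented for illustration).
KNOWLEDGE_BASE = {
    "align":   {"cores": 8,  "mem_gb": 16, "hours": 3.0},
    "sort":    {"cores": 4,  "mem_gb": 32, "hours": 1.0},
    "variant": {"cores": 16, "mem_gb": 64, "hours": 5.0},
}

# Hypothetical cloud instance catalog: (cores, memory in GB, price per hour).
INSTANCES = {
    "small":  (4,  16, 0.20),
    "medium": (8,  32, 0.40),
    "large":  (16, 64, 0.80),
}

def place(task):
    """Pick the cheapest instance type whose capacity covers the task's
    profile from the knowledge base, and estimate its cost."""
    need = KNOWLEDGE_BASE[task]
    feasible = [(price * need["hours"], name)
                for name, (cores, mem, price) in INSTANCES.items()
                if cores >= need["cores"] and mem >= need["mem_gb"]]
    cost, name = min(feasible)
    return name, cost

for task in KNOWLEDGE_BASE:
    print(task, *place(task))
```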
{"title":"SCAN: A Smart Application Platform for Empowering Parallelizations of Big Genomic Data Analysis in Clouds","authors":"W. Xing, W. Jie, Crispin J. Miller","doi":"10.1109/ICPP.2015.38","DOIUrl":"https://doi.org/10.1109/ICPP.2015.38","url":null,"abstract":"Cloud computing is often adopted to process big data for genome analysis due to its elasticity and pay-as-you-go features. In this paper, we present SCAN, a smart application platform to facilitate parallelization of big genome analysis in clouds. With a knowledge base and an intelligent application scheduler, the SCAN enables better understanding of bio-applications' characteristics, and helps to orchestrate huge, heterogeneous tasks efficiently and cost-effectively. We conducted a simulation study and found that the SCAN platform is able to improve the performance of genome analysis and reduce its cost in a wide variety of circumstances.","PeriodicalId":423007,"journal":{"name":"2015 44th International Conference on Parallel Processing","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123640320","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Warmup is a crucial issue in sampled microarchitectural simulation: it avoids performance bias by constructing accurate states for microarchitectural structures before each sampling unit. Not until very recently have researchers proposed Time-Based Sampling (TBS) for the sampled simulation of multi-threaded applications. However, warmup in TBS is challenging and complicated, because (i) full functional warmup in TBS causes very high overhead, limiting overall simulation speed, (ii) traditional adaptive functional warmup for sampling single-threaded applications cannot be readily applied to TBS, and (iii) checkpointing is inflexible (even invalid) due to the huge storage requirements and the variations across different runs of multi-threaded applications. In this work, we propose Shorter On-Line (SOL) warmup, which employs a two-stage strategy, using 'prime' warmup in the first stage and an extended 'No-State-Loss' (NSL) method in the second stage. SOL is a single-pass, on-line warmup technique that addresses the warmup challenges posed by TBS in parallel simulators. SOL is highly accurate and efficient, providing a good trade-off between simulation accuracy and speed, and is easily deployed to different TBS techniques. For the PARSEC benchmarks on a simulated 8-core system, two state-of-the-art TBS techniques with SOL warmup provide a 7.2× and 37× simulation speedup over detailed simulation, respectively, compared to 3.1× and 4.5× under full warmup. SOL sacrifices only 0.3% in absolute execution time prediction accuracy on average.
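The effect of warming microarchitectural state before a sampling unit can be illustrated with a toy last-level-cache model: instead of functionally replaying everything between samples (full warmup), only a short window just before the sampling unit is replayed to reconstruct cache contents. This mirrors the shape of a shortened, on-line warmup but is not SOL's actual two-stage prime/NSL mechanism; the trace, cache size, and window length are invented.

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, lines=256):
        self.lines, self.tags = lines, OrderedDict()
        self.hits = self.misses = 0

    def access(self, addr, count_stats=True):
        tag = addr // 64                       # 64-byte cache lines
        hit = tag in self.tags
        if hit:
            self.tags.move_to_end(tag)
        else:
            self.tags[tag] = True
            if len(self.tags) > self.lines:
                self.tags.popitem(last=False)  # evict LRU line
        if count_stats:
            self.hits += hit
            self.misses += not hit

# A synthetic address trace and one sampling unit within it.
trace = [(i * 64) % 8192 for i in range(200000)]
sample_start, sample_len, warm_len = 150000, 5000, 20000

cache = LRUCache()
# Shortened functional warmup: replay only a window just before the sampling
# unit (rather than the whole inter-sample gap) to reconstruct cache state.
for addr in trace[sample_start - warm_len:sample_start]:
    cache.access(addr, count_stats=False)
# Detailed measurement inside the sampling unit.
for addr in trace[sample_start:sample_start + sample_len]:
    cache.access(addr)
print(f"miss ratio in sampling unit: {cache.misses / sample_len:.3f}")
```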
{"title":"Shorter On-Line Warmup for Sampled Simulation of Multi-threaded Applications","authors":"Chuntao Jiang, Zhibin Yu, Hai Jin, Xiaofei Liao, L. Eeckhout, Yonggang Zeng, Chengzhong Xu","doi":"10.1109/ICPP.2015.44","DOIUrl":"https://doi.org/10.1109/ICPP.2015.44","url":null,"abstract":"Warm up is a crucial issue in sampled micro architectural simulation to avoid performance bias by constructing accurate states for micro-architectural structures before each sampling unit. Not until very recently have researchers proposed Time-Based Sampling (TBS) for the sampled simulation of multi-threaded applications. However, warm up in TBS is challenging and complicated, because (i) full functional warm up in TBS causes very high overhead, limiting overall simulation speed, (ii) traditional adaptive functional warm up for sampling single-threaded applications cannot be readily applied to TBS, and (iii) check pointing is inflexible (even invalid) due to the huge storage requirements and the variations across different runs for multi-threaded applications. In this work, we propose Shorter On-Line (SOL) warm up, which employs a two-stage strategy, using 'prime' warm up in the first stage, and an extended 'No-State-Loss (NSL)' method in the second stage. SOL is a single-pass, on-line warm up technique that addresses the warm up challenges posed in TBS in parallel simulators. SOL is highly accurate and efficient, providing a good trade-off between simulation accuracy and speed, and is easily deployed to different TBS techniques. For the PARSEC benchmarks on a simulated 8-core system, two state-of-the-art TBS techniques with SOL warm up provide a 7.2× and 37× simulation speedup over detailed simulation, respectively, compared to 3.1× and 4.5× under full warm up. SOL sacrifices only 0.3% in absolute execution time prediction accuracy on average.","PeriodicalId":423007,"journal":{"name":"2015 44th International Conference on Parallel Processing","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132164830","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}