Pub Date: 2014-12-01. DOI: 10.1109/PADSW.2014.7097855
Michael Matheny, Stephen Herbein, N. Podhorszki, S. Klasky, M. Taufer
On petascale systems, the selection of optimal values for I/O parameters without taking into account the I/O size and pattern can cause the I/O time to dominate the simulation time, compromising the application's scalability. In this paper, we adopt and adapt an engineering method called surrogate-based modeling to efficiently search for the optimal I/O parameter values and accurately predict the associated I/O times at the extreme scale. Our approach allows us to address both the search and prediction in a short time, even when the application's I/O is large and exhibits irregular patterns.
"Using surrogate-based modeling to predict optimal I/O parameters of applications at the extreme scale." In: 2014 20th IEEE International Conference on Parallel and Distributed Systems (ICPADS).
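The abstract does not detail the surrogate model itself; as a minimal illustration of the idea, the sketch below fits a cheap quadratic surrogate (in log2 of the parameter) to a handful of measured (parameter, I/O time) samples and queries it over a candidate grid instead of benchmarking every configuration. The stripe counts and timings are hypothetical.

```python
import numpy as np

def surrogate_search(samples, candidates):
    """Fit a quadratic surrogate of I/O time as a function of log2(param)
    to a few measured samples, then predict over a candidate grid and
    return the candidate with the lowest predicted I/O time."""
    params = np.log2([p for p, _ in samples])
    times = np.array([t for _, t in samples])
    coeffs = np.polyfit(params, times, deg=2)        # cheap surrogate model
    preds = np.polyval(coeffs, np.log2(candidates))  # predict, don't benchmark
    best = int(np.argmin(preds))
    return candidates[best], float(preds[best])

# Hypothetical measurements: I/O time (s) vs. stripe count, with a
# sweet spot around 32 stripes; only 5 runs cover a 6-point grid.
samples = [(8, 30.0), (16, 15.0), (32, 10.0), (64, 15.0), (128, 30.0)]
best_param, best_time = surrogate_search(samples, [8, 16, 32, 64, 128, 256])
```

Real surrogate-based optimization would iterate (sample, refit, resample near the predicted optimum); this shows only a single fit-and-predict step.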
Cloud computing, with its highly accessible and elastic computing resources, matches the demands of video services well: such services require massive storage and intensive computational power to store, transmit, compress, enhance, and analyze videos uploaded from commodity devices and surveillance cameras. However, most existing video processing programs are neither designed to run in parallel environments nor able to efficiently utilize the computational power of cloud platforms, which both wastes computing resources and increases the cost of using those platforms. In this paper, we present three strategies to improve multicore utilization for video processing: a producer-consumer model, intra-process overlapping, and inter-process overlapping. We evaluated our strategies on a video enhancement program that performs decoding, dehazing, and encoding; the results show that CPU utilization can be improved by up to 31% on an 8-core instance, which can significantly reduce cost in the long run.
"Achieving cost effective cloud video services via fine grained multicore scheduling" by Hao-Che Kao, Hao-Ping Kang, Che-Rung Lee, Kun-Hsien Lu, Shu-Hsin Chang. Pub Date: 2014-12-01. DOI: 10.1109/PADSW.2014.7097843. In: 2014 20th IEEE International Conference on Parallel and Distributed Systems (ICPADS).
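As a rough sketch of the producer-consumer strategy (one of the three strategies named above), the following pipeline overlaps the decode, dehaze, and encode stages across frames using bounded queues, so each stage can run on its own core. The per-frame functions are toy integer placeholders, not the paper's actual video kernels.

```python
import queue
import threading

SENTINEL = object()  # signals end of stream

def stage(fn, inq, outq):
    """Pull items from inq, apply fn, push to outq; forward the sentinel."""
    while True:
        item = inq.get()
        if item is SENTINEL:
            outq.put(SENTINEL)
            break
        outq.put(fn(item))

# Placeholder per-frame work standing in for the real decode/dehaze/encode.
decode = lambda f: f * 2
dehaze = lambda f: f + 1
encode = lambda f: f * f

def run_pipeline(frames):
    # Bounded queues give backpressure between stages.
    q0, q1, q2, q3 = (queue.Queue(maxsize=4) for _ in range(4))
    workers = [threading.Thread(target=stage, args=(fn, qi, qo))
               for fn, qi, qo in [(decode, q0, q1),
                                  (dehaze, q1, q2),
                                  (encode, q2, q3)]]
    for w in workers:
        w.start()
    for f in frames:
        q0.put(f)
    q0.put(SENTINEL)
    out = []
    while True:
        item = q3.get()
        if item is SENTINEL:
            break
        out.append(item)
    for w in workers:
        w.join()
    return out
```

While one frame is being encoded, the next is already being dehazed and a third decoded, which is what raises multicore utilization.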
Pub Date: 2014-12-01. DOI: 10.1109/PADSW.2014.7097821
Gangyong Jia, Guangjie Han, Liang Shi, Jian Wan, Dong Dai
The growing gap between microprocessor speed and DRAM speed is a major problem facing computer designers. To narrow this gap, it is necessary to improve DRAM's speed and throughput. Moreover, on multi-core platforms, the DRAM shared by all cores usually suffers from memory contention and interference, which can cause serious performance degradation and unfairness among concurrently running threads. To address these problems, this paper proposes techniques that combine two ideas: partitioning cores, threads, and memory banks into groups to reduce interference between groups, and grouping memory accesses to the same row together to reduce the row-buffer miss rate. We propose a memory optimization framework that combines thread scheduling with memory scheduling (CTMS), which simultaneously minimizes memory access schedule length and memory access time and reduces interference, maximizing performance for multi-core systems. Experimental results show that CTMS shortens memory access time by 12.6% while improving throughput by 11.8% on average. Moreover, CTMS also saves 5.8% of the energy consumption.
"Combine thread with memory scheduling for maximizing performance in multi-core systems." In: 2014 20th IEEE International Conference on Parallel and Distributed Systems (ICPADS).
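The row-grouping idea can be illustrated with a toy open-page DRAM model (a single bank and a hypothetical address layout, not CTMS itself): grouping accesses to the same row keeps the row buffer open, while interleaving accesses from two threads thrashes it.

```python
def row_buffer_hits(accesses, row_bits=10):
    """Count open-row hits in a single-bank, open-page DRAM model:
    an access hits if it targets the same row as the previous access."""
    hits, open_row = 0, None
    for addr in accesses:
        row = addr >> row_bits  # hypothetical: upper bits select the row
        if row == open_row:
            hits += 1
        open_row = row
    return hits

# Two threads touching different rows: interleaving their requests changes
# rows on every access, while grouping same-row requests keeps the row open.
thread_a = [0x0000 + i for i in range(8)]   # all in row 0
thread_b = [0x4000 + i for i in range(8)]   # all in row 16
interleaved = [a for pair in zip(thread_a, thread_b) for a in pair]
grouped = thread_a + thread_b
```

In this toy trace, the grouped order turns every access after the first in each run into a row-buffer hit, which is the effect CTMS's access grouping targets.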
Pub Date: 2014-12-01. DOI: 10.1109/PADSW.2014.7097830
Yong Su, Zheng Cao, Zhiguo Fan, Zhan Wang, Xiaoli Liu, Xiaobing Liu, Li Qiang, Xuejun An, Ninghui Sun
Communication locality is an important characteristic of parallel applications, and a great deal of research shows that exploiting it benefits most applications. Targeting communication locality, we present a hierarchical direct network topology that accelerates neighbor communication. By combining a mesh topology with a complete-graph topology, it can optimize local communication and build large-scale networks from low-radix routers. Analyzing the characteristics of the hierarchical topology, we find that it offers high cost-effectiveness and excellent expandability. We also design two minimal-path routing algorithms and compare the topology against Mesh, Dragonfly, and PERCS. The results show that the saturated throughput of the hierarchical topology is nearly 40% under a uniform random traffic trace and 70% under a local communication model with 4K nodes, indicating high scalability for applications with local communication and cost efficiency under uniform random traffic.
"Building a large-scale direct network with low-radix routers." In: 2014 20th IEEE International Conference on Parallel and Distributed Systems (ICPADS).
Pub Date: 2014-12-01. DOI: 10.1109/PADSW.2014.7097808
J. Langguth, Xing Cai
A recent trend in modern high-performance computing environments is the introduction of accelerators such as GPUs and the Xeon Phi: specialized computing devices that are optimized for highly parallel applications and coexist with CPUs. In regular compute-intensive applications with predictable data access patterns, these devices often outperform traditional CPUs by far, relegating the CPUs to pure control functions instead of computation. For irregular applications, however, the gap in relative performance can be much smaller, and is sometimes even reversed. Maximizing overall performance on such systems therefore requires making full use of all available computational resources. In this paper we study the attainable performance of the cell-centered finite volume method on 3D unstructured tetrahedral meshes using heterogeneous systems consisting of CPUs and multiple GPUs. Finite volume methods are widely used numerical strategies for solving partial differential equations; their advantages include built-in support for conservation laws and suitability for unstructured meshes. Our focus lies in demonstrating how a workload distribution that maximizes overall performance can be derived from the actual performance attained by the different computing devices in the heterogeneous environment.
We also highlight the dual role of partitioning software in reordering and partitioning the input mesh, giving rise to a new combined approach to partitioning.
"Heterogeneous CPU-GPU computing for the finite volume method on 3D unstructured meshes." In: 2014 20th IEEE International Conference on Parallel and Distributed Systems (ICPADS).
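One simple way to derive a workload distribution from measured device performance, in the spirit of the approach above (the abstract does not specify the exact scheme), is to split mesh cells proportionally to each device's observed throughput:

```python
def partition_cells(n_cells, throughputs):
    """Split n_cells among devices proportionally to their measured
    throughputs (cells/s), rounding while preserving the total."""
    total = sum(throughputs)
    shares = [int(n_cells * t / total) for t in throughputs]
    # Hand rounding leftovers to the fastest devices first.
    leftover = n_cells - sum(shares)
    order = sorted(range(len(throughputs)), key=lambda i: -throughputs[i])
    for i in range(leftover):
        shares[order[i]] += 1
    return shares
```

With throughputs of, say, 10, 30, and 60 kcells/s for a CPU and two GPUs, a 100k-cell mesh splits 10/30/60, so every device finishes its share at roughly the same time.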
Pub Date: 2014-12-01. DOI: 10.1109/PADSW.2014.7097837
A. Khawaja, Jiajun Wang, A. Gerstlauer, L. John, D. Malhotra, G. Biros
Adaptive mesh refinement (AMR) numerical methods utilizing octree data structures are an important class of HPC applications, in particular for the solution of partial differential equations. Much effort goes into implementing efficient versions of these programs, where the emphasis is often on increasing multi-node performance using GPUs and coprocessors. By contrast, our analysis aims to characterize these workloads on traditional CPUs, as we believe that single-threaded intra-node performance of critical kernels is still a key factor for achieving performance at scale. Irregular workloads such as AMR methods, however, exhibit especially severe underutilization on general-purpose processors. In this paper, we analyze the single-core performance of two state-of-the-art, highly scalable adaptive mesh refinement codes, one based on the Fast Multipole Method (FMM) and one based on the Finite Element Method (FEM), running on an x86 CPU. We examined both scalar and vectorized implementations to identify performance bottlenecks. We demonstrate that vectorization can provide a significant benefit in achieving high performance. The greatest bottleneck to peak performance is the high fraction of non-floating-point instructions in the kernels.
"Performance analysis of HPC applications with irregular tree data structures." In: 2014 20th IEEE International Conference on Parallel and Distributed Systems (ICPADS).
Pub Date: 2014-12-01. DOI: 10.1109/PADSW.2014.7097901
Guanglian Liu, Shigeng Zhang, Jianxin Wang, Xuan Liu
Radio Frequency IDentification (RFID) technology provides a promising solution to location discovery in indoor environments. Existing RFID reader positioning algorithms usually use all the collected reference tags to determine the position of the target reader, and are thus time-consuming as well as susceptible to communication irregularity between the reader and the reference tags. In particular, they usually perform poorly when the target reader is near a wall or in a corner. In this paper, we propose ArPat, an Accurate RFID reader Positioning algorithm that uses mere boundary reference Tags to calculate the position of the reader. Using only boundary tags effectively mitigates the negative impact of communication irregularity on localization accuracy. The localization accuracy of ArPat is better than 0.2 ft when the spacing between reference tags is 1 ft. Compared with state-of-the-art solutions for RFID reader positioning, ArPat improves localization accuracy by up to 42 percent, and by 36 percent on average. Furthermore, it uses a geometric approach rather than the iterative optimization approaches employed by previous solutions, making it superior in time efficiency: its computational time is nearly two orders of magnitude less than that of previous solutions. This is critical for a localization system that must provide real-time location discovery and tracking services.
"ArPat: Accurate RFID reader positioning with mere boundary tags." In: 2014 20th IEEE International Conference on Parallel and Distributed Systems (ICPADS).
Sparse Matrix-Transpose Vector Product (SMTVP) is a frequently used computation pattern in High Performance Computing applications. It is typically solved by transposition followed by a Sparse Matrix-Vector Product (SMVP) in current linear algebra packages. However, the transposition process can be a serious bottleneck on modern parallel computing platforms. A previous work proposed a relatively complex data structure for efficiently computing SMTVP with multi-core CPUs, but it proved to be inefficient on GPUs. In this work, we show that the Compressed Sparse Row (CSR) based SMVP algorithm can also be efficient for SMTVP computation on modern GPUs. The proposed method exploits atomic operations to perform the reduce operation in the computation of each inner product of a row in the transposed matrix and the vector. Experimental results show that the simple technique can outperform the SMTVP flow of transposition plus SMVP released in the CUSPARSE package by up to 405-fold.
"Atomic reduction based sparse matrix-transpose vector multiplication on GPUs" by Yuan Tao, Yangdong Deng, Shuai Mu, Mingfa Zhu, Limin Xiao, Li Ruan, Zhibin Huang. Pub Date: 2014-12-01. DOI: 10.1109/PADSW.2014.7097920. In: 2014 20th IEEE International Conference on Parallel and Distributed Systems (ICPADS).
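A CPU-side sketch of the idea: compute y = A^T x directly from the CSR representation by scattering each nonzero's contribution into the output vector, with NumPy's `np.add.at` standing in for the GPU atomic add used in the paper's scheme.

```python
import numpy as np

def smtvp_csr(indptr, indices, data, x, n_cols):
    """Compute y = A^T x from A in CSR form without building the transpose:
    each nonzero A[r, c] scatters data[k] * x[r] into y[c]. np.add.at is a
    (serial) analogue of the GPU atomic add that resolves write conflicts."""
    y = np.zeros(n_cols)
    for r in range(len(indptr) - 1):
        lo, hi = indptr[r], indptr[r + 1]
        np.add.at(y, indices[lo:hi], data[lo:hi] * x[r])
    return y

# A = [[1, 0, 2],
#      [0, 3, 0]] in CSR form
indptr = np.array([0, 2, 3])
indices = np.array([0, 2, 1])
data = np.array([1.0, 2.0, 3.0])
y = smtvp_csr(indptr, indices, data, np.array([1.0, 2.0]), n_cols=3)
# y == A^T @ x == [1.0, 6.0, 2.0]
```

On a GPU, rows would be processed in parallel and different rows may scatter into the same output entry, which is why an atomic reduction is needed; the explicit transposition step is avoided entirely.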
Pub Date: 2014-12-01. DOI: 10.1109/PADSW.2014.7097836
P. Duan, Chao Peng, Qin Zhu, Jingmin Shi, Haibin Cai
VCPS (Vehicular Cyber Physical Systems) is a special kind of networked cyber-physical system in which each vehicle is regarded as a communication unit. A vehicle's movement in a VCPS is constrained by roads and the environment, so traditional random mobility and waypoint mobility models cannot reflect realistic vehicle traces. Because vehicles move at high speed, the network topology changes constantly, which greatly undermines the stability of communication between vehicles. The diversity and complexity of traffic scenarios in VCPS also increase the difficulty of designing an efficient and stable routing protocol. In this paper, we combine SDN (Software Defined Networking) with VCPS and propose a new VCPS communication architecture, SD-VCPS, which makes a VCPS manageable by a remote controller. SD-VCPS can flexibly change routing policies for different traffic scenes or traffic periods, adjusting the topology of the VCPS to meet different network requirements.
We further present a new location-based routing protocol for SD-VCPS, and corroborate the efficiency of the proposed framework through experiments with the NS3 network simulator.
"Design and analysis of software defined Vehicular Cyber Physical Systems." In: 2014 20th IEEE International Conference on Parallel and Distributed Systems (ICPADS).
Pub Date: 2014-12-01. DOI: 10.1109/PADSW.2014.7097803
Sixiang Ma, Hao-peng Chen, Yuxi Shen, Heng Lu, Bin Wei, P. He
This paper presents the design, implementation, and evaluation of a multi-tiered storage system called MOBBS, which provides hybrid block storage for Virtual Machines (VMs) on top of object-based storage infrastructure. MOBBS is mainly motivated by the gap between the lack of studies on hybrid block storage for VMs and the increasing prevalence of hybrid storage systems. By striping disk images into partitions and intelligently storing them on different storage tiers according to real-time workload patterns, MOBBS achieves efficient use of multiple storage devices and relieves the burden of data placement. Leveraging the benefits of object-based storage, MOBBS can dynamically perform non-disruptive, fine-grained data migration between storage tiers and distribute the complexity of data migration across all storage nodes. These designs enable our system to deliver storage for VMs with high scalability and availability while making efficient use of SSDs. We evaluated a Ceph implementation of MOBBS using both block and file system workloads.
The results comprehensively demonstrate MOBBS's effectiveness in improving performance and in efficiently utilizing different storage devices.
"Providing hybrid block storage for virtual machines using object-based storage." In: 2014 20th IEEE International Conference on Parallel and Distributed Systems (ICPADS).
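The abstract does not spell out MOBBS's placement policy; a minimal workload-driven tiering sketch, assuming per-partition access counts are available, simply pins the hottest image partitions to the SSD tier and leaves the rest on HDD:

```python
def place_partitions(access_counts, ssd_slots):
    """Assign the ssd_slots most frequently accessed image partitions to
    the SSD tier and the rest to HDD. access_counts maps a (hypothetical)
    partition id to its observed access count over a recent window."""
    ranked = sorted(access_counts, key=access_counts.get, reverse=True)
    hot = set(ranked[:ssd_slots])
    return {p: ("ssd" if p in hot else "hdd") for p in access_counts}

# Hypothetical workload sample: one SSD slot, three partitions.
placement = place_partitions({"p0": 5, "p1": 90, "p2": 40}, ssd_slots=1)
```

A real system like MOBBS would rerun such a decision periodically and migrate partitions non-disruptively when their placement changes; this sketch covers only the ranking step.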