Pub Date: 2009-10-04 DOI: 10.1109/IISWC.2009.5306784
Danhua Guo, Guangdeng Liao, L. Bhuyan
Virtual Machine (VM) technology is experiencing a resurgence of interest as ubiquitous multi-core processors have become the de facto configuration on modern web servers. Multi-core servers potentially provide sufficient physical resources to realize the benefits of VMs, including performance isolation, manageability and scalability. However, the network performance of virtualized multi-core servers falls short of expectations. It is therefore important to understand the overhead implications. In this paper, we evaluate the network performance of a virtualized multi-core server using a TCP streaming microbenchmark (Iperf) and SPECweb2005. We first motivate our research by presenting the performance gap between native and virtualized environments. We then break down the overhead from an architectural viewpoint and show that the cache topology greatly influences performance. We also profile the Virtual Machine Monitor (VMM) at the function level to illustrate that functions in the current version of the Xen scheduler are the major contributors to the poor utilization of the cache topology. Consequently, we implement a static onloading scheme to separate interrupt handling from application processes and execute them on cores with cache affinity. Based on the observed benefits, we modify the Xen scheduler to migrate virtual CPUs dynamically to exploit the cache topology. Our results show that VM performance improves by an average of 12% for Iperf and 15% for SPECweb2005.
{"title":"Performance characterization and cache-aware core scheduling in a virtualized multi-core server under 10GbE","authors":"Danhua Guo, Guangdeng Liao, L. Bhuyan","doi":"10.1109/IISWC.2009.5306784","DOIUrl":"https://doi.org/10.1109/IISWC.2009.5306784","PeriodicalId":387816,"journal":{"name":"2009 IEEE International Symposium on Workload Characterization (IISWC)"},"PeriodicalIF":0.0,"publicationDate":"2009-10-04","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"124443967","RegionNum":0,"Total":0}
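The placement idea in the abstract above (run interrupt handling and the application on separate cores that share a cache) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the representation of the cache topology as sets of core IDs, and the four-core example are hypothetical and are not taken from the paper or from the Xen scheduler.

```python
from itertools import combinations


def shared_cache_pairs(cache_groups):
    """Return every pair of cores that sit behind the same cache.

    cache_groups is an iterable of sets of core IDs; each set lists the
    cores sharing one cache (hypothetical representation).
    """
    pairs = set()
    for group in cache_groups:
        for a, b in combinations(sorted(group), 2):
            pairs.add((a, b))
    return pairs


def place_with_affinity(cache_groups, busy=frozenset()):
    """Pick (interrupt_core, app_core): two distinct idle cores sharing a cache.

    Returns None when no such pair is free.
    """
    for a, b in sorted(shared_cache_pairs(cache_groups)):
        if a not in busy and b not in busy:
            return a, b
    return None


# Hypothetical topology: four cores, two per shared L2 cache.
topology = [{0, 1}, {2, 3}]
print(place_with_affinity(topology, busy={0}))
```

The sketch only chooses the cores. On Linux, a process could then be bound to its chosen core with Python's `os.sched_setaffinity`; how the paper's modified scheduler performs the equivalent step for virtual CPUs is not described in the abstract.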
Pub Date: 2009-10-04 DOI: 10.1109/IISWC.2009.5306796
Qiang Xu, J. Subhlok, Rong Zheng, S. Voss
Communication traces are integral to performance modeling and analysis of parallel programs. However, execution on a large number of nodes results in a large trace volume that is cumbersome and expensive to analyze. This paper presents an automatic framework to convert all process traces corresponding to the parallel execution of an SPMD MPI program into a single logical trace. First, the application communication matrix is generated from process traces. Next, topology identification is performed based on the underlying communication structure and independent of the way ranks (or numbers) are assigned to processes. Finally, message exchanges between physical processes are converted into logical message exchanges that represent similar message exchanges across all processes, resulting in a trace volume reduction approximately equal to the number of processes executing the application. This logicalization framework has been implemented and the results report on its performance and effectiveness.
{"title":"Logicalization of communication traces from parallel execution","authors":"Qiang Xu, J. Subhlok, Rong Zheng, S. Voss","doi":"10.1109/IISWC.2009.5306796","DOIUrl":"https://doi.org/10.1109/IISWC.2009.5306796","PeriodicalId":387816,"journal":{"name":"2009 IEEE International Symposium on Workload Characterization (IISWC)"},"PeriodicalIF":0.0,"publicationDate":"2009-10-04","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"122577886","RegionNum":0,"Total":0}
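The first and last steps named in the abstract above (building a communication matrix from per-process traces, then replacing physical exchanges with logical ones) can be sketched as follows. This is a deliberately simplified illustration, not the authors' framework: it assumes a trace is a list of `(dest_rank, nbytes)` sends, and it assumes ranks are already numbered along a ring so that a relative offset describes each exchange. The paper's topology identification, which works independently of rank numbering, is not reproduced. All names are hypothetical.

```python
from collections import Counter


def communication_matrix(traces):
    """Build an N x N matrix of bytes sent from each rank to each other rank.

    traces[r] is the list of (dest_rank, nbytes) sends recorded by rank r
    (an assumed trace format).
    """
    n = len(traces)
    matrix = [[0] * n for _ in range(n)]
    for src, events in enumerate(traces):
        for dst, nbytes in events:
            matrix[src][dst] += nbytes
    return matrix


def logicalize(traces):
    """Collapse per-rank traces into one logical trace of (offset, nbytes).

    Each send is rewritten relative to its sender: offset = (dst - src) mod N.
    If every rank shows the same multiset of relative sends, that multiset
    is returned as the single logical trace; otherwise None is returned.
    """
    n = len(traces)
    logical = [
        Counter(((dst - src) % n, nbytes) for dst, nbytes in events)
        for src, events in enumerate(traces)
    ]
    first = logical[0]
    if all(pattern == first for pattern in logical):
        return sorted(first.elements())
    return None


# Hypothetical 4-rank ring: each rank sends 8 bytes to both neighbours.
ring = [[((r + 1) % 4, 8), ((r - 1) % 4, 8)] for r in range(4)]
print(communication_matrix(ring)[0])
print(logicalize(ring))
```

In this toy ring, eight physical send records collapse to two logical ones, a reduction by a factor of four, which equals the number of processes and matches the reduction the abstract describes.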
Pub Date: 2009-10-04 DOI: 10.1109/IISWC.2009.5306789
M. Murphy, K. Keutzer, Hong Wang
High-quality cameras are a standard feature of mobile platforms, but the computational capabilities of mobile processors limit the applications capable of exploiting them. Emerging mobile application domains, for example Mobile Augmented Reality (MAR), rely heavily on techniques from computer vision, requiring sophisticated analyses of images followed by higher-level processing. An important class of image analyses is the detection of sparse localized interest points. The Scale Invariant Feature Transform (SIFT), the most popular such analysis, is computationally representative of many other feature extractors. Using a novel code-generation framework, we demonstrate that a small set of optimizations produces high-performance SIFT implementations for three very different architectures: a laptop CPU (Core 2 Duo), a low-power CPU (Intel Atom), and a low-power GPU (GMA X3100). We improve the runtime of SIFT by more than 5X on our low-power architectures, enabling a low-power mobile device to extract SIFT features up to 63% as fast as the laptop CPU.
{"title":"Image feature extraction for mobile processors","authors":"M. Murphy, K. Keutzer, Hong Wang","doi":"10.1109/IISWC.2009.5306789","DOIUrl":"https://doi.org/10.1109/IISWC.2009.5306789","PeriodicalId":387816,"journal":{"name":"2009 IEEE International Symposium on Workload Characterization (IISWC)"},"PeriodicalIF":0.0,"publicationDate":"2009-10-04","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"124060809","RegionNum":0,"Total":0}
Pub Date: 2009-10-04 DOI: 10.1109/IISWC.2009.5306797
Shuai Che, Michael Boyer, Jiayuan Meng, D. Tarjan, J. Sheaffer, Sang-Ha Lee, K. Skadron
This paper presents and characterizes Rodinia, a benchmark suite for heterogeneous computing. To help architects study emerging platforms such as GPUs (Graphics Processing Units), Rodinia includes applications and kernels which target multi-core CPU and GPU platforms. The choice of applications is inspired by Berkeley's dwarf taxonomy. Our characterization shows that the Rodinia benchmarks cover a wide range of parallel communication patterns, synchronization techniques and power consumption, and has led to some important architectural insight, such as the growing importance of memory-bandwidth limitations and the consequent importance of data layout.
{"title":"Rodinia: A benchmark suite for heterogeneous computing","authors":"Shuai Che, Michael Boyer, Jiayuan Meng, D. Tarjan, J. Sheaffer, Sang-Ha Lee, K. Skadron","doi":"10.1109/IISWC.2009.5306797","DOIUrl":"https://doi.org/10.1109/IISWC.2009.5306797","PeriodicalId":387816,"journal":{"name":"2009 IEEE International Symposium on Workload Characterization (IISWC)"},"PeriodicalIF":0.0,"publicationDate":"2009-10-04","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"122269023","RegionNum":0,"Total":0}
Pub Date: 2009-10-01 DOI: 10.1109/IISWC.2009.5306794
Sravanthi Kota Venkata, Ikkjin Ahn, Donghwan Jeon, Anshuman Gupta, Christopher M. Louie, Saturnino Garcia, Serge J. Belongie, M. Taylor
In the era of multi-core, computer vision has emerged as an exciting application area which promises to continue to drive the demand for both more powerful and more energy efficient processors. Although there is still a long way to go, vision has matured significantly over the last few decades, and the list of applications that are useful to end users continues to grow. The parallelism inherent in vision applications makes them a promising workload for multi-core and many-core processors.
{"title":"SD-VBS: The San Diego Vision Benchmark Suite","authors":"Sravanthi Kota Venkata, Ikkjin Ahn, Donghwan Jeon, Anshuman Gupta, Christopher M. Louie, Saturnino Garcia, Serge J. Belongie, M. Taylor","doi":"10.1109/IISWC.2009.5306794","DOIUrl":"https://doi.org/10.1109/IISWC.2009.5306794","PeriodicalId":387816,"journal":{"name":"2009 IEEE International Symposium on Workload Characterization (IISWC)"},"PeriodicalIF":0.0,"publicationDate":"2009-10-01","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"115169447","RegionNum":0,"Total":0}
Pub Date: 1900-01-01 DOI: 10.1109/iiswc.2009.5306805
Tom Conte (Georgia Tech), David August, Hillery Hunter, David Kaeli, Charles Levine
Program Committee: David August (Princeton), Leslie Barnes (AMD), Pradeep Dubey (Intel), Lieven Eeckhout (Ghent), Paolo Faraboschi (HP), Jim Held (Intel), Michael Hind (IBM Research), Hillery Hunter (IBM Research), David Kaeli (Northeastern), Hyesoon Kim (Georgia Tech), Hsien-Hsin Lee (Georgia Tech), Charles Levine (Microsoft), Markus Levy (EEMBC), Jose Martinez (Cornell), Onur Mutlu (CMU), Nacho Navarro (UPC), JoAnn Paul (Virginia Tech), Sanjay Patel (Illinois), Yale Patt (UT-Austin), Eric Rotenberg (NC State), Ravi Soundararajan (VMware), Wayne Wolf (Georgia Tech)
{"title":"IISWC 2009 organizing committee","authors":"Tom Conte (Georgia Tech), David August, Hillery Hunter, David Kaeli, Charles Levine","doi":"10.1109/iiswc.2009.5306805","DOIUrl":"https://doi.org/10.1109/iiswc.2009.5306805","PeriodicalId":387816,"journal":{"name":"2009 IEEE International Symposium on Workload Characterization (IISWC)"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"123502883","RegionNum":0,"Total":0}
Pub Date: 1900-01-01 DOI: 10.1109/iiswc.2009.5306802
David August, L. Barnes, Pradeep Dubey, L. Eeckhout, P. Faraboschi, J. Held, M. Hind, Sunpyo Hong, Hillery Hunter, D. Kaeli, Hyesoon Kim, Minjang Kim, Yoongu Kim, Nagesh B. Lakshminarayana, Hsien-Hsin Lee, Jaekyu Lee, Charles Levine, M. Levy, J. Martínez, Onur Mutlu, Nacho Navarro, J. Paul, S. Patel, Y. Patt, E. Rotenberg, Ravi Soundararajan
{"title":"IISWC 2009 reviewers","authors":"David August, L. Barnes, Pradeep Dubey, L. Eeckhout, P. Faraboschi, J. Held, M. Hind, Sunpyo Hong, Hillery Hunter, D. Kaeli, Hyesoon Kim, Minjang Kim, Yoongu Kim, Nagesh B. Lakshminarayana, Hsien-Hsin Lee, Jaekyu Lee, Charles Levine, M. Levy, J. Martínez, Onur Mutlu, Nacho Navarro, J. Paul, S. Patel, Y. Patt, E. Rotenberg, Ravi Soundararajan","doi":"10.1109/iiswc.2009.5306802","DOIUrl":"https://doi.org/10.1109/iiswc.2009.5306802","PeriodicalId":387816,"journal":{"name":"2009 IEEE International Symposium on Workload Characterization (IISWC)"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"133240371","RegionNum":0,"Total":0}