
Latest publications: IEEE International Symposium on Workload Characterization (IISWC'10)

Improving virtualization performance and scalability with advanced hardware accelerations
Pub Date : 2010-12-02 DOI: 10.1109/IISWC.2010.5649499
Yaozu Dong, Xudong Zheng, Xiantao Zhang, J. Dai, Jianhui Li, Xin Li, Gang Zhai, Haibing Guan
Many advanced hardware accelerations for virtualization, such as Pause Loop Exit (PLE), Extended Page Table (EPT), and Single Root I/O Virtualization (SR-IOV), have been introduced recently to improve virtualization performance and scalability. In this paper, we share our experience with the performance and scalability issues of virtualization, especially those brought about by modern multi-core and/or overcommitted systems. We then describe our work on implementing and optimizing the advanced hardware acceleration support in the latest version of Xen. Finally, we present performance evaluations and characterizations of these hardware accelerations, using both micro-benchmarks and a server consolidation benchmark (vConsolidate). The experimental results demonstrate up to a 77% improvement with these hardware accelerations, 49% of which is due to EPT and another 28% to SR-IOV.
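The EPT mechanism evaluated above replaces software shadow paging with a hardware two-stage address translation. A minimal conceptual sketch of those two stages (page numbers and mappings are illustrative, not Xen's implementation):

```python
# Conceptual sketch of EPT's two-stage translation: the guest page table
# maps guest-virtual to guest-physical frames, and the EPT maps
# guest-physical to host-physical frames. All mappings are illustrative.
PAGE_SIZE = 4096

guest_page_table = {0x1: 0x7, 0x2: 0x3}   # GVA page -> GPA page
ept              = {0x7: 0x42, 0x3: 0x9}  # GPA page -> HPA page

def translate(gva: int) -> int:
    """Translate a guest-virtual address to a host-physical address."""
    page, offset = divmod(gva, PAGE_SIZE)
    gpa_page = guest_page_table[page]      # stage 1: guest page walk
    hpa_page = ept[gpa_page]               # stage 2: EPT walk
    return hpa_page * PAGE_SIZE + offset

print(hex(translate(0x1ABC)))  # GVA page 0x1 -> GPA page 0x7 -> HPA page 0x42
```

In hardware both walks happen on a TLB miss, which is why EPT removes the hypervisor exits that shadow paging needs on guest page-table updates.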
Citations: 14
Parallelization and characterization of GARCH option pricing on GPUs
Pub Date : 2010-12-02 DOI: 10.1109/IISWC.2010.5648864
Ren-Shuo Liu, Yun-Cheng Tsai, Chia-Lin Yang
Option pricing is an important problem in computational finance due to the fast-growing market and increasing complexity of options. For option pricing, a model is required to describe the price process of the underlying asset. The GARCH model is one of the prominent option pricing models since it can model stochastic volatility of the underlying asset. To derive expected profit based on the GARCH model, tree-based simulations are one of the commonly used approaches. Tree-based GARCH option pricing is compute-intensive since the tree grows exponentially, and it requires enormous floating point arithmetic operations. In this paper, we present the first work on accelerating the tree-based GARCH option pricing on GPUs with CUDA. As the conventional tree data structure is not memory access friendly to GPUs, we propose a new family of tree data structures which position concurrently accessed nodes in contiguous and aligned memory locations. Moreover, to reduce memory bandwidth requirement, we apply fusion optimization, which combines two threads into one to keep data with temporal locality in register files. Our results show 50× speedup compared to a multi-threaded program on a 4-core CPU.
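The GARCH(1,1) volatility recursion underlying such tree simulations can be sketched in a few lines (parameter values here are illustrative, not from the paper):

```python
def garch_variance_path(returns, omega=1e-6, alpha=0.1, beta=0.85, h0=1e-4):
    """GARCH(1,1) conditional variance: h_t = omega + alpha*r_{t-1}^2 + beta*h_{t-1}.
    Each step's variance depends on the previous return and variance, which is
    why a simulation tree over possible future returns grows exponentially
    with depth."""
    h = [h0]
    for r in returns[:-1]:
        h.append(omega + alpha * r * r + beta * h[-1])
    return h

path = garch_variance_path([0.01, 0.02, 0.0])
print(path)
```

A pricing tree evaluates this recursion along every branch of simulated returns, so the node count (and hence the floating-point work) compounds at each level.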
Citations: 4
Performance characterization and acceleration of Optical Character Recognition on handheld platforms
Pub Date : 2010-12-02 DOI: 10.1109/IISWC.2010.5648852
S. Srinivasan, Li Zhao, Lin Sun, Zhen Fang, Peng Li, Tao Wang, R. Iyer, Dong Liu
Optical Character Recognition (OCR) converts images of handwritten or printed text captured by camera or scanner into editable text. OCR has seen limited adoption in mobile platforms due to the performance constraints of these systems. Intel® Atom™ processors have enabled general purpose applications to be executed on handheld devices. In this paper, we analyze a reference implementation of the OCR workload on a low power general purpose processor and identify the primary hotspot functions that incur a large fraction of the overall response time. We also present a detailed architectural characterization of the hotspot functions in terms of CPI, MPI, etc. We then implement and analyze several software/algorithmic optimizations such as i) Multi-threading, ii) image sampling for a hotspot function and iii) miscellaneous code optimization. Our results show that up to 2X performance improvement in execution time of the application and almost 9X improvement for a hotspot can be achieved by using various software optimizations. We designed and implemented a hardware accelerator for one of the hotspots to further reduce the execution time and power. Overall, we believe our analysis provides a detailed understanding of the processing overheads for OCR running on a new class of low power compute platforms.
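The gap between the ~9X hotspot gain and the ~2X whole-application gain reported above is an Amdahl's-law effect. A quick sanity check (the hotspot fraction below is illustrative, not a measurement from the paper):

```python
def overall_speedup(hotspot_fraction, hotspot_speedup):
    """Amdahl's law: only the accelerated fraction of runtime shrinks."""
    return 1.0 / ((1.0 - hotspot_fraction) + hotspot_fraction / hotspot_speedup)

# If the hotspot were ~56% of total runtime, a 9X hotspot gain yields ~2X overall.
print(round(overall_speedup(0.56, 9.0), 2))  # ~1.99
```

The same formula explains why the authors pair software optimization with a hardware accelerator: once the hotspot is fast, the remaining serial portion dominates.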
Citations: 8
Exploiting approximate value locality for data synchronization on multi-core processors
Pub Date : 2010-12-02 DOI: 10.1109/IISWC.2010.5650333
Jaswanth Sreeram, S. Pande
This paper shows that for a variety of parallel “soft computing” programs that use optimistic synchronization, the approximate nature of the values produced during execution can be exploited to improve performance significantly. Specifically, through mechanisms for imprecise sharing of values between threads, the amount of contention in these programs can be reduced, thereby avoiding expensive aborts and improving parallel performance while keeping the results produced by the program within the bounds of an acceptable approximation. This is made possible due to our observation that for many such programs, a large fraction of the values produced during execution exhibit a substantial amount of value locality. We describe how this locality can be exploited using extensions to C/C++ language types that allow specification of limits on the precision and accuracy required and a novel value-aware conflict detection scheme that minimizes the number of conflicts while respecting these limits. Our experiments indicate that for the programs studied substantial speedups can be achieved - up to 5.7x over the original program for the same number of threads. We also present experimental evidence that for the programs studied, the amount of error introduced often grows relatively slowly.
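The core of a value-aware conflict check like the one described can be sketched as a tolerance predicate (the function and tolerance value are illustrative; the paper's scheme is driven by C/C++ type annotations):

```python
# Sketch of value-aware conflict detection: a concurrent update counts as a
# conflict only if it moves the value outside the tolerance the programmer
# declared for it, so "close enough" values commit without aborting.
def conflicts(read_value, committed_value, tolerance):
    """Return True if the value read by a transaction is now too stale."""
    return abs(read_value - committed_value) > tolerance

assert not conflicts(10.0, 10.4, tolerance=0.5)  # within bounds: no abort
assert conflicts(10.0, 11.0, tolerance=0.5)      # exceeds bounds: abort
```

Compared with exact read/write-set intersection, this predicate lets logically conflicting but numerically harmless updates proceed, which is where the reduction in aborts comes from.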
Citations: 12
Data handling inefficiencies between CUDA, 3D rendering, and system memory
Pub Date : 2010-12-02 DOI: 10.1109/IISWC.2010.5648828
Brian Gordon, S. Sohoni, D. Chandler
While GPGPU programming offers faster computation of highly parallelized code, the memory bandwidth between the system and the GPU can create a bottleneck that reduces the potential gains. CUDA is a prominent GPGPU API which can transfer data to and from system code, and which can also access data used by 3D rendering APIs. In an application that relies on both GPU programming APIs to accelerate 3D modeling and an easily parallelized algorithm, the hidden inefficiencies of NVIDIA's data handling with CUDA become apparent. First, CUDA uses the CPU's store units to copy data between the graphics card and system memory instead of using a more efficient method like DMA. Second, data exchanged between the two GPU-based APIs travels through the main processor instead of staying on the GPU. As a result, a non-GPGPU implementation of a program runs faster than the same program using GPGPU.
Citations: 13
Real Java applications in software transactional memory
Pub Date : 2010-12-02 DOI: 10.1109/IISWC.2010.5654431
T. Nakaike, Rei Odaira, T. Nakatani, Maged M. Michael
Transactional Memory (TM) shows promise as a new concurrency control mechanism to replace lock-based synchronization. However, there have been few studies of TM systems with real applications, and the real-world benefits and barriers of TM remain unknown. In this paper, we present a detailed analysis of the behavior of real applications on a software transactional memory system. Based on this analysis, we aim to clarify what programming work is required to achieve reasonable performance in TM-based applications. We selected three existing Java applications: (1) HSQLDB, (2) the Geronimo application server, and (3) the GlassFish application server, because each application has a scalability problem caused by lock contention. We identified the critical sections where lock contention frequently occurs, and modified the source code so that the critical sections are executed transactionally. However, this simple modification proved insufficient to achieve reasonable performance because of excessive data conflicts. We found that most of the data conflicts were caused by application-level optimizations such as reusing objects to reduce the memory usage. After modifying the source code to disable those optimizations, the TM-based applications showed higher or competitive performance compared to lock-based applications. Another finding is that the number of variables that actually cause data conflicts is much smaller than the number of variables that can be accessed in critical sections. This implies that the performance tuning of TM-based applications may be easier than that of lock-based applications, where we need to take care of all of the variables that can be accessed in the critical sections.
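The object-reuse pathology the authors found can be sketched as write-set intersection (identifiers here are illustrative, not from the studied applications):

```python
# Sketch of why object pooling inflates STM conflicts: two logically
# independent transactions that draw scratch objects from a shared pool end
# up writing the same locations, so their write sets intersect and one must
# abort. Fresh per-transaction allocations keep the write sets disjoint.
def write_sets_conflict(write_set_a, write_set_b):
    """An STM-style check: transactions conflict if their writes overlap."""
    return bool(set(write_set_a) & set(write_set_b))

pooled_buffer = "pool_slot_0"
assert write_sets_conflict([pooled_buffer], [pooled_buffer])      # reuse: conflict
assert not write_sets_conflict(["fresh_obj_1"], ["fresh_obj_2"])  # fresh: no conflict
```

Disabling the pooling optimization trades some allocation and memory overhead for disjoint write sets, which is why the modified TM versions scaled better.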
Citations: 5
Performance variations of two open-source cloud platforms
Pub Date : 2010-12-02 DOI: 10.1109/IISWC.2010.5650280
Yohei Ueda, T. Nakatani
The performance of workloads running on cloud platforms varies significantly depending on the cloud platform configurations. We evaluated the performance variations using two open-source cloud platforms, OpenNebula and Eucalyptus.
Citations: 22
Runtime workload behavior prediction using statistical metric modeling with application to dynamic power management
Pub Date : 2010-12-02 DOI: 10.1109/IISWC.2010.5650339
R. Sarikaya, C. Isci, A. Buyuktosunoglu
Adaptive computing systems rely on accurate predictions of workload behavior to understand and respond to the dynamically-varying application characteristics. In this study, we propose a Statistical Metric Model (SMM) that is system- and metric-independent for predicting workload behavior. SMM is a probability distribution over workload patterns and it attempts to model how frequently a specific behavior occurs. The Maximum Likelihood Estimation (MLE) criterion is used to estimate the parameters of the SMM. The model parameters are further refined with a smoothing method to improve prediction robustness. The SMM learns the application patterns during runtime as applications run, and at the same time predicts the upcoming program phases based on what it has learned so far. An extensive and rigorous series of prediction experiments demonstrates the superior performance of the SMM predictor over existing predictors on a wide range of benchmarks. For some of the benchmarks, SMM improves prediction accuracy by 10X and 3X, compared to the existing last-value and table-based prediction approaches respectively. SMM's improved prediction accuracy results in superior power-performance trade-offs when it is applied to dynamic power management.
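The two baselines SMM is compared against can be sketched over a toy phase trace (phase labels are illustrative): a last-value predictor repeats the previous phase, while a table-based predictor makes a maximum-likelihood choice of the most frequent observed successor of the current phase.

```python
from collections import Counter, defaultdict

def last_value_predict(history):
    """Predict that the next phase equals the most recent one."""
    return history[-1]

def table_predict(history):
    """Predict the most frequent successor of the current phase, i.e. a
    simple MLE over the observed phase transitions."""
    table = defaultdict(Counter)
    for prev, nxt in zip(history, history[1:]):
        table[prev][nxt] += 1
    cur = history[-1]
    return table[cur].most_common(1)[0][0] if table[cur] else cur

phases = ["A", "B", "A", "B", "A"]
print(last_value_predict(phases))  # "A" -- wrong for an alternating trace
print(table_predict(phases))       # "B" -- "A" is always followed by "B"
```

An alternating trace like this is exactly where a last-value predictor fails and a transition table succeeds; SMM generalizes the table idea to a smoothed probability distribution over longer patterns.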
Citations: 22
Tackling the challenges of server consolidation on multi-core systems
Pub Date : 2010-12-02 DOI: 10.1109/IISWC.2010.5654398
Hui Lv, Xudong Zheng, Zhiteng Huang, Jiangang Duan
With increasing demand to reduce system operation cost amidst growing adoption of virtualization technologies, consolidating multiple servers onto a single physical machine is fast becoming typical practice in modern data centers. This trend has become more obvious with advances in system design that put more and more multi-core CPUs into one system. As a result, it is interesting to investigate the challenges of virtualization on top of a multi-core system and the scalability of the consolidation workload on top of it.
Citations: 8
Eigenbench: A simple exploration tool for orthogonal TM characteristics
Pub Date : 2010-12-02 DOI: 10.1109/IISWC.2010.5648812
Sungpack Hong, Tayo Oguntebi, J. Casper, N. Bronson, C. Kozyrakis, K. Olukotun
There are a significant number of Transactional Memory(TM) proposals, varying in almost all aspects of the design space. Although several transactional benchmarks have been suggested, a simple, yet thorough, evaluation framework is still needed to completely characterize a TM system and allow for comparison among the various proposals. Unfortunately, TM system evaluation is difficult because the application characteristics which affect performance are often difficult to isolate from each other. We propose a set of orthogonal application characteristics that form a basis for transactional behavior and are useful in fully understanding the performance of a TM system. In this paper, we present EigenBench, a lightweight yet powerful microbenchmark for fully evaluating a transactional memory system. We show that EigenBench is useful for thoroughly exploring the orthogonal space of TM application characteristics. Because of its flexibility, our microbenchmark is also capable of reproducing a representative set of TM performance pathologies. In this paper, we use Eigenbench to evaluate two well-known TM systems and provide significant insight about their strengths and weaknesses. We also demonstrate how EigenBench can be used to mimic the evaluation coverage of a popular TM benchmark suite called STAMP.
{"title":"Eigenbench: A simple exploration tool for orthogonal TM characteristics","authors":"Sungpack Hong, Tayo Oguntebi, J. Casper, N. Bronson, C. Kozyrakis, K. Olukotun","doi":"10.1109/IISWC.2010.5648812","DOIUrl":"https://doi.org/10.1109/IISWC.2010.5648812","url":null,"abstract":"There are a significant number of Transactional Memory(TM) proposals, varying in almost all aspects of the design space. Although several transactional benchmarks have been suggested, a simple, yet thorough, evaluation framework is still needed to completely characterize a TM system and allow for comparison among the various proposals. Unfortunately, TM system evaluation is difficult because the application characteristics which affect performance are often difficult to isolate from each other. We propose a set of orthogonal application characteristics that form a basis for transactional behavior and are useful in fully understanding the performance of a TM system. In this paper, we present EigenBench, a lightweight yet powerful microbenchmark for fully evaluating a transactional memory system. We show that EigenBench is useful for thoroughly exploring the orthogonal space of TM application characteristics. Because of its flexibility, our microbenchmark is also capable of reproducing a representative set of TM performance pathologies. In this paper, we use Eigenbench to evaluate two well-known TM systems and provide significant insight about their strengths and weaknesses. We also demonstrate how EigenBench can be used to mimic the evaluation coverage of a popular TM benchmark suite called STAMP.","PeriodicalId":107589,"journal":{"name":"IEEE International Symposium on Workload Characterization (IISWC'10)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-12-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127960294","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 74
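The EigenBench abstract above describes decomposing transactional behavior into orthogonal, independently tunable characteristics. As a rough illustration of that idea — not EigenBench's actual implementation, and all names and parameters below are invented for illustration — the sketch drives a toy optimistic software TM with a few such knobs (transaction length, size of a contended region, and the fraction of accesses that hit it) and reports the resulting abort count:

```python
import random
import threading

class MiniSTM:
    """Toy word-granularity optimistic STM: validate versions at commit, retry on conflict."""
    def __init__(self, size):
        self.mem = [0] * size        # data words
        self.ver = [0] * size        # per-word version numbers
        self.commit_lock = threading.Lock()

    def run_tx(self, write_addrs):
        """Increment every word in write_addrs atomically; return aborts before success."""
        aborts = 0
        while True:
            snapshot = {a: self.ver[a] for a in write_addrs}   # optimistic read phase
            new_vals = {a: self.mem[a] + 1 for a in write_addrs}
            with self.commit_lock:                             # validate, then write back
                if all(self.ver[a] == v for a, v in snapshot.items()):
                    for a, val in new_vals.items():
                        self.mem[a] = val
                        self.ver[a] += 1
                    return aborts
            aborts += 1                                        # another tx committed first

def eigen_knobs_run(n_threads=4, txs_per_thread=200, tx_len=4,
                    hot_fraction=0.5, hot_words=8):
    """Run transactions whose conflict behavior is set by orthogonal knobs.

    tx_len       -- working-set size of each transaction
    hot_fraction -- probability an access targets the small contended region
    hot_words    -- size of the contended region (smaller means more conflicts)
    Returns the total abort count summed across all threads.
    """
    stm = MiniSTM(hot_words + n_threads * tx_len)
    abort_counts = []
    def worker(tid):
        rng = random.Random(tid)                 # deterministic per-thread stream
        private_base = hot_words + tid * tx_len  # this thread's conflict-free words
        aborts = 0
        for _ in range(txs_per_thread):
            addrs = [rng.randrange(hot_words) if rng.random() < hot_fraction
                     else private_base + i for i in range(tx_len)]
            aborts += stm.run_tx(addrs)
        abort_counts.append(aborts)              # list.append is atomic under the GIL
    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(abort_counts)
```

With `hot_fraction=0.0` every transaction touches only its thread's private words, so the abort count stays at zero; raising `hot_fraction` or shrinking `hot_words` sweeps the workload toward the contended end of the space — the same kind of one-knob-at-a-time sweep the abstract attributes to EigenBench's orthogonal characteristics.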