
2009 IEEE International Symposium on Workload Characterization (IISWC): latest papers

Evaluation of disk-level workloads at different time-scales
Pub Date : 2009-10-04 DOI: 10.1145/1639562.1639589
Alma Riska, E. Riedel
In this paper, we characterize three different sets of disk-level traces collected from enterprise systems. The data sets differ in the granularity of the recorded information and are accordingly called the Millisecond, Hour, and Lifetime traces. We analyze disk-level utilization, the availability of idleness, and the dynamics of read and write traffic, both over time and across an entire drive family. Our evaluation confirms that disk drives operate at moderate utilization and experience long stretches of idleness. The workload arriving at the disk is bursty across all time scales evaluated. There is also variability across drives of the same family, with a portion of them fully utilizing the available disk bandwidth for hours at a time.
Citations: 19
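The utilization and idleness measures mentioned in the abstract above can be illustrated with a small sketch. It is not the authors' tooling: the trace format (arrival time and service time in milliseconds), the single-queue FIFO disk model, and the function name `busy_and_idle` are all assumptions made for illustration.

```python
from typing import List, Tuple


def busy_and_idle(requests: List[Tuple[float, float]], window_end: float):
    """Return (utilization, idle_intervals) for a hypothetical request trace.

    requests: (arrival_ms, service_ms) pairs sorted by arrival time.
    Assumes one FIFO disk, a window starting at time 0, window_end > 0,
    and that all work completes before window_end.
    """
    idle_intervals = []
    busy_time = 0.0
    free_at = 0.0  # time at which the disk finishes all queued work
    for arrival, service in requests:
        if arrival > free_at:
            # The disk sat idle between finishing and this arrival.
            idle_intervals.append(arrival - free_at)
            start = arrival
        else:
            # The request waits in the queue until the disk is free.
            start = free_at
        free_at = start + service
        busy_time += service
    if window_end > free_at:
        # Trailing idleness up to the end of the observation window.
        idle_intervals.append(window_end - free_at)
    return busy_time / window_end, idle_intervals


utilization, idle = busy_and_idle([(0.0, 5.0), (2.0, 5.0), (50.0, 10.0)], 100.0)
print(utilization, idle)
```

In this toy trace a short burst is followed by long idle stretches, which is the kind of pattern the abstract reports at larger scale.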
Evaluation of the Intel® Core™ i7 Turbo Boost feature
Pub Date : 2009-10-04 DOI: 10.1109/IISWC.2009.5306782
James Charles, Preet Jassi, N. Ananth, Abbas Sadat, Alexandra Fedorova
The Intel® Core™ i7 processor, code-named Nehalem, has a novel feature called Turbo Boost that dynamically varies the frequencies of the processor's cores. The frequency of a core is determined by core temperature, the number of active cores, the estimated power and the estimated current consumption. We perform an extensive analysis of the Turbo Boost technology to characterize its behavior in varying workload conditions. In particular, we analyze how the activation of Turbo Boost is affected by inherent properties of applications (i.e., their rate of memory accesses) and by the overall load imposed on the processor. Furthermore, we analyze the capability of Turbo Boost to mitigate Amdahl's law by accelerating sequential phases of parallel applications. Finally, we estimate the impact of the Turbo Boost technology on the overall energy consumption. We found that Turbo Boost can provide (on average) up to a 6% reduction in execution time but can result in an increase in energy consumption of up to 16%. Our results also indicate that Turbo Boost sets the processor to operate at maximum frequency (where it has the potential to provide the maximum gain in performance) when the mapping of threads to hardware contexts is sub-optimal.
Citations: 149
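To put the two figures quoted in the abstract above side by side, the following sketch computes the implied average power and energy-delay product relative to a baseline of 1.0. It assumes, purely for illustration, that the 6% time reduction and the 16% energy increase occur on the same run; the abstract states both as upper bounds and does not say they coincide. The function name is hypothetical.

```python
def energy_delay_tradeoff(time_reduction: float, energy_increase: float) -> dict:
    """Relative average power and energy-delay product versus a 1.0 baseline."""
    rel_time = 1.0 - time_reduction      # e.g. 0.94 for a 6% reduction
    rel_energy = 1.0 + energy_increase   # e.g. 1.16 for a 16% increase
    return {
        "avg_power": rel_energy / rel_time,  # power = energy / time
        "edp": rel_energy * rel_time,        # energy-delay product
    }


result = energy_delay_tradeoff(0.06, 0.16)
print(result)
```

Under that assumption the relative energy-delay product is 1.16 × 0.94 = 1.0904, so the run would finish sooner but with a worse energy-delay product than the baseline.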
Phoenix rebirth: Scalable MapReduce on a large-scale shared-memory system
Pub Date : 2009-10-04 DOI: 10.1109/IISWC.2009.5306783
Richard M. Yoo, Anthony Romano, C. Kozyrakis
Dynamic runtimes can simplify parallel programming by automatically managing concurrency and locality without further burdening the programmer. Nevertheless, implementing such runtime systems for large-scale, shared-memory systems can be challenging. This work optimizes Phoenix, a MapReduce runtime for shared-memory multi-cores and multiprocessors, on a quad-chip, 32-core, 256-thread UltraSPARC T2+ system with NUMA characteristics. We show how a multi-layered approach that comprises optimizations on the algorithm, implementation, and OS interaction leads to significant speedup improvements with 256 threads (average of 2.5× higher speedup, maximum of 19×). We also identify the roadblocks that limit the scalability of parallel runtimes on shared-memory systems, which are inherently tied to the OS scalability on large-scale systems.
Citations: 266
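For readers unfamiliar with the programming model named in the abstract above, here is a minimal shared-memory MapReduce sketch (word count). It is not Phoenix's API, which the abstract does not describe; the function names are hypothetical, and Python threads are used only to show the map and reduce structure, since CPython's global interpreter lock prevents real speedup for this CPU-bound work.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Iterable


def map_chunk(chunk: str) -> Counter:
    """Map phase: count the words in one chunk of the input."""
    return Counter(chunk.split())


def word_count(chunks: Iterable[str], workers: int = 4) -> Counter:
    """Run the map phase on a thread pool, then merge the partial counts."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    total = Counter()
    for partial in partials:
        total.update(partial)  # reduce phase: sum the counts per word
    return total


print(word_count(["a b a", "b c"]))
```

All workers share one address space, so intermediate results are passed by reference rather than copied over a network, which is the setting the abstract targets.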
Browser workload characterization for an Ajax-based commercial online service
Pub Date : 2009-10-04 DOI: 10.1109/IISWC.2009.5306780
Shu Xu, Bo Huang, Junyong Ding, J. Dai
The transition to cloud computing and SaaS is a disruptive trend in which users can conveniently access services through a browser from any client. In addition, with the prevalence of Web 2.0 and AJAX techniques, a browser-based client can have complex application logic and a rich user interface comparable to those of traditional desktop applications. This paper reports a study of workload construction and characterization for browser-based clients, using the Ajax-based web client of Zimbra (a commercial online messaging and collaboration suite). By comparing workload behaviors across different Zimbra server datasets, different browsers and different client platforms, it presents the characteristics of a real-life web application, which differ significantly from those of existing browser benchmarks in the literature. In addition, the platform-independent and browser-independent design of our workload makes it portable across various clients. Finally, this paper also provides valuable insights into browser internals by analyzing the workload execution, the browser memory footprint and the breakdown of browser sub-modules.
Citations: 3
High-speed network modeling for full system simulation
Pub Date : 2009-10-04 DOI: 10.1109/IISWC.2009.5306799
D. Lugones, Daniel Franco, Dolores Rexachs, J. Moure, E. Luque, Eduardo Argollo, Ayose Falcón, Daniel Ortega, P. Faraboschi
The widespread adoption of cluster computing systems has shifted the modeling focus from synthetic traffic to realistic workloads to better capture the complex interactions between applications and architecture. In this context, a full-system simulation environment also needs to model the networking component, but the simulation duration that is practically affordable is too short to appropriately stress the networking bottlenecks. In this paper, we present a methodology that overcomes this problem and enables the modeling of interconnection networks while ensuring representative results with fast simulation turnaround. We use standard network tools to extract simplified models that are statistically validated and at the same time compatible with a full system simulation environment. We propose three models with different accuracy vs. speed ratios that compute network latency times according to the estimated traffic and measure them on a real-world parallel scientific application.
Citations: 4
On the (dis)similarity of transactional memory workloads
Pub Date : 2009-10-04 DOI: 10.1109/IISWC.2009.5306790
C. Hughes, James Poe, Amer Qouneh, Tao Li
Programming to exploit the resources in a multicore system remains a major obstacle for both computer and software engineers. Transactional memory offers an attractive alternative to traditional concurrent programming but implementations emerged before the programming model, leaving a gap in the design process. In previous research, transactional microbenchmarks have been used to evaluate designs or lock-based multithreaded workloads have been manually converted into their transactional equivalents; others have even created dedicated transactional benchmarks. Yet, throughout all of the investigations, transactional memory researchers have not settled on a way to describe the runtime characteristics that these programs exhibit; nor has there been any attempt to unify the way transactional memory implementations are evaluated. In addition, the similarity (or redundancy) of these workloads is largely unknown. Evaluating transactional memory designs using workloads that exhibit similar characteristics will unnecessarily increase the number of simulations without contributing new insight. On the other hand, arbitrarily choosing a subset of transactional memory workloads for evaluation can miss important features and lead to biased or incorrect conclusions. In this work, we propose a set of architecture-independent transaction-oriented workload characteristics that can accurately capture the behavior of transactional code. We apply principal component analysis and clustering algorithms to analyze the proposed workload characteristics collected from a set of SPLASH-2, STAMP, and PARSEC transactional memory programs. Our results show that using transactional characteristics to cluster the chosen benchmarks can reduce the number of required simulations by almost half. We also show that the methods presented in this paper can be used to identify specific feature subsets. With the increasing number of TM workloads in the future, we believe that the proposed transactional memory workload characterization techniques will help TM architects select a small, diverse set of TM workloads for their design evaluation.
Citations: 11
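The goal stated in the abstract above, choosing a small and diverse subset of workloads from their measured characteristics, can be sketched with a much simpler stand-in than the paper's method. The code below does not perform principal component analysis or clustering; it z-score normalizes each characteristic and then applies greedy farthest-point selection. The feature values, workload names, and function names are hypothetical.

```python
import math
from statistics import mean, pstdev
from typing import List, Sequence


def zscore_columns(rows: Sequence[Sequence[float]]) -> List[List[float]]:
    """Normalize every characteristic (column) to zero mean and unit variance."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # constant columns keep a divisor of 1
    return [[(v - m) / s for v, m, s in zip(row, mus, sds)] for row in rows]


def pick_diverse(names: Sequence[str], rows: Sequence[Sequence[float]], k: int) -> List[str]:
    """Greedily pick up to k workloads that are far apart in feature space."""
    if not rows or k <= 0:
        return []
    k = min(k, len(rows))
    z = zscore_columns(rows)
    chosen = [0]  # start from the first workload (an arbitrary choice)
    while len(chosen) < k:
        # Pick the workload whose nearest already-chosen workload is farthest away.
        best = max(
            (i for i in range(len(z)) if i not in chosen),
            key=lambda i: min(math.dist(z[i], z[j]) for j in chosen),
        )
        chosen.append(best)
    return [names[i] for i in chosen]


names = ["w0", "w1", "w2", "w3"]
features = [[1.0, 10.0], [1.1, 10.5], [9.0, 90.0], [5.0, 50.0]]
print(pick_diverse(names, features, 2))
```

Here `w1` is nearly redundant with `w0`, so it is the last workload the selection would add; simulating both would contribute little new insight, which is the redundancy argument made in the abstract.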
Performance characterization and optimization of mobile augmented reality on handheld platforms
Pub Date : 2009-10-04 DOI: 10.1109/IISWC.2009.5306788
S. Srinivasan, Zhen Fang, R. Iyer, Steven Zhang, Michael Espig, D. Newell, Daniel Cermak, Yi Wu, I. Kozintsev, H. Haussecker
The introduction of low-power general-purpose processors (like the Intel® Atom™ processor) expands the capability of handheld and mobile internet devices (MIDs) to include compelling visual computing applications. One rapidly emerging visual computing usage model is known as mobile augmented reality (MAR). In the MAR usage model, the user is able to point the handheld camera at an object (like a wine bottle) or a set of objects (like an outdoor scene of buildings or monuments) and the device automatically recognizes and displays information regarding the object(s). Achieving this on the handheld requires significant compute processing, resulting in a response time on the order of several seconds. In this paper, we analyze a MAR workload and identify the primary hotspot functions that account for a large fraction of the overall response time. We also present a detailed architectural characterization of the hotspot functions in terms of CPI, MPI, etc. We then implement and analyze the benefits of several software optimizations: (a) vectorization, (b) multi-threading, (c) cache conflict avoidance and (d) miscellaneous code optimizations that reduce the number of computations. We show that a 3X performance improvement in execution time can be achieved by implementing these optimizations. Overall, we believe our analysis provides a detailed understanding of the processing for a new domain of visual computing workloads (i.e., MAR) running on low-power handheld compute platforms.
Citations: 22
Storage characterization for unstructured data in online services applications
Pub Date : 2009-10-04 DOI: 10.1109/IISWC.2009.5306786
S. Sankar, Kushagra Vaid
Mega datacenters hosting large-scale web services have unique workload attributes that need to be taken into account for optimal service scalability. Provisioning compute and storage resources to provide a seamless user experience is challenging since customer traffic loads vary widely across time and geographies, and the servers hosting these applications have to be rightsized to provide performance both within a single server and across a scale-out cluster. Typical user-facing web services have a three-tiered hierarchy: front-end web servers, middle-tier application logic, and a back-end data storage and processing layer. In this paper, we address the challenge of disk subsystem design for back-end servers hosting large amounts of unstructured (also called blob) data. Examples of typical content hosted on such servers include user-generated content such as photos, email messages, videos, and social networking updates. Specific server applications analyzed in this paper correspond to the message store of a large-scale email application, image tile storage for a large-scale geo-mapping application, and user content storage for Web 2.0 type applications. We analyze the storage subsystems for these web services in a live production environment and provide an overview of the disk traffic patterns and access characteristics for each of these applications. We then explore time-series characteristics and derive probabilistic models showing state transitions between locations on the data volumes for these applications. We then explore how these probabilistic models could be extended into a framework for synthetic benchmark generation for such applications. Finally, we discuss how this framework can be used for storage subsystem rightsizing for optimal scalability of such backend storage clusters.
Citations: 17
A communication characterisation of Splash-2 and Parsec
Pub Date : 2009-10-04 DOI: 10.1109/IISWC.2009.5306792
Nick Barrow-Williams, Christian Fensch, S. Moore
Recent benchmark suite releases such as Parsec specifically utilise the tightly coupled cores available in chip-multiprocessors to allow the use of newer, high performance, models of parallelisation. However, these techniques introduce additional irregularity and complexity to data sharing and are entirely dependent on efficient communication performance between processors. This paper thoroughly examines the crucial communication and sharing behaviour of these future applications. The infrastructure used allows both accurate and comprehensive program analysis, employing a full Linux OS running on a simulated 32-core x86 machine. Experiments use full program runs, with communication classified at both core and thread granularities. Migratory, read-only and producer-consumer sharing patterns are observed and their behaviour characterised. The temporal and spatial characteristics of communication are presented for the full collection of Splash-2 and Parsec benchmarks. Our results aim to support the design of future communication systems for CMPs, encompassing coherence protocols, network-on-chip and thread mapping.
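The three sharing patterns named in the abstract can be illustrated with a small sketch. This is not the authors' classifier: the function names, the trace format (a per-cache-line list of `(core_id, op)` pairs), and the simplified rules below are all assumptions chosen for illustration only.

```python
from itertools import groupby


def _is_migratory(accesses):
    """Hypothetical rule: the line moves from core to core, and every
    core writes it at least once during each of its turns."""
    # Collapse the trace into runs of consecutive accesses by the same core.
    runs = [[op for _, op in grp]
            for _, grp in groupby(accesses, key=lambda a: a[0])]
    return len(runs) >= 2 and all("W" in run for run in runs)


def classify_line(accesses):
    """Classify one cache line from its access trace.

    accesses: list of (core_id, op) in time order, op is "R" or "W".
    The categories follow the abstract; the decision rules are assumed.
    """
    readers = {core for core, op in accesses if op == "R"}
    writers = {core for core, op in accesses if op == "W"}
    if len(readers | writers) < 2:
        return "private"            # touched by a single core only
    if not writers:
        return "read-only"          # shared, never written
    if len(writers) == 1 and readers - writers:
        return "producer-consumer"  # one writer, other cores read
    if len(writers) > 1 and _is_migratory(accesses):
        return "migratory"          # read-modify-write passed between cores
    return "other"


if __name__ == "__main__":
    print(classify_line([(0, "R"), (1, "R")]))
    print(classify_line([(0, "W"), (1, "R"), (2, "R")]))
    print(classify_line([(0, "R"), (0, "W"), (1, "R"), (1, "W")]))
```

A real characterisation would also have to handle initialisation writes, false sharing within a line, and thread migration between cores, which this sketch ignores.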
Citations: 194
The importance of accurate task arrival characterization in the design of processing cores
Pub Date : 2009-10-04 DOI: 10.1109/IISWC.2009.5306795
H. H. Najaf-abadi, E. Rotenberg
This paper studies the importance of accounting for a neglected facet of overall workload behavior, the pattern of task arrival. A stochastic characterization is formulated that defines regularity in the task arrival pattern. This characterization is used as the basis for a quantitative evaluation of the importance of accurately accounting for the task arrival behavior in the design of the processing cores of a Chip Multi-processor (CMP).
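The abstract does not say how regularity is defined. As a rough illustration only, one common way to quantify how regular an arrival pattern is uses the coefficient of variation (CV) of the inter-arrival times: 0 for perfectly periodic arrivals, 1 for exponentially distributed gaps (a Poisson process), and above 1 for bursty arrivals. The function name and the choice of CV are assumptions, not the paper's formulation.

```python
import statistics


def interarrival_cv(arrival_times):
    """Coefficient of variation of the gaps between consecutive arrivals.

    arrival_times: sorted sequence of task arrival timestamps.
    This is an assumed regularity measure, not the one from the paper.
    """
    gaps = [later - earlier
            for earlier, later in zip(arrival_times, arrival_times[1:])]
    if len(gaps) < 2:
        raise ValueError("need at least three arrivals")
    mean_gap = statistics.fmean(gaps)
    if mean_gap == 0:
        raise ValueError("all arrivals are simultaneous")
    # Population standard deviation relative to the mean gap.
    return statistics.pstdev(gaps) / mean_gap


if __name__ == "__main__":
    periodic = [0, 1, 2, 3, 4]
    bursty = [0, 0.1, 0.2, 5, 5.1, 5.2]
    print(interarrival_cv(periodic))  # evenly spaced arrivals
    print(interarrival_cv(bursty))    # two tight bursts separated by a gap
```

Two arrival streams with the same mean rate can have very different CV values, which is the kind of difference that a rate-only workload description would hide.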
Citations: 5