2008 IEEE International Symposium on Workload Characterization: Latest Articles

Reproducible simulation of multi-threaded workloads for architecture design exploration

2008 IEEE International Symposium on Workload Characterization

Pub Date : 2008-09-30 DOI: 10.1109/IISWC.2008.4636102

C. Pereira, H. Patil, B. Calder

As multiprocessors become mainstream, techniques to address efficient simulation of multi-threaded workloads are needed. Multi-threaded simulation presents a new challenge: non-determinism across simulations for different architecture configurations. If the execution paths between two simulation runs of the same benchmark with the same input are too different, the simulation results cannot be used to compare the configurations. In this paper we focus on a simulation technique to efficiently collect simulation checkpoints for multi-threaded workloads, and to compare simulation runs in a way that addresses this non-determinism problem. We focus on user-level simulation of multi-threaded workloads for multiprocessor architectures. We present an approach, based on binary instrumentation, to collect checkpoints for simulation. Our checkpoints allow reproducible execution of the samples across different architecture configurations by controlling the sources of non-determinism during simulation. This control results in stalls that would not naturally occur in execution. We propose techniques that allow us to accurately compare performance across architecture configurations in the presence of these stalls.

Citations: 12

Evaluating the impact of dynamic binary translation systems on hardware cache performance

2008 IEEE International Symposium on Workload Characterization

Pub Date : 2008-09-30 DOI: 10.1109/IISWC.2008.4636098

Arkaitz Ruiz-Alvarez, K. Hazelwood

Dynamic binary translation systems (DBTs) enable a wide range of applications such as program instrumentation, optimization, and security. DBTs use a software code cache to store previously translated instructions. The code layout in the code cache greatly differs from the code layout of the original program. This paper provides an exhaustive analysis of the performance of the instruction/trace cache and other structures of the microarchitecture while executing DBTs that focus on program instrumentation, such as DynamoRIO and Pin. We performed our evaluation along two axes. First, we directly accessed the hardware performance counters to determine actual cache miss counts. Second, we used simulation to analyze the spatial locality of the translated application. Our results show that when executing an application under the control of Pin or DynamoRIO, the icache miss counts actually increase by more than 2X. Surprisingly, the L2 cache and the L1 data cache show a much lower performance degradation or even break even with the native application. We also found that overall performance degradations are due to the instructions added by the DBT itself, and that these extra instructions outweigh any possible spatial locality benefits exhibited in the code cache. Our observations held regardless of the trace length, code cache size, or the presence of a hardware trace cache. These results provide a better understanding of the efficiency of current instrumentation tools and their effects on instruction/trace cache performance and other structures of the microarchitecture.

Citations: 14

Wild speculation on consumer workloads in 2010–2020

2008 IEEE International Symposium on Workload Characterization

Pub Date : 2008-09-30 DOI: 10.1109/IISWC.2008.4636084

Tim Sweeney

Summary form only given. Games are among the most performance-intensive consumer applications, and often lead the way in bringing research technologies into practice. This occasionally leads to non-evolutionary leaps in performance and workload characteristics, such as the 1000-fold increase in 3D throughput enabled by consumer graphics accelerators beginning in 1998. The speaker will argue that another revolution in consumer computing performance is on the horizon, driven by large-scale multi-core CPUs with vector-processing extensions inspired by today's graphics processors (GPUs). He will present a view of the key problems and solutions facing consumer software developers in 2010-2020, and speculate on the shape and scale of workloads in that timeframe. The essential questions to cover are: What portions of an application can scale effectively to many cores and vector processors? How and when can concurrency research bring techniques like functional programming, software transactional memory, and vectorization into mainstream practice?

Citations: 0

On the representativeness of embedded Java benchmarks

2008 IEEE International Symposium on Workload Characterization

Pub Date : 2008-09-30 DOI: 10.1109/IISWC.2008.4636100

C. Isen, L. John, Jung-Pil Choi, H. Song

Java has become one of the predominant languages for embedded and mobile platforms due to its architecturally neutral design, portability, and security. But Java execution in the embedded world encompasses Java virtual machines (JVMs) specially tuned for the embedded world, with stripped-down capabilities, and configurations for memory-limited environments. While there have been some studies on desktop and server Java, there have been very few studies on embedded Java. The scarcity of embedded Java benchmarks and the lack of widespread profiling tools and simulators have only exacerbated the problem. While the industry uses some benchmarks such as MorphMark, MIDPMark, and EEMBC Java Grinder Bench, their representativeness in comparison to actual embedded Java applications has not been studied. In order to conduct such a study, we gathered an actual mobile phone application suite and characterized it in detail. We measure several properties of the various applications and benchmarks, perform similarity/dissimilarity analysis, and shed light on the representativeness of current industry-standard embedded benchmarks against actual mobile Java applications. We observed that for many characteristics the applications had a broader range, indicating that the benchmarks underrepresent the range of characteristics in the real world. Furthermore, we find that the applications exhibit less code reuse/hotness compared to the benchmarks. We also draw comparisons of the embedded benchmarks against popular desktop/client Java benchmarks, such as SPECjvm98 and DaCapo. Interestingly, embedded applications spend a significant amount of time in standard library code, on average 65%, suggesting the usefulness of software and hardware techniques that facilitate pre-compilation without the real-time resource overhead of JIT.

Citations: 14

Temporal streams in commercial server applications

2008 IEEE International Symposium on Workload Characterization

Pub Date : 2008-09-30 DOI: 10.1109/IISWC.2008.4636095

T. Wenisch, M. Ferdman, A. Ailamaki, B. Falsafi, Andreas Moshovos

Commercial server applications remain memory bound on modern multiprocessor systems because of their large data footprints, frequent sharing, complex non-strided access patterns, and long chains of dependent misses. To improve memory system performance despite these challenging access patterns, researchers have proposed prefetchers that exploit temporal streams, that is, recurring sequences of memory accesses. Although prior studies show substantial performance improvement from such schemes, they fail to explain why temporal streams arise; that is, they treat commercial applications as a black box and do not identify the specific behaviors that lead to recurring miss sequences. In this paper, we perform an information-theoretic analysis of miss traces from single-chip and multi-chip multiprocessors to identify recurring temporal streams in web serving, online transaction processing, and decision support workloads. Then, using function names embedded in the application binaries and Solaris kernel, we identify the code modules and behaviors that give rise to temporal streams.
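
The notion of a temporal stream, a sequence of miss addresses that recurs later in the trace, can be illustrated with a small sketch. This is not the information-theoretic analysis the abstract describes; it is a simplified fixed-window count, and the function names, the window length, and the sample trace are all assumptions made for illustration.

```python
from collections import Counter


def recurring_streams(miss_trace, length=3):
    """Return every run of `length` consecutive miss addresses seen more than once.

    Hypothetical helper: a fixed-length window is a simplification, since
    real temporal streams vary in length.
    """
    windows = Counter(
        tuple(miss_trace[i:i + length])
        for i in range(len(miss_trace) - length + 1)
    )
    return {seq: count for seq, count in windows.items() if count > 1}


def stream_coverage(miss_trace, length=3):
    """Fraction of misses that fall inside at least one recurring window."""
    if not miss_trace:
        return 0.0
    repeated = recurring_streams(miss_trace, length)
    covered = set()
    for i in range(len(miss_trace) - length + 1):
        if tuple(miss_trace[i:i + length]) in repeated:
            covered.update(range(i, i + length))
    return len(covered) / len(miss_trace)


# Made-up miss addresses: the run A0, B4, C8 appears twice.
trace = [0xA0, 0xB4, 0xC8, 0x10, 0xA0, 0xB4, 0xC8, 0x77]
streams = recurring_streams(trace)
coverage = stream_coverage(trace)
```

A prefetcher that had recorded the first occurrence of the run could, in principle, fetch 0xB4 and 0xC8 as soon as the second miss on 0xA0 is observed, which is the opportunity such coverage figures are meant to quantify.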

Citations: 47

Characterizing and improving the performance of Intel Threading Building Blocks

2008 IEEE International Symposium on Workload Characterization

Pub Date : 2008-09-30 DOI: 10.1109/IISWC.2008.4636091

Gilberto Contreras, M. Martonosi

The Intel Threading Building Blocks (TBB) runtime library is a popular C++ parallelization environment (D. Bolton, 2007) that offers a set of methods and templates for creating parallel applications. Through support of parallel tasks rather than parallel threads, the TBB runtime library offers improved performance scalability by dynamically redistributing parallel tasks across available processors. This not only creates more scalable, portable parallel applications, but also increases programming productivity by allowing programmers to focus their efforts on identifying concurrency rather than worrying about its management. While many applications benefit from dynamic management of parallelism, dynamic management carries parallelization overhead that grows as core counts increase and task sizes decrease. Understanding the sources of these overheads and their implications for application performance can help programmers make more efficient use of available parallelism. Clearly understanding the behavior of these overheads is the first step in creating efficient, scalable parallelization environments targeted at future CMP systems. In this paper we study and characterize some of the overheads of the Intel Threading Building Blocks through the use of real-hardware and simulation performance measurements. Our results show that synchronization overheads within TBB can have a significant and detrimental effect on parallelism performance. Random stealing, while simple and effective at low core counts, becomes less effective as application heterogeneity and core counts increase. Overall, our study provides valuable insights that can be used to create more robust, scalable runtime libraries.
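
Random stealing, as mentioned in the abstract, can be sketched with a toy round-based simulation. This is not TBB's scheduler and uses none of its API; the function name, the round-robin stepping, the choice that all tasks start on worker 0, and the seed are assumptions made for illustration. The owner-takes-newest, thief-takes-oldest policy is a common work-stealing convention rather than something stated in the abstract.

```python
import random
from collections import deque


def run_with_random_stealing(initial_tasks, num_workers, seed=0):
    """Simulate idle workers stealing from a uniformly random victim.

    Returns (tasks executed per worker, number of failed steal attempts).
    Hypothetical model: one action per worker per round, unit-cost tasks.
    """
    rng = random.Random(seed)  # seeded so the simulation is repeatable
    queues = [deque() for _ in range(num_workers)]
    queues[0].extend(initial_tasks)  # assumption: all work starts on worker 0
    executed = [0] * num_workers
    failed_steals = 0

    while any(queues):
        for worker in range(num_workers):
            if queues[worker]:
                queues[worker].pop()  # owner runs its newest task
                executed[worker] += 1
            elif num_workers > 1:
                others = [w for w in range(num_workers) if w != worker]
                victim = rng.choice(others)
                if queues[victim]:
                    # thief takes the victim's oldest task
                    queues[worker].append(queues[victim].popleft())
                else:
                    failed_steals += 1  # wasted attempt: victim had no work
    return executed, failed_steals


executed, failed_steals = run_with_random_stealing(range(64), num_workers=4)
```

Raising `num_workers` while keeping the task count fixed tends to increase the number of failed steal attempts in this toy model, which loosely mirrors the abstract's observation that random victim selection becomes less effective at higher core counts.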

Citations: 127

Empirical examination of a collaborative web application

2008 IEEE International Symposium on Workload Characterization

Pub Date : 2008-09-30 DOI: 10.1109/IISWC.2008.4636094

Christopher Stewart, Matthew Leventi, Kai Shen

Online instructional applications, social networking sites, Wiki-based Web sites, and other emerging Web applications that rely on end users for the generation of Web content are increasingly popular. However, these collaborative Web applications are still absent from the benchmark suites commonly used in the evaluation of online systems. This paper argues that collaborative Web applications are unlike traditional online benchmarks, and therefore warrant a new class of benchmarks. Specifically, request behaviors in collaborative Web applications are determined by contributions from end users, which leads to qualitatively more diverse server-side resource requirements and execution patterns compared to traditional online benchmarks. Our arguments stem from an empirical examination of WeBWorK, a widely used collaborative Web application that allows teachers to post math or physics problems for their students to solve online. Compared to traditional online benchmarks (like TPC-C, SPECweb, and RUBiS), WeBWorK requests are harder to cluster according to their resource consumption, and they follow less regular patterns. Further, we demonstrate that the use of a WeBWorK-style benchmark would probably have led to different results in some recent research studies concerning request classification from event chains and type-based resource usage prediction.

Citations: 24

STAMP: Stanford Transactional Applications for Multi-Processing

2008 IEEE International Symposium on Workload Characterization

Pub Date : 2008-09-30 DOI: 10.1109/IISWC.2008.4636089

C. Minh, Jaewoong Chung, C. Kozyrakis, K. Olukotun

Transactional Memory (TM) is emerging as a promising technology to simplify parallel programming. While several TM systems have been proposed in the research literature, we are still missing the tools and workloads necessary to analyze and compare the proposals. Most TM systems have been evaluated using microbenchmarks, which may not be representative of any real-world behavior, or individual applications, which do not stress a wide range of execution scenarios. We introduce the Stanford Transactional Applications for Multi-Processing (STAMP), a comprehensive benchmark suite for evaluating TM systems. STAMP includes eight applications and thirty variants of input parameters and data sets in order to represent several application domains and cover a wide range of transactional execution cases (frequent or rare use of transactions, large or small transactions, high or low contention, etc.). Moreover, STAMP is portable across many types of TM systems, including hardware, software, and hybrid systems. In this paper, we provide descriptions and a detailed characterization of the applications in STAMP. We also use the suite to evaluate six different TM systems, identify their shortcomings, and motivate further research on their performance characteristics.

Citations: 1002

A workload for evaluating deep packet inspection architectures

2008 IEEE International Symposium on Workload Characterization

Pub Date : 2008-09-30 DOI: 10.1109/IISWC.2008.4636093

M. Becchi, M. Franklin, P. Crowley

High-speed content inspection of network traffic is an important new application area for programmable networking systems, and has recently led to several proposals for high-performance regular expression matching. At the same time, the number and complexity of the patterns present in well-known network intrusion detection systems have been rapidly increasing. This increase is important since both the practicality and the performance of specific pattern matching designs are strictly dependent upon characteristics of the underlying regular expression set. However, a commonly agreed-upon workload for the evaluation of deep packet inspection architectures is still missing, leading to frequent unfair comparisons, and to designs lacking in generality or scalability. In this paper, we propose a workload for the evaluation of regular expression matching architectures. The workload includes a regular expression model and a traffic generator, with the former characterizing different levels of expressiveness within rule-sets and the latter characterizing varying degrees of malicious network activity. The proposed workload is used here to evaluate designs (e.g., different memory layouts and hardware organizations) where the matching algorithm is based on compressed deterministic and nondeterministic finite automata (DFAs and NFAs).
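
DFA-based matching of the kind the abstract refers to can be sketched as a table-driven scan that performs one state lookup per input symbol. The pattern `ab+c`, the hand-built transition table, and the function name are assumptions chosen for illustration; they are not taken from the paper's rule-sets, and the table is uncompressed.

```python
# Hypothetical hand-built DFA for the unanchored pattern `ab+c`.
# State 0: nothing matched; 1: saw "a"; 2: saw "ab+"; 3: match just completed.
# Any symbol missing from a state's row falls back to state 0.
TRANSITIONS = {
    0: {"a": 1},
    1: {"a": 1, "b": 2},
    2: {"a": 1, "b": 2, "c": 3},
    3: {"a": 1},
}
ACCEPTING = {3}


def scan(payload):
    """Return the end offset (exclusive) of every match found in `payload`."""
    state = 0
    match_ends = []
    for offset, symbol in enumerate(payload):
        # Exactly one table lookup per input symbol, regardless of the pattern.
        state = TRANSITIONS[state].get(symbol, 0)
        if state in ACCEPTING:
            match_ends.append(offset + 1)
    return match_ends


match_ends = scan("xxabbbc-abc-ac")
```

The fixed per-symbol cost is what makes DFAs attractive for line-rate inspection, while the size of the transition table for large rule-sets is the usual motivation for compressed representations and for NFAs, the trade-off the abstract's evaluated designs explore.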

Citations: 108

Workload characterization of selected JEE-based Web 2.0 applications

2008 IEEE International Symposium on Workload Characterization

Pub Date : 2008-09-30 DOI: 10.1109/IISWC.2008.4636096

P. Nagpurkar, William P. Horn, U. Gopalakrishnan, Niteesh Dubey, J. Jann, P. Pattnaik

Web 2.0 represents the evolution of the web from a source of information to a platform. Network advances have permitted users to migrate from desktop applications to so-called Rich Internet Applications (RIAs) characterized by thin clients, which are browser-based and store their state on managed servers. Other Web 2.0 technologies have enabled users to more easily participate, collaborate, and share in web-based communities. With the emergence of wikis, blogs, and social networking, users are no longer only consumers; they become contributors to the collective knowledge accessible on the web. In another Web 2.0 development, content aggregation is moving from portal-based technologies to more sophisticated so-called mashups, where aggregation capabilities are greatly expanded. While Web 2.0 has generated a great deal of interest and discussion, there has not been much work on analyzing these emerging workloads. This paper presents a detailed characterization of several applications that exploit Web 2.0 technologies, running on an IBM Power5 system, with the goal of establishing whether the server-side workloads generated by Web 2.0 applications are significantly different from traditional web workloads, and whether they present new challenges to underlying systems. In this paper, we present a detailed characterization of three Web 2.0 workloads, and, for comparison, a synthetic benchmark representing commercial workloads that do not exploit Web 2.0.

Citations: 27