
2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS): latest publications

Reciprocal abstraction for computer architecture co-simulation
Michael Moeng, A. Jones, R. Melhem
Co-simulation of computer architecture elements at different levels of abstraction and fidelity is becoming an increasing necessity for efficient experimentation and research. We propose reciprocal abstraction for computer architecture co-simulation, which allows the integration of simulation methods that operate at different levels of abstraction and fidelity. Further, reciprocal abstraction avoids the need to evaluate individual computer architecture components entirely in a vacuum, which can lead to significant inaccuracies from ignoring the system context. Moreover, it allows an exploration of the impact on the full system of design choices in the detailed component model. We demonstrate the potential inaccuracies of isolated component simulation. Using reciprocal abstraction, we integrate a parallel cycle-level network-on-chip (NoC) component into a detailed but more coarse-grain full-system simulator. We show that co-simulation using reciprocal abstraction of the cycle-level network model reduces packet latency error by 69% on average compared to the more abstract network model. Additionally, since simulating a detailed network at the cycle level can greatly increase simulation time over an abstract model, we implemented a detailed network simulator using a GPU coprocessor. The CPU+GPU combination reduces simulation time for the reciprocal abstraction co-simulation by 16% for a 256-core target machine and 65% for a 512-core target machine.
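The exchange the abstract describes can be sketched roughly as follows: a coarse full-system model queries a cheap abstract NoC latency model, while a detailed cycle-level model periodically replays the observed traffic and feeds a calibrated parameter back. All class names, the linear per-hop latency model, and the numbers are illustrative assumptions, not the paper's actual interface.

```python
# Hedged sketch of a reciprocal-abstraction co-simulation loop.
# AbstractNoC, DetailedNoC, and the per-hop latency model are
# illustrative stand-ins, not the paper's implementation.

class AbstractNoC:
    """Coarse model: fixed per-hop latency, re-tuned by the detailed model."""
    def __init__(self, per_hop_cycles=3.0):
        self.per_hop_cycles = per_hop_cycles

    def latency(self, hops):
        return self.per_hop_cycles * hops

class DetailedNoC:
    """Stand-in for a cycle-level NoC simulator (e.g. GPU-accelerated)."""
    def simulate(self, packets):
        # Pretend the detailed model observes 2.5 cycles per hop
        # once router delay and contention are accounted for.
        return [2.5 * hops for hops in packets]

def cosimulate(traffic_windows, sync_interval=1):
    abstract, detailed = AbstractNoC(), DetailedNoC()
    estimates = []
    for i, window in enumerate(traffic_windows):
        # Fast path: the full-system simulator queries the abstract model.
        estimates.append([abstract.latency(h) for h in window])
        # Reciprocal step: periodically replay traffic through the
        # detailed model and re-fit the abstract model's parameter.
        if i % sync_interval == 0:
            observed = detailed.simulate(window)
            abstract.per_hop_cycles = sum(observed) / sum(window)
    return estimates
```

The first window is estimated with the initial (inaccurate) per-hop cost; after calibration, later windows track the detailed model, which is the error reduction the paper quantifies.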
Citations: 3
Micro-architecture independent branch behavior characterization
S. D. Pestel, Stijn Eyerman, L. Eeckhout
In this paper, we propose linear branch entropy, a new metric for characterizing branch behavior. The metric is independent of the configuration of a specific branch predictor, yet it is highly correlated with the branch miss rate of any predictor. In particular, we show that there is a linear relationship between linear branch entropy and the branch miss rate. This means that the metric can be used to estimate branch miss rates without simulating a branch predictor, by constructing a linear function between entropy and miss rate. The resulting model is more accurate than previously proposed branch classification models, such as taken rate and transition rate. Furthermore, linear branch entropy can be used to analyze the branch behavior of applications independently of specific branch predictor implementations, and the linear branch miss rate function enables comparing branch predictors on how well they perform on easy-to-predict versus hard-to-predict branches. As a case study, we find that the winner of the latest branch predictor competition performs worse on hard-to-predict branches than the third runner-up; however, since the benchmark suite mainly consists of easy branches, a predictor that performs well on easy-to-predict branches has a lower average miss rate.
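A minimal sketch of the idea, assuming the per-branch metric is 2*min(p, 1-p) for taken probability p (a linearized analogue of Shannon entropy) and that the miss rate is then estimated with a fitted linear function a*E + b; the paper's exact definition and fit coefficients may differ.

```python
# Sketch under the stated assumptions; the entropy definition and
# the coefficients a, b are illustrative, not the paper's values.

def linear_branch_entropy(outcomes):
    """Entropy of one static branch from its taken(1)/not-taken(0) history."""
    p = sum(outcomes) / len(outcomes)
    return 2.0 * min(p, 1.0 - p)   # 0 for fully biased, 1 for 50/50

def estimate_miss_rate(entropy, a=0.45, b=0.01):
    """Predictor-specific linear model, fitted once per predictor."""
    return a * entropy + b

biased = [1] * 95 + [0] * 5    # easy: almost always taken -> entropy 0.1
random_ish = [1, 0] * 50       # hard: 50/50 behaviour     -> entropy 1.0
assert linear_branch_entropy(biased) < linear_branch_entropy(random_ish)
```

The point of the metric is that `outcomes` is profiled once per application, while only the two coefficients depend on the predictor being modeled.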
Citations: 5
Analyzing graphics processor unit (GPU) instruction set architectures
Kothiya Mayank, Hongwen Dai, Jizeng Wei, Huiyang Zhou
Because of their high throughput and power efficiency, massively parallel architectures such as graphics processing units (GPUs) have become a popular platform for general-purpose computing. However, there are few studies and analyses of GPU instruction set architectures (ISAs), although it is well known that the ISA is a fundamental design issue for all modern processors, including GPUs.
Citations: 3
Revisiting symbiotic job scheduling
Stijn Eyerman, P. Michaud, W. Rogiest
Symbiotic job scheduling exploits the fact that, in a system with shared resources, the performance of a job is affected by the behavior of its co-running jobs. By co-scheduling combinations of jobs that interfere little with each other, the performance of a system can be increased. In this paper, we investigate the impact of symbiotic job scheduling on throughput. We find that even for a theoretically optimal scheduler this impact is very low, despite the substantial sensitivity of per-job performance to which other jobs are co-scheduled: for example, our experiments on a 4-thread SMT processor show that, on average, job IPC varies by 37% depending on the co-scheduled jobs and per-coschedule throughput varies by 69%, yet the average throughput gain brought by optimal symbiotic scheduling is only 3%. This small margin of improvement can be explained by the observation that all jobs eventually need to be executed, which restricts the job combinations a symbiotic scheduler can select to optimize throughput. We explain why previous work reported a substantial gain from symbiotic job scheduling, and we find that reporting only turnaround time can lead to misleading conclusions. Furthermore, we show how the impact of scheduling can be evaluated in microarchitectural studies without having to implement a scheduler.
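The selection problem an optimal symbiotic scheduler solves can be made concrete with a tiny sketch: given measured combined IPC for every pair of jobs, pick the pairing of four jobs onto a 2-context SMT core that maximizes total throughput. The IPC numbers are made up; note that every job appears in exactly one pair regardless of the choice, which is why the achievable gain is bounded as the abstract argues.

```python
# Hypothetical co-run IPC table: ipc[(a, b)] = combined IPC when
# jobs a and b share the SMT core. Values are illustrative only.
ipc = {
    (0, 1): 2.1, (0, 2): 1.6, (0, 3): 1.9,
    (1, 2): 1.8, (1, 3): 1.7, (2, 3): 2.0,
}

def best_pairing(jobs):
    """Enumerate all perfect matchings of exactly four jobs into two
    coschedules and return the one with the highest total throughput."""
    a = jobs[0]
    pairings = []
    for b in jobs[1:]:
        rest = tuple(j for j in jobs if j not in (a, b))
        pairings.append([(a, b), rest])
    return max(pairings, key=lambda p: sum(ipc[pair] for pair in p))

sched = best_pairing([0, 1, 2, 3])
```

Here the best pairing yields 4.1 total IPC versus 3.3 for the worst, yet all four jobs run either way; averaged over many workloads, that constraint is what shrinks the throughput gain to the few percent the paper reports.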
Citations: 5
DNOC: an accurate and fast virtual channel and deflection routing network-on-chip simulator
G. Oxman, S. Weiss
We present DNOC, a network-on-chip simulator. DNOC simulates custom network topologies with detailed router models; both classic virtual-channel (VC) router models and deflection-routing models are supported. We validate the simulation models against hardware RTL router models. DNOC can generate various statistics, such as network latency and power. We evaluate the simulator in three typical use cases. In stand-alone simulation, synthetic traffic generators offer load to the network. In synchronous co-simulation, the simulator is integrated as a module within a larger system simulator, with synchronization every simulated cycle. In the faster model-based co-simulation mode, a latency model is built and re-tuned periodically at longer time intervals. We demonstrate co-simulation by running applications from the Rodinia and SPLASH-2 benchmark suites on mesh variants. DNOC can also run on multiple x86 cores in parallel, speeding up the simulation of large networks.
Citations: 2
Micro-architecture independent analytical processor performance and power modeling
S. V. D. Steen, S. D. Pestel, Moncef Mechri, Stijn Eyerman, Trevor E. Carlson, D. Black-Schaffer, Erik Hagersten, L. Eeckhout
Optimizing processors for specific applications can substantially improve energy efficiency. With the end of Dennard scaling, and the corresponding reduction in energy-efficiency gains from technology scaling, such approaches may become increasingly important. However, designing application-specific processors requires fast design-space exploration tools to optimize for the targeted applications. Analytical models are a good fit for such design-space exploration, as they provide fast performance estimates and insight into the interaction between an application's characteristics and the micro-architecture of a processor. Unfortunately, current analytical models require some micro-architecture dependent inputs, such as cache miss rates, branch miss rates, and memory-level parallelism. This requires profiling the applications for each cache and branch predictor configuration, which is far more time-consuming than evaluating the actual performance models. In this work we present a micro-architecture independent profiler and associated analytical models that produce performance and power estimates across a large design space almost instantaneously. We show that using a micro-architecture independent profile leads to a speedup of 25× for our evaluated design space, compared to an analytical model that uses micro-architecture dependent profiles. Over a large design space, the model has a 13% error for performance and a 7% error for power, compared to cycle-level simulation. The model accurately determines the optimal processor configuration for different applications under power or performance constraints, and it provides insight into performance through cycle stacks.
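One standard way a single micro-architecture independent profile can feed many design points, sketched below: a reuse-distance (stack-distance) histogram is collected once, and the miss rate of any fully associative LRU cache size follows from it without re-profiling; an interval-style CPI formula then turns that into a performance estimate. The formula, penalties, and histogram here are illustrative assumptions, not the paper's actual model.

```python
# Hedged sketch: one profiling pass, many cache designs evaluated.

def miss_rate(reuse_hist, cache_lines):
    """reuse_hist maps stack distance -> access count. In a fully
    associative LRU cache of `cache_lines` lines, a reference misses
    iff its stack distance is >= cache_lines."""
    total = sum(reuse_hist.values())
    misses = sum(n for d, n in reuse_hist.items() if d >= cache_lines)
    return misses / total

def cpi(reuse_hist, cache_lines, base_cpi=1.0,
        mem_accesses_per_insn=0.3, miss_penalty=100):
    """Illustrative interval-style model: base CPI plus memory stalls."""
    return base_cpi + (mem_accesses_per_insn
                       * miss_rate(reuse_hist, cache_lines)
                       * miss_penalty)

hist = {1: 60, 8: 20, 100: 15, 10_000: 5}   # profiled once
small = cpi(hist, cache_lines=64)    # evaluate one candidate design...
large = cpi(hist, cache_lines=4096)  # ...and another, from the same profile
```

Evaluating a new cache size is a histogram sum rather than a fresh simulation, which is where the reported 25× exploration speedup comes from in spirit.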
Citations: 40
Emulating cache organizations on real hardware using performance cloning
Yipeng Wang, Yan Solihin
Computer system designers need a deep understanding of end users' workloads in order to arrive at an optimal design. Unfortunately, many end users will not share their software with designers because of its proprietary or confidential nature. Researchers have proposed workload cloning: extracting statistics that summarize the behavior of a user's workload through profiling, and then using them to drive the generation of a representative synthetic workload (clone). Clones can be used in place of the original workloads to evaluate computer system performance, helping designers understand the behavior of user workloads on simulated machine models without users having to disclose proprietary or sensitive information about the original workload. In this paper, we propose infusing environment-specific information into the clone. This Environment-Specific Clone (ESC) enables the simulation of hypothetical cache configurations directly on a machine with a different cache configuration. We validate ESC on both real systems and cache simulations. Furthermore, we present a case study of how page mapping affects cache performance. ESC enables such a study at native machine speed by infusing the page mapping information into clones, without needing to modify the OS or hardware. We then analyze the factors that determine how page mapping impacts cache performance, and how various applications are affected differently.
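The profile-then-synthesize flow can be sketched as follows, with a deliberately simple statistic (an address-stride histogram) standing in for the richer profiles the paper uses; the function names and the statistic itself are illustrative, and ESC would additionally fold in environment-specific information such as page mappings.

```python
# Hedged sketch of workload cloning: only the histogram is shared,
# never the original trace.
import random
from collections import Counter

def profile(addresses):
    """Summarise a proprietary trace as a stride histogram."""
    strides = [b - a for a, b in zip(addresses, addresses[1:])]
    return Counter(strides)

def generate_clone(stride_hist, length, start=0, seed=0):
    """Synthesise a trace whose stride distribution matches the profile."""
    rng = random.Random(seed)
    strides, weights = zip(*stride_hist.items())
    addr, clone = start, [start]
    for _ in range(length - 1):
        addr += rng.choices(strides, weights=weights)[0]
        clone.append(addr)
    return clone

original = list(range(0, 4096, 64))   # a sequential 64-byte-stride trace
clone = generate_clone(profile(original), length=len(original))
assert profile(clone) == profile(original)   # statistics are preserved
```

The designer receives only `profile(original)` and regenerates a behaviorally similar trace, which is the disclosure-avoiding property the abstract highlights.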
Citations: 2
ARACompiler: a prototyping flow and evaluation framework for accelerator-rich architectures
Yu-Ting Chen, J. Cong, Bingjun Xiao
Accelerator-rich architectures (ARAs) provide energy-efficient solutions for domain-specific computing in the age of dark silicon. However, due to the complex interaction between general-purpose cores, accelerators, customized on-chip interconnects, customized memory systems, and operating systems, it has been difficult to obtain detailed and accurate evaluations and analyses of ARAs on complex real-life benchmarks using existing full-system simulators. In this paper we develop ARACompiler, a highly automated design flow for prototyping ARAs and performing evaluation on FPGAs. An efficient system software stack is generated automatically to handle resource management and TLB misses. We further provide application programming interfaces (APIs) for users to develop their applications using accelerators. The flow provides evaluation-time savings of 2.9x to 42.6x over full-system simulation.
Citations: 6
Graph Processing Platforms at Scale: Practices and Experiences
Seung-Hwan Lim, S. Lee, Gautam Ganesh, Tyler C. Brown, S. Sukumar
Graph analysis has revealed patterns and relationships hidden in data from a variety of domains, such as transportation networks, social networks, clinical pathways, and collaboration networks. As these networks grow in size, variety, and complexity, it is a challenge to find the right combination of tools and implementations of algorithms to discover new insights from the data. Addressing this challenge, our study presents an extensive empirical evaluation of three representative graph processing platforms: Pegasus, GraphX, and Urika. Each system represents a combination of options in data model, processing paradigm, and infrastructure. We benchmark each platform using three popular graph mining operations (degree distribution, connected components, and PageRank) over real-world graphs. Our experiments show that each graph processing platform has particular strengths for different types of graph operations. While Urika performs best on non-iterative operations such as degree distribution, GraphX outperforms it on iterative operations such as connected components and PageRank. We conclude by discussing options for optimizing the performance of graph-theoretic operations on each platform for large-scale real-world graphs.
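To make the three benchmark workloads concrete, here is each one sketched in plain Python on a tiny undirected graph; the platforms in the study run these same operations at far larger scale, and the damping factor 0.85 is the conventional PageRank choice, not necessarily the study's setting.

```python
# The study's three graph mining operations on a toy undirected graph:
# a triangle {0,1,2} plus a disconnected edge {3,4}.
from collections import Counter, defaultdict

edges = [(0, 1), (1, 2), (2, 0), (3, 4)]
adj = defaultdict(set)
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)
nodes = sorted(adj)

# 1. Degree distribution: degree -> number of nodes with that degree.
degree_dist = Counter(len(adj[n]) for n in nodes)

# 2. Connected components via label propagation to the minimum label.
label = {n: n for n in nodes}
changed = True
while changed:
    changed = False
    for n in nodes:
        best = min([label[n]] + [label[m] for m in adj[n]])
        if best < label[n]:
            label[n], changed = best, True

# 3. PageRank with damping 0.85 (undirected: links count both ways).
rank = {n: 1 / len(nodes) for n in nodes}
for _ in range(50):
    rank = {n: 0.15 / len(nodes)
               + 0.85 * sum(rank[m] / len(adj[m]) for m in adj[n])
            for n in nodes}
```

Degree distribution is a single pass (non-iterative), while connected components and PageRank converge over repeated sweeps, which is exactly the iterative/non-iterative split along which the study found the platforms' strengths to differ.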
图形分析揭示了隐藏在各种领域数据中的模式和关系,如交通网络、社会网络、临床途径和协作网络。随着这些网络在规模、多样性和复杂性方面的增长,找到工具和算法的正确组合以从数据中发现新的见解是一个挑战。为了应对这一挑战,我们的研究对三个代表性的图形处理平台:Pegasus、GraphX和Urika进行了广泛的实证评估。每个系统都代表了数据模型、处理范例和基础设施中的选项组合。我们使用三种流行的图挖掘操作,度分布,连接组件和真实世界图的PageRank对每个平台进行基准测试。我们的实验表明,对于不同类型的图操作,每个图处理平台都具有特定的强度。虽然Urika在度分布等非迭代图操作中表现最好,但GraphX优于连接组件和PageRank等迭代操作。我们通过讨论在每个平台上优化大规模真实世界图的图论操作性能的选项来结束本文。
{"title":"Graph Processing Platforms at Scale: Practices and Experiences","authors":"Seung-Hwan Lim, S. Lee, Gautam Ganesh, Tyler C. Brown, S. Sukumar","doi":"10.1109/ISPASS.2015.7095783","DOIUrl":"https://doi.org/10.1109/ISPASS.2015.7095783","url":null,"abstract":"Graph analysis has revealed patterns and relationships hidden in data from a variety of domains such as transportation networks, social networks, clinical pathways, and collaboration networks. As these networks grow in size, variety and complexity, it is a challenge to find the right combination of tools and implementation of algorithms to discover new insights from the data. Addressing this challenge, our study presents an extensive empirical evaluation of three representative graph processing platforms: Pegasus, GraphX, and Urika. Each system represents a combination of options in data model, processing paradigm, and infrastructure. We benchmark each platform using three popular graph mining operations, degree distribution, connected components, and PageRank over real-world graphs. Our experiments show that each graph processing platform owns a particular strength for different types of graph operations. While Urika performs the best in non-iterative graph operations like degree distribution, GraphX outperforms iterative operations like connected components and PageRank. We conclude this paper by discussing options to optimize the performance of a graph-theoretic operation on each platform for large-scale real world graphs.","PeriodicalId":189378,"journal":{"name":"2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"55 3","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-03-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"113972443","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 14
Performance and energy evaluation of data prefetching on intel Xeon Phi
D. Guttman, M. Kandemir, Meenakshi Arunachalam, V. Calina
There is an urgent need to evaluate the existing parallelism and data locality-oriented techniques on emerging manycore machines using multithreaded applications. Data prefetching is a well-known latency hiding technique that comes with various hardware- and software-based implementations in almost all commercial machines. A well-tuned prefetcher can reduce the observed data access latencies significantly by bringing the soonto- be-requested data into the cache ahead of time, eventually improving application execution time. Motivated by this, we present in this paper a detailed performance and power characterization of software (compiler-guided) and hardware data prefetching on an Intel Xeon Phi-based system. Our main contributions are (i) an analysis of the interactions between hardware and software prefetching, showing how hardware prefetching can throttle itself in response to software; (ii) results on the power and energy behavior of prefetching, showing how performance and energy gains outweigh the increased power cost of prefetching; and (iii) an evaluation of the use of intrinsic prefetch instructions to prefetch for applications with difficult-to-detect access patterns.
{"title":"Performance and energy evaluation of data prefetching on intel Xeon Phi","authors":"D. Guttman, M. Kandemir, Meenakshi Arunachalam, V. Calina","doi":"10.1109/ISPASS.2015.7095814","DOIUrl":"https://doi.org/10.1109/ISPASS.2015.7095814","url":null,"abstract":"There is an urgent need to evaluate the existing parallelism and data locality-oriented techniques on emerging manycore machines using multithreaded applications. Data prefetching is a well-known latency hiding technique that comes with various hardware- and software-based implementations in almost all commercial machines. A well-tuned prefetcher can reduce the observed data access latencies significantly by bringing the soonto- be-requested data into the cache ahead of time, eventually improving application execution time. Motivated by this, we present in this paper a detailed performance and power characterization of software (compiler-guided) and hardware data prefetching on an Intel Xeon Phi-based system. Our main contributions are (i) an analysis of the interactions between hardware and software prefetching, showing how hardware prefetching can throttle itself in response to software; (ii) results on the power and energy behavior of prefetching, showing how performance and energy gains outweigh the increased power cost of prefetching; and (iii) an evaluation of the use of intrinsic prefetch instructions to prefetch for applications with difficult-to-detect access patterns.","PeriodicalId":189378,"journal":{"name":"2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"59 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-03-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124386456","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 6
2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)