Reciprocal abstraction for computer architecture co-simulation
Pub Date: 2015-03-29 | DOI: 10.1109/ISPASS.2015.7095812
Michael Moeng, A. Jones, R. Melhem
Co-simulation of computer architecture elements at different levels of abstraction and fidelity is becoming increasingly necessary for efficient experimentation and research. We propose reciprocal abstraction for computer architecture co-simulation, which allows the integration of simulation methods that operate at different levels of abstraction and fidelity. Further, reciprocal abstraction avoids the need to conduct detailed evaluations of individual computer architecture components entirely in a vacuum, which can lead to significant inaccuracies from ignoring the system context. Moreover, it allows an exploration of the impact on the full system resulting from design choices in the detailed component model. We demonstrate the potential inaccuracies of isolated component simulation. Using reciprocal abstraction, we integrate a parallel cycle-level network-on-chip (NoC) component into a detailed but more coarse-grain full-system simulator. We show that co-simulation using reciprocal abstraction of the cycle-level network model reduces packet latency error compared to the more abstract network model by 69% on average. Additionally, as simulating a detailed network at the cycle level can greatly increase simulation time over an abstract model, we implemented the detailed network simulator using a GPU coprocessor. The CPU+GPU implementation reduces simulation time for the reciprocal abstraction co-simulation by 16% for a 256-core target machine and 65% for a 512-core target machine.
{"title":"Reciprocal abstraction for computer architecture co-simulation","authors":"Michael Moeng, A. Jones, R. Melhem","doi":"10.1109/ISPASS.2015.7095812","DOIUrl":"https://doi.org/10.1109/ISPASS.2015.7095812","url":null,"abstract":"Co-simulation of computer architecture elements at different levels of abstraction and fidelity is becoming an increasing necessity for efficient experimentation and research. We propose reciprocal abstraction for computer architecture cosimulation, which allows the integration of simulation methods that utilize different levels of abstraction and fidelity of simulation. Further, reciprocal abstraction avoids the need to conduct detailed evaluations of individual computer architecture components entirely in a vacuum, which can lead to significant inaccuracies from ignoring the system context. Moreover, it allows an exploration of the impact on the full system resulting from design choices in the detailed component model. We demonstrate the potential inaccuracies of isolated component simulation. Using reciprocal abstraction, we integrate a parallel cycle-level networkon- chip (NoC) component into a detailed but more coarse-grain full system simulator.We show that co-simulation using reciprocal abstraction of the cycle-level network model reduces packet latency error compared to the more abstract network model by 69% on average. Additionally, as simulating a detailed network at the cycle-level can greatly increase simulation time over an abstract model, we implemented detailed network simulator using a GPU coprocessor. The CPU+GPU can reduce simulation time for the reciprocal abstraction co-simulation by 16% for a 256-core target machine and 65% for a 512-core target machine.","PeriodicalId":189378,"journal":{"name":"2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"140 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-03-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127273535","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Micro-architecture independent branch behavior characterization
Pub Date: 2015-03-29 | DOI: 10.1109/ISPASS.2015.7095792
S. D. Pestel, Stijn Eyerman, L. Eeckhout
In this paper, we propose linear branch entropy, a new metric for characterizing branch behavior. The metric is independent of the configuration of a specific branch predictor, yet it is highly correlated with the branch miss rate of any predictor. In particular, we show that there is a linear relationship between linear branch entropy and the branch miss rate. This means that, by constructing a linear function between entropy and miss rate, the metric can be used to estimate branch miss rates without simulating a branch predictor. The resulting model is more accurate than previously proposed branch classification models, such as taken rate and transition rate. Furthermore, linear branch entropy can be used to analyze the branch behavior of applications, independent of specific branch predictor implementations, and the linear branch miss rate function enables comparing branch predictors on how well they perform on easy-to-predict versus hard-to-predict branches. As a case study, we find that the winner of the latest branch predictor competition performs worse on hard-to-predict branches than the third runner-up; however, since the benchmark suite mainly consists of easy-to-predict branches, a predictor that performs well on easy-to-predict branches has a lower average miss rate.
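The abstract does not reproduce the exact definition of linear branch entropy, so the sketch below only illustrates the workflow it implies: compute a micro-architecture-independent entropy from profile counts (here assumed to be 2·min(p, 1−p) per branch, weighted by execution frequency, which may differ from the paper's formulation), then fit a linear function from entropy to miss rate once per predictor and reuse it to estimate miss rates for new workloads.

```python
# Minimal sketch: a linear-entropy-style branch metric plus a linear
# miss-rate fit. The entropy definition (2*min(p, 1-p) per branch,
# weighted by execution count) and all numbers are illustrative
# assumptions, not the paper's exact formulation or data.
import numpy as np

def branch_entropy(branch_stats):
    """branch_stats: list of (exec_count, taken_count) per static branch."""
    total = sum(n for n, _ in branch_stats)
    h = 0.0
    for n, taken in branch_stats:
        p = taken / n
        h += (n / total) * 2.0 * min(p, 1.0 - p)   # 0 = fully biased, 1 = random
    return h

# Hypothetical per-benchmark profile data: entropy and measured miss rate
# (misses per kilo-instruction) for some predictor.
entropies  = np.array([0.02, 0.10, 0.25, 0.40, 0.55])
miss_rates = np.array([0.5,  2.1,  5.0,  8.2, 11.0])

# Fit miss_rate ~ a * entropy + b once per predictor; afterwards the miss
# rate of a new workload can be estimated from its entropy alone.
a, b = np.polyfit(entropies, miss_rates, 1)
print(f"estimated MPKI at H=0.3: {a * 0.3 + b:.2f}")
```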
{"title":"Micro-architecture independent branch behavior characterization","authors":"S. D. Pestel, Stijn Eyerman, L. Eeckhout","doi":"10.1109/ISPASS.2015.7095792","DOIUrl":"https://doi.org/10.1109/ISPASS.2015.7095792","url":null,"abstract":"In this paper, we propose linear branch entropy, a new metric for characterizing branch behavior. The metric is independent of the configuration of a specific branch predictor, but it is highly correlated with the branch miss rate of any predictor. In particular, we show that there is a linear relationship between linear branch entropy and the branch miss rate. This means that the metric can be used to estimate branch miss rates without simulating a branch predictor by constructing a linear function between entropy and miss rate. The resulting model is more accurate than previously proposed branch classification models, such as taken rate and transition rate. Furthermore, linear branch entropy can be used to analyze the branch behavior of applications, independent of specific branch predictor implementations, and the linear branch miss rate function enables comparing branch predictors on how well they perform on easy-to-predict versus hard-topredict branches. As a case study, we find that the winner of the latest branch predictor competition performs worse on hardto- predict branches, compared to the third runner-up; however, since the benchmark suite mainly consisted of easy branches, a predictor that performs well on easy-to-predict branches has a lower average miss rate.","PeriodicalId":189378,"journal":{"name":"2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-03-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132478749","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Analyzing graphics processor unit (GPU) instruction set architectures
Pub Date: 2015-03-29 | DOI: 10.1109/ISPASS.2015.7095794
Kothiya Mayank, Hongwen Dai, Jizeng Wei, Huiyang Zhou
Because of their high throughput and power efficiency, massively parallel architectures like graphics processing units (GPUs) have become a popular platform for general-purpose computing. However, there are few studies and analyses of GPU instruction set architectures (ISAs), although it is well known that the ISA is a fundamental design issue of all modern processors, including GPUs.
{"title":"Analyzing graphics processor unit (GPU) instruction set architectures","authors":"Kothiya Mayank, Hongwen Dai, Jizeng Wei, Huiyang Zhou","doi":"10.1109/ISPASS.2015.7095794","DOIUrl":"https://doi.org/10.1109/ISPASS.2015.7095794","url":null,"abstract":"Because of their high throughput and power efficiency, massively parallel architectures like graphics processing units (GPUs) become a popular platform for generous purpose computing. However, there are few studies and analyses on GPU instruction set architectures (ISAs) although it is wellknown that the ISA is a fundamental design issue of all modern processors including GPUs.","PeriodicalId":189378,"journal":{"name":"2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-03-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134590156","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Revisiting symbiotic job scheduling
Pub Date: 2015-03-29 | DOI: 10.1109/ISPASS.2015.7095791
Stijn Eyerman, P. Michaud, W. Rogiest
Symbiotic job scheduling exploits the fact that in a system with shared resources, the performance of a job is affected by the behavior of the other co-running jobs. By coscheduling combinations of jobs that have low interference, the performance of a system can be increased. In this paper, we investigate the impact of using symbiotic job scheduling to increase throughput. We find that even for a theoretically optimal scheduler, this impact is very low, despite the substantial sensitivity of per-job performance to which other jobs are coscheduled: for example, our experiments on a 4-thread SMT processor show that, on average, job IPC varies by 37% depending on the coscheduled jobs and per-coschedule throughput varies by 69%, yet the average throughput gain brought by optimal symbiotic scheduling is only 3%. This small margin of improvement can be explained by the observation that all jobs eventually need to be executed, which restricts the job combinations a symbiotic job scheduler can select to optimize throughput. We explain why previous work reported a substantial gain from symbiotic job scheduling, and we find that reporting only turnaround time can lead to misleading conclusions. Furthermore, we show how the impact of scheduling can be evaluated in microarchitectural studies without having to implement a scheduler.
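As a hedged illustration of the constraint the authors point to (every job must eventually run, so the scheduler can only choose how to partition the job set into coschedules), the sketch below brute-forces the best partition of a small job set into SMT-width groups. The per-coschedule throughput function is a made-up placeholder, not the paper's model.

```python
# Sketch of "optimal" symbiotic scheduling under the all-jobs-must-run
# constraint: enumerate partitions of the job set into coschedules of SMT
# width k and keep the partition with the highest total throughput.
from itertools import combinations

def partitions(jobs, k):
    """Yield all partitions of `jobs` into groups of size k (len divisible by k)."""
    if not jobs:
        yield []
        return
    first = jobs[0]
    for rest in combinations(jobs[1:], k - 1):
        group = (first,) + rest
        remaining = [j for j in jobs if j not in group]
        for tail in partitions(remaining, k):
            yield [group] + tail

def coschedule_throughput(group):
    # Placeholder model: summed solo IPC, degraded by the combined resource
    # "pressure" of the co-running jobs. Purely illustrative.
    return sum(ipc for ipc, _ in group) * (1.0 - 0.05 * sum(p for _, p in group))

jobs = [(1.2, 1), (0.8, 3), (1.5, 2), (0.9, 1), (1.1, 2), (0.7, 3)]  # (solo IPC, pressure)
best = max(partitions(jobs, 2), key=lambda p: sum(coschedule_throughput(g) for g in p))
print(best, sum(coschedule_throughput(g) for g in best))
```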
{"title":"Revisiting symbiotic job scheduling","authors":"Stijn Eyerman, P. Michaud, W. Rogiest","doi":"10.1109/ISPASS.2015.7095791","DOIUrl":"https://doi.org/10.1109/ISPASS.2015.7095791","url":null,"abstract":"Symbiotic job scheduling exploits the fact that in a system with shared resources, the performance of jobs is impacted by the behavior of other co-running jobs. By coscheduling combinations of jobs that have low interference, the performance of a system can be increased. In this paper, we investigate the impact of using symbiotic job scheduling for increasing throughput. We find that even for a theoretically optimal scheduler, this impact is very low, despite the substantial sensitivity of per job performance to which other jobs are coscheduled: for example, our experiments on a 4-thread SMT processor show that, on average, the job IPC varies by 37% depending on coscheduled jobs, the per-coschedule throughput varies by 69%, and yet the average throughput gain brought by optimal symbiotic scheduling is only 3%. This small margin of improvement can be explained by the observation that all the jobs need to be eventually executed, restricting the job combinations a symbiotic job scheduler can select to optimize throughput. We explain why previous work reported a substantial gain from symbiotic job scheduling, and we find that (only) reporting turnaround time can lead to misleading conclusions. Furthermore, we show how the impact of scheduling can be evaluated in microarchitectural studies, without having to implement a scheduler.","PeriodicalId":189378,"journal":{"name":"2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-03-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132953841","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DNOC: an accurate and fast virtual channel and deflection routing network-on-chip simulator
Pub Date: 2015-03-29 | DOI: 10.1109/ISPASS.2015.7095805
G. Oxman, S. Weiss
We present DNOC, a network-on-chip simulator. DNOC simulates custom network topologies with detailed router models; both classic virtual channel (VC) based router models and deflection routing models are supported. We validate the simulation models against hardware RTL router models. DNOC can generate various statistics, such as network latency and power. We evaluate the simulator in three typical use cases. In stand-alone simulation, synthetic traffic generators are used to offer load to the network. In synchronous co-simulation, the simulator is integrated as a module within a larger system simulator, with synchronization every simulated cycle. In the faster model-based co-simulation mode, a latency model is built and re-tuned periodically at longer time intervals. We demonstrate co-simulation by running applications from the Rodinia and SPLASH-2 benchmark suites on mesh variants. DNOC can also run on multiple x86 cores in parallel, speeding up the simulation of large networks.
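The sketch below illustrates the model-based co-simulation mode described above under assumed interfaces: the fast path answers latency queries from a cheap parametric model, and that model is periodically re-fit from traffic run through a stand-in for the detailed cycle-level simulator.

```python
# Sketch of periodic latency-model re-tuning in model-based co-simulation.
# `detailed_simulate` is a stand-in for the cycle-level router model; the
# coefficients and traffic are illustrative, not DNOC's actual interfaces.
import random
import numpy as np

def detailed_simulate(hops, load):
    # Stand-in for the cycle-accurate model: per-hop latency plus
    # load-dependent queueing delay, with some noise.
    return 3.0 * hops + 40.0 * load * load + random.uniform(0.0, 2.0)

class LatencyModel:
    """Cheap parametric model used on the full-system simulator's fast path."""
    def __init__(self):
        self.per_hop, self.queue_coeff = 3.0, 0.0

    def estimate(self, hops, load):
        return self.per_hop * hops + self.queue_coeff * load * load

    def retune(self, samples):
        # Least-squares refit of both coefficients from recent packets.
        A = np.array([[h, l * l] for h, l, _ in samples])
        y = np.array([lat for _, _, lat in samples])
        self.per_hop, self.queue_coeff = np.linalg.lstsq(A, y, rcond=None)[0]

model, trace, errors, RETUNE_INTERVAL = LatencyModel(), [], [], 1000
for cycle in range(10000):
    hops, load = random.randint(1, 8), random.random()
    predicted = model.estimate(hops, load)          # fast path used every packet
    actual = detailed_simulate(hops, load)          # detailed model, sampled here
    errors.append(abs(predicted - actual))
    trace.append((hops, load, actual))
    if cycle % RETUNE_INTERVAL == RETUNE_INTERVAL - 1:
        model.retune(trace[-RETUNE_INTERVAL:])      # periodic re-tuning step

print(f"mean |error| first interval: {np.mean(errors[:1000]):.2f}, "
      f"last interval: {np.mean(errors[-1000:]):.2f}")
```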
{"title":"DNOC: an accurate and fast virtual channel and deflection routing network-on-chip simulator","authors":"G. Oxman, S. Weiss","doi":"10.1109/ISPASS.2015.7095805","DOIUrl":"https://doi.org/10.1109/ISPASS.2015.7095805","url":null,"abstract":"We present DNOC, a network-on-chip simulator. DNOC simulates custom network topologies with detailed router models. Both classic virtual channel (VC) based router models and deflection routing models are supported. We validate the simulation models against hardware RTL router models. DNOC can generate various statistics, such as network latency and power. We evaluate the simulator in three typical use cases. In stand-alone simulation, synthetic traffic generators are used to offer load to the network. In synchronous co-simulation, the simulator is integrated as a module within a larger system simulator with synchronization every simulated cycle. In the faster model based co-simulation mode, a latency model is built, and re-tuned periodically at longer time intervals. We demonstrate co-simulation by running applications from the Rodinia and SPLASH-2 benchmark sets on mesh variants. DNOC is also able to run on multiple x86 cores in parallel, speeding up the simulation of large networks.","PeriodicalId":189378,"journal":{"name":"2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"212 4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-03-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127406953","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Micro-architecture independent analytical processor performance and power modeling
Pub Date: 2015-03-29 | DOI: 10.1109/ISPASS.2015.7095782
S. V. D. Steen, S. D. Pestel, Moncef Mechri, Stijn Eyerman, Trevor E. Carlson, D. Black-Schaffer, Erik Hagersten, L. Eeckhout
Optimizing processors for a specific application or set of applications can substantially improve energy efficiency. With the end of Dennard scaling, and the corresponding reduction in energy-efficiency gains from technology scaling, such approaches may become increasingly important. However, designing application-specific processors requires fast design space exploration tools to optimize for the targeted applications. Analytical models can be a good fit for such design space exploration as they provide fast performance estimates and insight into the interaction between an application's characteristics and the micro-architecture of a processor. Unfortunately, current analytical models require some micro-architecture-dependent inputs, such as cache miss rates, branch miss rates and memory-level parallelism. This requires profiling the applications for each cache and branch predictor configuration, which is far more time-consuming than evaluating the actual performance models. In this work we present a micro-architecture-independent profiler and associated analytical models that allow us to produce performance and power estimates across a large design space almost instantaneously. We show that using a micro-architecture-independent profile leads to a speedup of 25× for our evaluated design space, compared to an analytical model that uses micro-architecture-dependent profiles. Over a large design space, the model has a 13% error for performance and a 7% error for power, compared to cycle-level simulation. The model is able to accurately determine the optimal processor configuration for different applications under power or performance constraints, and it can provide insight into performance through cycle stacks.
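To make the modeling idea concrete, here is a deliberately simplified, interval-style sketch of how profile statistics can be turned into performance and power estimates without simulation; the constants, inputs, and model structure are illustrative assumptions, not the paper's actual model.

```python
# Interval-style analytical estimate: CPI = base CPI + miss-event penalties,
# power = dynamic + static. All inputs below are illustrative assumptions.

def estimate_cpi(base_cpi, bpki, branch_penalty, lmpki, mem_latency, mlp):
    """bpki / lmpki: branch / last-level-cache misses per kilo-instruction."""
    branch_component = (bpki / 1000.0) * branch_penalty
    memory_component = (lmpki / 1000.0) * mem_latency / mlp  # misses overlapped via MLP
    return base_cpi + branch_component + memory_component

def estimate_power(freq_ghz, dynamic_cap, vdd, static_w):
    # Classic P = C * V^2 * f + static, evaluated per candidate configuration.
    return dynamic_cap * vdd * vdd * freq_ghz * 1e9 + static_w

cpi = estimate_cpi(base_cpi=0.6, bpki=4.0, branch_penalty=15,
                   lmpki=2.0, mem_latency=200, mlp=2.5)
watts = estimate_power(freq_ghz=2.5, dynamic_cap=1.2e-9, vdd=0.9, static_w=4.0)
print(f"CPI ~ {cpi:.2f}, power ~ {watts:.1f} W, "
      f"energy/instr ~ {watts * cpi / 2.5e9 * 1e9:.2f} nJ")
```

Sweeping such a model over cache sizes, branch predictors, and frequencies is what makes near-instant design space exploration possible, provided the miss-rate inputs come from a micro-architecture-independent profile rather than per-configuration simulation.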
{"title":"Micro-architecture independent analytical processor performance and power modeling","authors":"S. V. D. Steen, S. D. Pestel, Moncef Mechri, Stijn Eyerman, Trevor E. Carlson, D. Black-Schaffer, Erik Hagersten, L. Eeckhout","doi":"10.1109/ISPASS.2015.7095782","DOIUrl":"https://doi.org/10.1109/ISPASS.2015.7095782","url":null,"abstract":"Optimizing processors for specific application(s) can substantially improve energy-efficiency. With the end of Dennard scaling, and the corresponding reduction in energyefficiency gains from technology scaling, such approaches may become increasingly important. However, designing applicationspecific processors require fast design space exploration tools to optimize for the targeted application(s). Analytical models can be a good fit for such design space exploration as they provide fast performance estimations and insight into the interaction between an application's characteristics and the micro-architecture of a processor. Unfortunately, current analytical models require some microarchitecture dependent inputs, such as cache miss rates, branch miss rates and memory-level parallelism. This requires profiling the applications for each cache and branch predictor configuration, which is far more time-consuming than evaluating the actual performance models. In this work we present a micro-architecture independent profiler and associated analytical models that allow us to produce performance and power estimates across a large design space almost instantaneously. We show that using a micro-architecture independent profile leads to a speedup of 25× for our evaluated design space, compared to an analytical model that uses micro-architecture dependent profiles. Over a large design space, the model has a 13% error for performance and a 7% error for power, compared to cycle-level simulation. The model is able to accurately determine the optimal processor configuration for different applications under power or performance constraints, and it can provide insight into performance through cycle stacks.","PeriodicalId":189378,"journal":{"name":"2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-03-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126076320","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Emulating cache organizations on real hardware using performance cloning
Pub Date: 2015-03-29 | DOI: 10.1109/ISPASS.2015.7095815
Yipeng Wang, Yan Solihin
Computer system designers need a deep understanding of end users' workloads in order to arrive at an optimal design. Unfortunately, many end users will not share their software with designers due to its proprietary or confidential nature. Researchers have proposed workload cloning, a process of extracting statistics that summarize the behavior of users' workloads through profiling, and then using them to drive the generation of a representative synthetic workload (clone). Clones can be used in place of the original workloads to evaluate computer system performance, helping designers understand the behavior of users' workloads on the simulated machine models without the users having to disclose proprietary or sensitive information about the original workload. In this paper, we propose infusing environment-specific information into the clone. This Environment-Specific Clone (ESC) enables the simulation of hypothetical cache configurations directly on a machine with a different cache configuration. We validate ESC on both real systems and cache simulations. Furthermore, we present a case study of how page mapping affects cache performance. ESC enables such a study at native machine speed by infusing the page mapping information into clones, without needing to modify the OS or hardware. We then analyze the factors that determine how page mapping impacts cache performance, and how various applications are affected differently.
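The page-mapping case study rests on a mechanical fact worth spelling out: in a physically indexed cache whose sets span more address bits than a page offset provides, the OS's choice of physical page supplies the upper set-index bits. The sketch below, with illustrative cache parameters, shows how two different virtual-to-physical mappings of the same access stream can touch very different numbers of sets.

```python
# Why OS page mapping matters to a physically indexed cache: the physical
# page number supplies the high set-index bits, so two mappings of the same
# virtual access stream can use very different fractions of the cache.
# Cache parameters and mappings below are illustrative.
LINE = 64
SETS = 8192            # e.g. a 4 MB, 8-way cache: 4*2**20 / (64 * 8)
PAGE = 4096

def set_index(paddr):
    return (paddr // LINE) % SETS

def sets_touched(vaddrs, page_map):
    """page_map: virtual page number -> physical page number (the OS's choice)."""
    return {set_index(page_map[va // PAGE] * PAGE + va % PAGE) for va in vaddrs}

stream = [i * PAGE + 128 for i in range(64)]            # one access per virtual page
contiguous = {vpn: vpn for vpn in range(64)}            # pages spread across sets
colliding  = {vpn: vpn * (SETS // (PAGE // LINE)) for vpn in range(64)}  # pages alias

print(len(sets_touched(stream, contiguous)), "sets vs",
      len(sets_touched(stream, colliding)), "sets")     # 64 sets vs 1 set
```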
{"title":"Emulating cache organizations on real hardware using performance cloning","authors":"Yipeng Wang, Yan Solihin","doi":"10.1109/ISPASS.2015.7095815","DOIUrl":"https://doi.org/10.1109/ISPASS.2015.7095815","url":null,"abstract":"Computer system designers need a deep understanding of end users' workload in order to arrive at an optimum design. Unfortunately, many end users will not share their software to designers due to the proprietary or confidential nature of their software. Researchers have proposed workload cloning, which is a process of extracting statistics that summarize the behavior of users' workloads through profiling, followed by using them to drive the generation of a representative synthetic workload (clone). Clones can be used in place of the original workloads to evaluate computer system performance, helping designers to understand the behavior of users workload on the simulated machine models without the users having to disclose proprietary or sensitive information about the original workload. In this paper, we propose infusing environment-specific information into the clone. This Environment-Specific Clone (ESC) enables the simulation of hypothetical cache configurations directly on a machine with a different cache configuration. We validate ESC on both real systems as well as cache simulations. Furthermore, we present a case study of how page mapping affects cache performance. ESC enables such a study at native machine speed by infusing the page mapping information into clones, without needing to modify the OS or hardware. We then analyze the factors that determine how page mapping impact cache performance, and how various applications are affected differently.","PeriodicalId":189378,"journal":{"name":"2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"71 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-03-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121152611","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ARACompiler: a prototyping flow and evaluation framework for accelerator-rich architectures
Pub Date: 2015-03-29 | DOI: 10.1109/ISPASS.2015.7095795
Yu-Ting Chen, J. Cong, Bingjun Xiao
Accelerator-rich architectures (ARAs) provide energy-efficient solutions for domain-specific computing in the age of dark silicon. However, due to the complex interaction between the general-purpose cores, accelerators, customized on-chip interconnects, customized memory systems, and operating systems, it has been difficult to obtain detailed and accurate evaluations and analyses of ARAs on complex real-life benchmarks using existing full-system simulators. In this paper we develop ARACompiler, a highly automated design flow for prototyping ARAs and performing evaluation on FPGAs. An efficient system software stack is generated automatically to handle resource management and TLB misses. We further provide application programming interfaces (APIs) for users to develop their applications using accelerators. The flow provides 2.9x to 42.6x savings in evaluation time over full-system simulation.
{"title":"ARACompiler: a prototyping flow and evaluation framework for accelerator-rich architectures","authors":"Yu-Ting Chen, J. Cong, Bingjun Xiao","doi":"10.1109/ISPASS.2015.7095795","DOIUrl":"https://doi.org/10.1109/ISPASS.2015.7095795","url":null,"abstract":"Accelerator-rich architectures (ARAs) provide energy-efficient solutions for domain-specific computing in the age of dark silicon. However, due to the complex interaction between the general-purpose cores, accelerators, customized onchip interconnects, customized memory systems, and operating systems, it has been difficult to get detailed and accurate evaluations and analyses of ARAs on complex real-life benchmarks using the existing full-system simulators. In this paper we develop the ARACompiler, which is a highly automated design flow for prototyping ARAs and performing evaluation on FPGAs. An efficient system software stack is generated automatically to handle resource management and TLB misses.We further provide application programming interfaces (APIs) for users to develop their applications using accelerators. The flow can provide 2.9x to 42.6x evaluation time saving over the full-system simulations.","PeriodicalId":189378,"journal":{"name":"2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-03-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121233289","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Graph Processing Platforms at Scale: Practices and Experiences
Pub Date: 2015-03-29 | DOI: 10.1109/ISPASS.2015.7095783
Seung-Hwan Lim, S. Lee, Gautam Ganesh, Tyler C. Brown, S. Sukumar
Graph analysis has revealed patterns and relationships hidden in data from a variety of domains, such as transportation networks, social networks, clinical pathways, and collaboration networks. As these networks grow in size, variety and complexity, it is a challenge to find the right combination of tools and algorithm implementations to discover new insights from the data. Addressing this challenge, our study presents an extensive empirical evaluation of three representative graph processing platforms: Pegasus, GraphX, and Urika. Each system represents a different combination of data model, processing paradigm, and infrastructure. We benchmark each platform on real-world graphs using three popular graph mining operations: degree distribution, connected components, and PageRank. Our experiments show that each graph processing platform has particular strengths for different types of graph operations. While Urika performs best on non-iterative graph operations such as degree distribution, GraphX performs best on iterative operations such as connected components and PageRank. We conclude by discussing options to optimize the performance of graph-theoretic operations on each platform for large-scale real-world graphs.
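For readers unfamiliar with the benchmarked operations, the sketch below shows PageRank as a plain power iteration over an edge list; this iterative, whole-graph structure (in contrast to a one-pass computation such as degree distribution) is what drives the platform differences reported above. The toy graph and damping factor are illustrative.

```python
# Minimal PageRank power iteration over an edge list (toy graph, d = 0.85).
from collections import defaultdict

def pagerank(edges, d=0.85, iters=50):
    out_deg, incoming, nodes = defaultdict(int), defaultdict(list), set()
    for src, dst in edges:
        out_deg[src] += 1
        incoming[dst].append(src)
        nodes.update((src, dst))
    rank = {v: 1.0 / len(nodes) for v in nodes}
    for _ in range(iters):
        # Mass from dangling nodes (no out-edges) is redistributed uniformly.
        dangling = sum(rank[v] for v in nodes if out_deg[v] == 0)
        rank = {v: (1 - d) / len(nodes)
                   + d * (sum(rank[u] / out_deg[u] for u in incoming[v])
                          + dangling / len(nodes))
                for v in nodes}
    return rank

edges = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 0)]
print(pagerank(edges))
```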
{"title":"Graph Processing Platforms at Scale: Practices and Experiences","authors":"Seung-Hwan Lim, S. Lee, Gautam Ganesh, Tyler C. Brown, S. Sukumar","doi":"10.1109/ISPASS.2015.7095783","DOIUrl":"https://doi.org/10.1109/ISPASS.2015.7095783","url":null,"abstract":"Graph analysis has revealed patterns and relationships hidden in data from a variety of domains such as transportation networks, social networks, clinical pathways, and collaboration networks. As these networks grow in size, variety and complexity, it is a challenge to find the right combination of tools and implementation of algorithms to discover new insights from the data. Addressing this challenge, our study presents an extensive empirical evaluation of three representative graph processing platforms: Pegasus, GraphX, and Urika. Each system represents a combination of options in data model, processing paradigm, and infrastructure. We benchmark each platform using three popular graph mining operations, degree distribution, connected components, and PageRank over real-world graphs. Our experiments show that each graph processing platform owns a particular strength for different types of graph operations. While Urika performs the best in non-iterative graph operations like degree distribution, GraphX outperforms iterative operations like connected components and PageRank. We conclude this paper by discussing options to optimize the performance of a graph-theoretic operation on each platform for large-scale real world graphs.","PeriodicalId":189378,"journal":{"name":"2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"55 3","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-03-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"113972443","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Performance and energy evaluation of data prefetching on Intel Xeon Phi
Pub Date: 2015-03-29 | DOI: 10.1109/ISPASS.2015.7095814
D. Guttman, M. Kandemir, Meenakshi Arunachalam, V. Calina
There is an urgent need to evaluate existing parallelism- and data locality-oriented techniques on emerging manycore machines using multithreaded applications. Data prefetching is a well-known latency-hiding technique with various hardware- and software-based implementations in almost all commercial machines. A well-tuned prefetcher can reduce observed data access latencies significantly by bringing soon-to-be-requested data into the cache ahead of time, ultimately improving application execution time. Motivated by this, we present a detailed performance and power characterization of software (compiler-guided) and hardware data prefetching on an Intel Xeon Phi-based system. Our main contributions are (i) an analysis of the interactions between hardware and software prefetching, showing how hardware prefetching can throttle itself in response to software prefetching; (ii) results on the power and energy behavior of prefetching, showing how performance and energy gains outweigh the increased power cost of prefetching; and (iii) an evaluation of the use of intrinsic prefetch instructions for applications with difficult-to-detect access patterns.
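Contribution (ii) comes down to the identity E = P · t: prefetching can raise average power yet still save energy if it shortens execution time by more. The numbers in the quick check below are illustrative, not measurements from the paper.

```python
# Back-of-the-envelope check of the energy argument: since E = P * t, a run
# with prefetching that draws more power but finishes faster can still save
# energy. Inputs are illustrative, not the paper's measurements.
def energy_ratio(power_increase, speedup):
    """Energy with prefetching relative to without."""
    return (1.0 + power_increase) / speedup

# e.g. +8% average power but 1.3x faster execution:
print(f"relative energy: {energy_ratio(0.08, 1.3):.2f}")   # ~0.83 -> ~17% energy saved
```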
{"title":"Performance and energy evaluation of data prefetching on intel Xeon Phi","authors":"D. Guttman, M. Kandemir, Meenakshi Arunachalam, V. Calina","doi":"10.1109/ISPASS.2015.7095814","DOIUrl":"https://doi.org/10.1109/ISPASS.2015.7095814","url":null,"abstract":"There is an urgent need to evaluate the existing parallelism and data locality-oriented techniques on emerging manycore machines using multithreaded applications. Data prefetching is a well-known latency hiding technique that comes with various hardware- and software-based implementations in almost all commercial machines. A well-tuned prefetcher can reduce the observed data access latencies significantly by bringing the soonto- be-requested data into the cache ahead of time, eventually improving application execution time. Motivated by this, we present in this paper a detailed performance and power characterization of software (compiler-guided) and hardware data prefetching on an Intel Xeon Phi-based system. Our main contributions are (i) an analysis of the interactions between hardware and software prefetching, showing how hardware prefetching can throttle itself in response to software; (ii) results on the power and energy behavior of prefetching, showing how performance and energy gains outweigh the increased power cost of prefetching; and (iii) an evaluation of the use of intrinsic prefetch instructions to prefetch for applications with difficult-to-detect access patterns.","PeriodicalId":189378,"journal":{"name":"2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"59 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-03-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124386456","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}