J. Vetter, P. Worley. "Asserting Performance Expectations." ACM/IEEE SC 2002 Conference (SC'02). doi:10.1109/SC.2002.10046

Traditional techniques for performance analysis provide a means for extracting and analyzing raw performance information from applications. Users then compare this raw data to their performance expectations for application constructs. This comparison can be tedious at the scale of today's architectures and software systems. To address this situation, we present a methodology and prototype that allow users to assert performance expectations explicitly in their source code using performance assertions. As the application executes, each performance assertion collects data implicitly to verify the assertion. Because the user specifies a performance expectation for individual code segments, the runtime system can jettison raw data for measurements that pass their expectation, while reacting to failures with a variety of responses. We present several compelling uses of performance assertions with our operational prototype, including raising a performance exception, validating a performance model, and adapting an algorithm empirically at runtime.
{"title":"Asserting Performance Expectations","authors":"J. Vetter, P. Worley","doi":"10.1109/SC.2002.10046","DOIUrl":"https://doi.org/10.1109/SC.2002.10046","url":null,"abstract":"Traditional techniques for performance analysis provide a means for extracting and analyzing raw performance information from applications. Users then compare this raw data to their performance expectations for application constructs. This comparison can be tedious for the scale of today's architectures and software systems. To address this situation, we present a methodology and prototype that allows users to assert performance expectations explicitly in their source code using performance assertions. As the application executes, each performance assertion in the application collects data implicitly to verify the assertion. By allowing the user to specify a performance expectation with individual code segments, the runtime system can jettison raw data for measurements that pass their expectation, while reacting to failures with a variety of responses. We present several compelling uses of performance assertions with our operational prototype, including raising a performance exception, validating a performance model, and adapting an algorithm empirically at runtime.","PeriodicalId":302800,"journal":{"name":"ACM/IEEE SC 2002 Conference (SC'02)","volume":"75 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133690683","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
H. Andrade, T. Kurç, A. Sussman, J. Saltz. "Active Proxy-G: Optimizing the Query Execution Process in the Grid." ACM/IEEE SC 2002 Conference (SC'02). doi:10.1109/SC.2002.10031

The Grid environment facilitates collaborative work and allows many users to query and process data over geographically dispersed data repositories. Over the past several years, there has been a growing interest in developing applications that interactively analyze datasets, potentially in a collaborative setting. We describe the Active Proxy-G service that is able to cache query results, use those results for answering new incoming queries, generate subqueries for the parts of a query that cannot be produced from the cache, and submit the subqueries for final processing at application servers that store the raw datasets. We present an experimental evaluation to illustrate the effects of various design tradeoffs. We also show the benefits that two real applications gain from using the middleware.
{"title":"Active Proxy-G: Optimizing the Query Execution Process in the Grid","authors":"H. Andrade, T. Kurç, A. Sussman, J. Saltz","doi":"10.1109/SC.2002.10031","DOIUrl":"https://doi.org/10.1109/SC.2002.10031","url":null,"abstract":"The Grid environment facilitates collaborative work and allows many users to query and process data over geographically dispersed data repositories. Over the past several years, there has been a growing interest in developing applications that interactively analyze datasets, potentially in a collaborative setting. We describe the Active Proxy-G service that is able to cache query results, use those results for answering new incoming queries, generate subqueries for the parts of a query that cannot be produced from the cache, and submit the subqueries for final processing at application servers that store the raw datasets. We present an experimental evaluation to illustrate the effects of various design tradeoffs. We also show the benefits that two real applications gain from using the middleware.","PeriodicalId":302800,"journal":{"name":"ACM/IEEE SC 2002 Conference (SC'02)","volume":"67 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134396921","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
F. Bustamante, Patrick M. Widener, K. Schwan. "Scalable Directory Services Using Proactivity." ACM/IEEE SC 2002 Conference (SC'02). doi:10.5555/762761.762786

Common to computational grids and pervasive computing is the need for an expressive, efficient, and scalable directory service that provides information about objects in the environment. We argue that a directory interface that ‘pushes’ information to clients about changes to objects can significantly improve scalability. This paper describes the design, implementation, and evaluation of the Proactive Directory Service (PDS). PDS’ interface supports a customizable ‘proactive’ mode through which clients can subscribe to be notified about changes to their objects of interest. Clients can dynamically tune the detail and granularity of these notifications through filter functions instantiated at the server or at the object’s owner, and by remotely tuning the functionality of those filters. We compare PDS’ performance against off-the-shelf implementations of DNS and the Lightweight Directory Access Protocol. Our evaluation results confirm the expected performance advantages of this approach and demonstrate that customized notification through filter functions can reduce bandwidth utilization while improving the performance of both clients and directory servers.
{"title":"Scalable Directory Services Using Proactivity","authors":"F. Bustamante, Patrick M. Widener, K. Schwan","doi":"10.5555/762761.762786","DOIUrl":"https://doi.org/10.5555/762761.762786","url":null,"abstract":"Common to computational grids and pervasive computing is the need for an expressive, efficient, and scalable directory service that provides information about objects in the environment. We argue that a directory interface that ‘pushes’ information to clients about changes to objects can significantly improve scalability. This paper describes the design, implementation, and evaluation of the Proactive Directory Service (PDS). PDS’ interface supports a customizable ‘proactive’ mode through which clients can subscribe to be notified about changes to their objects of interest. Clients can dynamically tune the detail and granularity of these notifications through filter functions instantiated at the server or at the object’s owner, and by remotely tuning the functionality of those filters. We compare PDS’ performance against off-the-shelf implementations of DNS and the Lightweight Directory Access Protocol. Our evaluation results confirm the expected performance advantages of this approach and demonstrate that customized notification through filter functions can reduce bandwidth utilization while improving the performance of both clients and directory servers.","PeriodicalId":302800,"journal":{"name":"ACM/IEEE SC 2002 Conference (SC'02)","volume":"194 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134073723","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
R. Vuduc, J. Demmel, K. Yelick, S. Kamil, R. Nishtala, Benjamin C. Lee. "Performance Optimizations and Bounds for Sparse Matrix-Vector Multiply." ACM/IEEE SC 2002 Conference (SC'02). doi:10.1109/SC.2002.10025

We consider performance tuning, by code and data structure reorganization, of sparse matrix-vector multiply (SpM×V), one of the most important computational kernels in scientific applications. This paper addresses the fundamental questions of what limits exist on such performance tuning, and how closely tuned code approaches these limits. Specifically, we develop upper and lower bounds on the performance (Mflop/s) of SpM×V when tuned using our previously proposed register blocking optimization. These bounds are based on the non-zero pattern in the matrix and the cost of basic memory operations, such as cache hits and misses. We evaluate our tuned implementations with respect to these bounds using hardware counter data on four different platforms and a test set of 44 sparse matrices. We find that we can often get within 20% of the upper bound, particularly on the class of matrices from finite element modeling (FEM) problems; on non-FEM matrices, performance improvements of 2× are still possible. Lastly, we present a new heuristic that selects optimal or near-optimal register block sizes (the key tuning parameters) more accurately than our previous heuristic. Using the new heuristic, we show improvements in SpM×V performance (Mflop/s) by as much as 2.5× over an untuned implementation. Collectively, our results suggest that future performance improvements, beyond those that we have already demonstrated for SpM×V, will come from two sources: (1) consideration of higher-level matrix structures (e.g., exploiting symmetry, matrix reordering, multiple register block sizes), and (2) optimizing kernels with more opportunity for data reuse (e.g., sparse matrix-multiple vector multiply, multiplication of AᵀA by a vector).
{"title":"Performance Optimizations and Bounds for Sparse Matrix-Vector Multiply","authors":"R. Vuduc, J. Demmel, K. Yelick, S. Kamil, R. Nishtala, Benjamin C. Lee","doi":"10.1109/SC.2002.10025","DOIUrl":"https://doi.org/10.1109/SC.2002.10025","url":null,"abstract":"We consider performance tuning, by code and data structure reorganization, of sparse matrix-vector multiply (SpM×V), one of the most important computational kernels in scientific applications. This paper addresses the fundamental questions of what limits exist on such performance tuning, and how closely tuned code approaches these limits. Specifically, we develop upper and lower bounds on the performance (Mflop/s) of SpM×V when tuned using our previously proposed register blocking optimization. These bounds are based on the non-zero pattern in the matrix and the cost of basic memory operations, such as cache hits and misses. We evaluate our tuned implementations with respect to these bounds using hardware counter data on 4 different platforms and on test set of 44 sparse matrices. We find that we can often get within 20% of the upper bound, particularly on class of matrices from finite element modeling (FEM) problems; on non-FEM matrices, performance improvements of 2× are still possible. Lastly, we present new heuristic that selects optimal or near-optimal register block sizes (the key tuning parameters) more accurately than our previous heuristic. Using the new heuristic, we show improvements in SpM×V performance (Mflop/s) by as much as 2.5× over an untuned implementation. Collectively, our results suggest that future performance improvements, beyond those that we have already demonstrated for SpM×V, will come from two sources: (1) consideration of higher-level matrix structures (e.g. exploiting symmetry, matrix reordering, multiple register block sizes), and (2) optimizing kernels with more opportunity for data reuse (e.g. sparse matrix-multiple vector multiply, multiplication of AT A by a vector).","PeriodicalId":302800,"journal":{"name":"ACM/IEEE SC 2002 Conference (SC'02)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117330874","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Charng-Da Lu, D. Reed. "Compact Application Signatures for Parallel and Distributed Scientific Codes." ACM/IEEE SC 2002 Conference (SC'02). doi:10.1109/SC.2002.10059

Understanding the dynamic behavior of parallel programs is key to developing efficient system software and runtime environments; this is even more true on emerging computational Grids where resource availability and performance can change in unpredictable ways. Event tracing provides details on behavioral dynamics, albeit often at great cost. We describe an intermediate approach, based on curve fitting, that retains many of the advantages of event tracing but with lower overhead. These compact "application signatures" summarize the time-varying resource needs of scientific codes from historical trace data. We also developed a comparison scheme that measures similarity between two signatures, both across executions and across execution environments.
{"title":"Compact Application Signatures for Parallel and Distributed Scientific Codes","authors":"Charng-Da Lu, D. Reed","doi":"10.1109/SC.2002.10059","DOIUrl":"https://doi.org/10.1109/SC.2002.10059","url":null,"abstract":"Understanding the dynamic behavior of parallel programs is key to developing efficient system software and runtime environments; this is even more true on emerging computational Grids where resource availability and performance can change in unpredictable ways. Event tracing provides details on behavioral dynamics, albeit often at great cost. We describe an intermediate approach, based on curve fitting, that retains many of the advantages of event tracing but with lower overhead. These compact \"application signatures\" summarize the time-varying resource needs of scientific codes from historical trace data. We also developed a comparison scheme that measures similarity between two signatures, both across executions and across execution environments.","PeriodicalId":302800,"journal":{"name":"ACM/IEEE SC 2002 Conference (SC'02)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131219376","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
M. Spencer, R. Ferreira, M. Beynon, T. Kurç, Ümit V. Çatalyürek, A. Sussman, J. Saltz. "Executing Multiple Pipelined Data Analysis Operations in the Grid." ACM/IEEE SC 2002 Conference (SC'02). doi:10.1109/SC.2002.10015

Processing of data in many data analysis applications can be represented as an acyclic, coarse grain data flow, from data sources to the client. This paper is concerned with scheduling of multiple data analysis operations, each of which is represented as a pipelined chain of processing on data. We define the scheduling problem for effectively placing components onto Grid resources, and propose two scheduling algorithms. Experimental results are presented using a visualization application.
{"title":"Executing Multiple Pipelined Data Analysis Operations in the Grid","authors":"M. Spencer, R. Ferreira, M. Beynon, T. Kurç, Ümit V. Çatalyürek, A. Sussman, J. Saltz","doi":"10.1109/SC.2002.10015","DOIUrl":"https://doi.org/10.1109/SC.2002.10015","url":null,"abstract":"Processing of data in many data analysis applications can be represented as an acyclic, coarse grain data flow, from data sources to the client. This paper is concerned with scheduling of multiple data analysis operations, each of which is represented as a pipelined chain of processing on data. We define the scheduling problem for effectively placing components onto Grid resources, and propose two scheduling algorithms. Experimental results are presented using a visualization application.","PeriodicalId":302800,"journal":{"name":"ACM/IEEE SC 2002 Conference (SC'02)","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128624361","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
M. Pierce, G. Fox, Choon-Han Youn, S. Mock, K. Mueller, Ozgur Balsoy. "Interoperable Web Services for Computational Portals." ACM/IEEE SC 2002 Conference (SC'02). doi:10.1109/SC.2002.10030

Computational web portals are designed to simplify access to diverse sets of high performance computing resources, typically through an interface to computational Grid tools. An important shortcoming of these portals is their lack of interoperable and reusable services. This paper presents an overview of research efforts undertaken by our group to build interoperating portal services around a Web Services model. We present a comprehensive view of an interoperable portal architecture, beginning with core portal services that can be used to build Application Web Services, which in turn may be aggregated and managed through portlet containers.
L. D. Rose, K. Ekanadham, J. Hollingsworth, S. Sbaraglia. "SIGMA: A Simulator Infrastructure to Guide Memory Analysis." ACM/IEEE SC 2002 Conference (SC'02). doi:10.1109/SC.2002.10055

In this paper we present SIGMA (Simulation Infrastructure to Guide Memory Analysis), a new data collection framework and family of cache analysis tools. The SIGMA environment provides detailed cache information by gathering memory reference data using software-based instrumentation. This infrastructure can facilitate quick probing into the factors that influence the performance of an application by highlighting bottleneck scenarios including: excessive cache/TLB misses and inefficient data layouts. The tool can also assist in perturbation analysis to determine performance variations caused by changes to architecture or program. Our validation tests using the SPEC Swim benchmark show that most of the performance metrics obtained with SIGMA are within 1% of the metrics obtained with hardware performance counters, with the advantage that SIGMA provides performance data on a data structure level, as specified by the programmer.
{"title":"SIGMA: A Simulator Infrastructure to Guide Memory Analysis","authors":"L. D. Rose, K. Ekanadham, J. Hollingsworth, S. Sbaraglia","doi":"10.1109/SC.2002.10055","DOIUrl":"https://doi.org/10.1109/SC.2002.10055","url":null,"abstract":"In this paper we present SIGMA (Simulation Infrastructure to Guide Memory Analysis), a new data collection framework and family of cache analysis tools. The SIGMA environment provides detailed cache information by gathering memory reference data using software-based instrumentation. This infrastructure can facilitate quick probing into the factors that influence the performance of an application by highlighting bottleneck scenarios including: excessive cache/TLB misses and inefficient data layouts. The tool can also assist in perturbation analysis to determine performance variations caused by changes to architecture or program. Our validation tests using the SPEC Swim benchmark show that most of the performance metrics obtained with SIGMA are within 1% of the metrics obtained with hardware performance counters, with the advantage that SIGMA provides performance data on a data structure level, as specified by the programmer.","PeriodicalId":302800,"journal":{"name":"ACM/IEEE SC 2002 Conference (SC'02)","volume":"70 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123988907","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
L. Grigori, X. Li. "A New Scheduling Algorithm for Parallel Sparse LU Factorization with Static Pivoting." ACM/IEEE SC 2002 Conference (SC'02). doi:10.1109/SC.2002.10032

In this paper we present a static scheduling algorithm for parallel sparse LU factorization with static pivoting. The algorithm is divided into mapping and scheduling phases, using the symmetric pruned graphs of Lᵀ and U to represent dependencies. The scheduling algorithm is designed for driving the parallel execution of the factorization on a distributed-memory architecture. Experimental results and comparisons with SuperLU_DIST are reported after applying this algorithm to real-world application matrices on an IBM SP RS/6000 distributed-memory machine.
{"title":"A New Scheduling Algorithm for Parallel Sparse LU Factorization with Static Pivoting","authors":"L. Grigori, X. Li","doi":"10.1109/SC.2002.10032","DOIUrl":"https://doi.org/10.1109/SC.2002.10032","url":null,"abstract":"In this paper we present a static scheduling algorithm for parallel sparse LU factorization with static pivoting. The algorithm is divided into mapping and scheduling phases, using the symmetric pruned graphs of LT and U to represent dependencies. The scheduling algorithm is designed for driving the parallel execution of the factorization on a distributed-memory architecture. Experimental results and comparisons with SuperLU_DIST are reported after applying this algorithm on real world application matrices on an IBM SP RS/6000 distributed memory machine.","PeriodicalId":302800,"journal":{"name":"ACM/IEEE SC 2002 Conference (SC'02)","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131004115","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
K. Hiraki, M. Inaba, J. Tamatsukuri, Ryutaro Kurusu, Yukichi Ikuta, Hisashi Koga, A. Zinzaki. "Data Reservoir: Utilization of Multi-Gigabit Backbone Network for Data-Intensive Research." ACM/IEEE SC 2002 Conference (SC'02). doi:10.5555/762761.762826

We propose a data sharing facility for data-intensive scientific research, the "Data Reservoir", which is optimized to transfer huge data files between distant sites while fully utilizing a multi-gigabit backbone network. In addition, a Data Reservoir can be used as an ordinary UNIX server on a local network without any modification to server software. We use a low-level protocol and hierarchical striping to achieve (1) separation of bulk data transfer from local accesses by caching, (2) file-system transparency, i.e., interoperability with any layer above the disk driver, including the file system, and (3) scalability of network and storage. This paper presents our design, an implementation using the iSCSI protocol [1], and performance results for both a 1 Gbps model on a real network and a 10 Gbps model in our laboratory.
{"title":"Data Reservoir: Utilization of Multi-Gigabit Backbone Network for Data-Intensive Research","authors":"K. Hiraki, M. Inaba, J. Tamatsukuri, Ryutaro Kurusu, Yukichi Ikuta, Hisashi Koga, A. Zinzaki","doi":"10.5555/762761.762826","DOIUrl":"https://doi.org/10.5555/762761.762826","url":null,"abstract":"We propose data sharing facility for data intensive scientific research, \"Data Reservoir\"; which is optimized to transfer huge amount of data files between distant places fully Utilizing multi-gigabit backbone network. In addition, \"Data Reservoir\" can be used as an ordinary UNIX server in local network without any modification of server softwares. We use low-level protocol and hierarchical striping to realize (1) separation of bulk data transfer and local accesses by cashing, (2) file-system transparency, I.e. interoperable whatever in higher layer than disk driver, including file system. (3) scalability for network and storage. This paper shows our design, implementation using iSCSI protocol [1] and their performances for both 1Gbps model in the real network and 10Gbps model in our laboratory.","PeriodicalId":302800,"journal":{"name":"ACM/IEEE SC 2002 Conference (SC'02)","volume":"61 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128894043","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}