Retrograde Analysis (RA) is an AI search technique used to compute endgame databases, which contain optimal solutions for part of the search space of a game. RA has been applied successfully to several games, but its usefulness is restricted by the huge amount of CPU time and internal memory it requires. We present a parallel distributed algorithm for RA that addresses these problems. RA is hard to parallelize efficiently because the communication overhead is potentially enormous. We show that the overhead can be reduced drastically using message combining. We implemented the algorithm on an Ethernet-based distributed system. For one example game (awari), we have computed a large database in 50 minutes on 64 processors, whereas one machine took 40 hours (a speedup of 48). An even larger database (computed in 20 hours) would have required over 600 MByte of internal memory on a uniprocessor and would have taken many weeks to compute.
{"title":"Parallel Retrograde Analysis on a Distributed System","authors":"H. Bal, L. Allis","doi":"10.1145/224170.224470","DOIUrl":"https://doi.org/10.1145/224170.224470","url":null,"abstract":"Retrograde Analysis (RA) is an AI search technique used to compute endgame databases, which contain optimal solutions for part of the search space of a game. RA has been applied successfully to several games, but its usefulness is restricted by the huge amount of CPU time and internal memory it requires. We present a parallel distributed algorithm for RA that addresses these problems. RA is hard to parallelize efficiently, because the communication overhead potentially is enormous. We show that the overhead can be reduced drastically using message combining. We implemented the algorithm on an Ethernet-based distributed system. For one example game (awari), we have computed a large database in 50 minutes on 64 processors, whereas one machine took 40 hours (a speedup of 48). An even larger database (computed in 20 hours) would have required over 600 MByte of internal memory on a uniprocessor and would compute for many weeks.","PeriodicalId":269909,"journal":{"name":"Proceedings of the IEEE/ACM SC95 Conference","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130664042","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
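The message combining mentioned in the retrograde analysis abstract can be illustrated with a small sketch. It is not the authors' implementation: the class name `CombiningSender`, the `send(dest, batch)` callback, and the batch size are all hypothetical, chosen only to show how many small per-state updates can be packed into fewer network messages.

```python
from collections import defaultdict


class CombiningSender:
    """Buffers small updates per destination and ships them in batches.

    Hypothetical sketch: `send` stands in for whatever transport is used.
    """

    def __init__(self, send, batch_size=4):
        self.send = send                  # assumed callback: send(dest, list_of_updates)
        self.batch_size = batch_size      # assumed tuning knob, not from the abstract
        self.buffers = defaultdict(list)  # destination -> pending updates
        self.messages_sent = 0

    def post(self, dest, update):
        """Queue one update; transmit only when the buffer is full."""
        buf = self.buffers[dest]
        buf.append(update)
        if len(buf) >= self.batch_size:
            self._flush_one(dest)

    def _flush_one(self, dest):
        buf = self.buffers.pop(dest, [])
        if buf:
            self.send(dest, buf)
            self.messages_sent += 1

    def flush(self):
        """Send whatever is still buffered, e.g. at the end of a phase."""
        for dest in list(self.buffers):
            self._flush_one(dest)


def demo():
    delivered = defaultdict(list)
    sender = CombiningSender(
        lambda dest, batch: delivered[dest].extend(batch), batch_size=4
    )
    # Ten updates spread over two destination processors.
    for state in range(10):
        sender.post(state % 2, state)
    sender.flush()
    return sender.messages_sent, dict(delivered)


if __name__ == "__main__":
    print(demo())
```

In this toy run, ten updates travel in four messages instead of ten. On a network with high per-message cost, larger batches trade a little latency for far fewer messages.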
We describe the design of a runtime system for a fine-grained concurrent object-oriented (actor) language and its performance. The runtime system provides considerable flexibility to users; specifically, it supports location transparency, actor creation and dynamic placement, and migration. The runtime system includes an efficient distributed name server, a latency hiding scheme for remote actor creation, and a compiler-controlled intra-node scheduling mechanism for local messages and dynamic load balancing. Our preliminary evaluation results suggest that the efficiency that is lost by the greater flexibility of actors can be restored by an efficient runtime system which provides an open interface that can be used by a compiler to allow optimizations. On several standard algorithms, the performance results for our system are comparable to efficient C implementations.
{"title":"Efficient Support of Location Transparency in Concurrent Object-Oriented Programming Languages","authors":"Wooyoung Kim, G. Agha","doi":"10.1145/224170.224297","DOIUrl":"https://doi.org/10.1145/224170.224297","url":null,"abstract":"We describe the design of a runtime system for a fine-grained concurrent object-oriented (actor) language and its performance. The runtime system provides considerable flexibility to users; specifically, it supports location transparency, actor creation and dynamic placement, and migration. The runtime system includes an efficient distributed name server, a latency hiding scheme for remote actor creation, and a compiler-controlled intra-node scheduling mechanism for local messages and dynamic load balancing. Our preliminary evaluation results suggest that the efficiency that is lost by the greater flexibility of actors can be restored by an efficient runtime system which provides an open interface that can be used by a compiler to allow optimizations. On several standard algorithms, the performance results for our system are comparable to efficient C implementations.","PeriodicalId":269909,"journal":{"name":"Proceedings of the IEEE/ACM SC95 Conference","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125950004","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
C. Miller, D. G. Payne, T. Phung, H. Siegel, Roy D. Williams
We discuss the results of a collaborative project on parallel processing of Synthetic Aperture Radar (SAR) data, carried out between the NASA/Jet Propulsion Laboratory (JPL), the California Institute of Technology (Caltech) and Intel Scalable Systems Division (SSD). Through this collaborative effort, we have successfully parallelized the most compute-intensive SAR correlator phase of the Spaceborne Shuttle Imaging Radar-C/X-Band SAR (SIR-C/X-SAR) code, for the Intel Paragon. We describe the data decomposition, the scalable high-performance I/O model, and the node-level optimizations which enable us to obtain efficient processing throughput. In particular, we point out an interesting double level of parallelization arising in the data decomposition which substantially increases our ability to support "high volume" SAR. Results are presented from this code running in parallel on the Intel Paragon. A representative set of SAR data, of size 800 Megabytes, which was collected by the SIR-C/X-SAR instrument aboard NASA's Space Shuttle in 15 seconds, is processed in 55 seconds on the Concurrent Supercomputing Consortium's Paragon XP/S 35+. This compares well with a time of 12 minutes for the current SIR-C/X-SAR processing system at JPL. For the first time, a commercial system can process SIR-C/X-SAR data at a rate which is approaching the rate at which the SIR-C/X-SAR instrument can collect the data. This work has successfully demonstrated the viability of the Intel Paragon supercomputer for processing "high volume" Synthetic Aperture Radar data in near real-time.
{"title":"Parallel Processing of Spaceborne Imaging Radar Data","authors":"C. Miller, D. G. Payne, T. Phung, H. Siegel, Roy D. Williams","doi":"10.1145/224170.224281","DOIUrl":"https://doi.org/10.1145/224170.224281","url":null,"abstract":"We discuss the results of a collaborative project on parallel processing of Synthetic Aperture Radar (SAR) data, carried out between the NASA/Jet Propulsion Laboratory (JPL), the California Institute of Technology (Caltech) and Intel Scalable Systems Division (SSD). Through this collaborative effort, we have successfully parallelized the most compute-intensive SAR correlator phase of the Spaceborne Shuttle Imaging Radar-C/X-Band SAR (SIR-C/X-SAR) code, for the Intel Paragon. We describe the data decomposition, the scalable high-performance I/O model, and the node-level optimizations which enable us to obtain efficient processing throughput. In particular, we point out an interesting double level of parallelization arising in the data decomposition which increases substantially our ability to support ''high volume'' SAR. Results are presented from this code running in parallel on the Intel Paragon. A representative set of SAR data, of size 800 Megabytes, which was collected by the SIR-C/X-SAR instrument aboard NASA's Space Shuttle in 15 seconds, is processed in 55 seconds on the Concurrent Supercomputing Consortium's Paragon XP/S 35+. This compares well with a time of 12 minutes for the current SIR-C/X-SAR processing system at JPL. For the first time, a commercial system can process SIR-C/X-SAR data at a rate which is approaching the rate at which the SIR-C/X-SAR instrument can collect the data. 
This work has successfully demonstrated the viability of the Intel Paragon supercomputer for processing ''high volume\" Synthetic Aperture Radar data in near real-time.","PeriodicalId":269909,"journal":{"name":"Proceedings of the IEEE/ACM SC95 Conference","volume":"89 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125007360","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Currently, most approaches to retrieving textual materials from scientific databases depend on a lexical match between words in users’ requests and those in or assigned to documents in a database. Because of the tremendous diversity in the words people use to describe the same document, lexical methods are necessarily incomplete and imprecise. Using the singular value decomposition (SVD), one can take advantage of the implicit higher-order structure in the association of terms with documents by determining the SVD of large sparse term-by-document matrices. Terms and documents represented by 200-300 of the largest singular vectors are then matched against user queries. We call this retrieval method Latent Semantic Indexing (LSI) because the subspace represents important associative relationships between terms and documents that are not evident in individual documents. LSI is a completely automatic yet intelligent indexing method, widely applicable, and a promising way to improve users’ access to many kinds of textual materials, or to documents and services for which textual descriptions are available. A survey of the computational requirements for managing LSI-encoded databases as well as current and future applications of LSI is presented.
{"title":"Computational Methods for Intelligent Information Access","authors":"M. Berry, S. Dumais, Todd A. Letsche","doi":"10.1145/224170.285569","DOIUrl":"https://doi.org/10.1145/224170.285569","url":null,"abstract":"Currently, most approaches to retrieving textual materials from scientific databases depend on a lexical match between words in users’ requests and those in or assigned to documents in a database. Because of the tremendous diversity in the words people use to describe the same document, lexical methods are necessarily incomplete and imprecise. Using the singular value decomposition (SVD), one can take advantage of the implicit higher-order structure in the association of terms with documents by determining the SVD of large sparse term by document matrices. Terms and documents represented by 200-300 of the largest singular vectors are then matched against user queries. We call this retrieval method Latent Semantic Indexing (LSI) because the subspace represents important associative relationships between terms and documents that are not evident in individual documents. LSI is a completely automatic yet intelligent indexing method, widely applicable, and a promising way to improve users’ access to many kinds of textual materials, or to documents and services for which textual descriptions are available. 
A survey of the computational requirements for managing LSI-encoded databases as well as current and future applications of LSI is presented.","PeriodicalId":269909,"journal":{"name":"Proceedings of the IEEE/ACM SC95 Conference","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121619459","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The increasing use of massively parallel supercomputers to solve large-scale scientific problems has generated a need for tools that can predict scalability trends of applications written for these machines. Much work has been done to create simple models that represent important characteristics of parallel programs, such as latency, network contention, and communication volume. But many of these methods still require substantial manual effort to represent an application in the model's format. The MK toolkit described in this paper is the result of an on-going effort to automate the formation of analytic expressions of program execution time, with a minimum of programmer assistance. In this paper we demonstrate the feasibility of our approach, by extending previous work to detect and model communication patterns automatically, with and without overlapped computations. The predictions derived from these models agree, within reasonable limits, with execution times of programs measured on the Intel iPSC/860 and Paragon. Further, we demonstrate the use of MK in selecting optimal computational grain size and studying various scalability metrics.
{"title":"Automated Performance Prediction of Message-Passing Parallel Programs","authors":"R. Block, S. Sarukkai, P. Mehra","doi":"10.1145/224170.224273","DOIUrl":"https://doi.org/10.1145/224170.224273","url":null,"abstract":"The increasing use of massively parallel supercomputers to solve large-scale scientific problems has generated a need for tools that can predict scalability trends of applications written for these machines. Much work has been done to create simple models that represent important characteristics of parallel programs, such as latency, network contention, and communication volume. But many of these methods still require substantial manual effort to represent an application in the model's format. The MK toolkit described in this paper is the result of an on-going effort to automate the formation of analytic expressions of program execution time, with a minimum of programmer assistance. In this paper we demonstrate the feasibility of our approach, by extending previous work to detect and model communication patterns automatically, with and without overlapped computations. The predictions derived from these models agree, within reasonable limits, with execution times of programs measured on the Intel iPSC/860 and Paragon. Further, we demonstrate the use of MK in selecting optimal computational grain size and studying various scalability metrics.","PeriodicalId":269909,"journal":{"name":"Proceedings of the IEEE/ACM SC95 Conference","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121639635","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
As network capacities increase, wide-area distributed parallel computing may become feasible. This paper addresses one of the issues involved in using an asynchronous transfer mode (ATM) network for such a purpose: developing an appropriate call admission control (CAC) procedure for such applications given the special nature of their traffic. In this proposal, connections belonging to the same application and sharing the same link are allowed to utilize the link bandwidth in a strongly correlated manner. However, connections belonging to different applications are still assumed to be independent. This allows the development of a tabular approach for keeping track of the aggregate bandwidth demand of the applications sharing the same link. The proposed approach is compared with two related approaches (one more conservative and another more aggressive) and is shown to strike a balance between utilization and loss rate.
{"title":"Model and Call Admission Control for Distributed Applications with Correlated Bursty Traffic","authors":"Jose Roberto Fernandez, M. Mutka","doi":"10.1145/224170.224190","DOIUrl":"https://doi.org/10.1145/224170.224190","url":null,"abstract":"As network capacities increase, wide-area distributed parallel computing may become feasible. This paper addresses one of the issues involved in using an asynchronous transfer mode (ATM) network for such a purpose — that of developing an appropriate call admission control (CAC) procedure for such applications given the special nature of their traffic. In this proposal, connections belonging to the same application and sharing the same link are allowed to utilize the linkbandwidth in a strongly correlated manner. However, connections belonging to different applications are still assumed to be independent. This allows the development of a tabular approach for keeping track of the aggregate bandwidth demand of the applications sharing the same link. The proposed approach is compared with two related approaches (one more conservative and another more aggressive) and is shown to strike a balance between utilization and loss rate.","PeriodicalId":269909,"journal":{"name":"Proceedings of the IEEE/ACM SC95 Conference","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133008198","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
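The tabular bookkeeping described in the call admission control abstract might look roughly like the sketch below. The abstract does not give the traffic model, so everything specific here is an assumption made for illustration: each application is treated as a two-state source that is either idle or bursting on all of its connections at once (full correlation within the application), applications are treated as independent of one another, and the names `add_application`, `overflow_probability`, and `admit` are hypothetical.

```python
def add_application(table, peak_sum, p_active):
    """Fold one application into the link's demand table.

    `table` maps aggregate demand -> probability. The application is assumed
    to demand `peak_sum` (the sum of its connections' peak rates on this
    link) with probability `p_active`, and nothing otherwise.
    """
    new = {}
    for demand, prob in table.items():
        # Application idle: aggregate demand unchanged.
        new[demand] = new.get(demand, 0.0) + prob * (1.0 - p_active)
        # Application bursting: all of its connections peak together.
        busy = demand + peak_sum
        new[busy] = new.get(busy, 0.0) + prob * p_active
    return new


def overflow_probability(table, capacity):
    """Probability that aggregate demand exceeds the link capacity."""
    return sum(prob for demand, prob in table.items() if demand > capacity)


def admit(table, peak_sum, p_active, capacity, max_overflow):
    """Admit the application only if the overflow target still holds."""
    candidate = add_application(table, peak_sum, p_active)
    if overflow_probability(candidate, capacity) <= max_overflow:
        return True, candidate
    return False, table


if __name__ == "__main__":
    table = {0: 1.0}  # empty link: zero demand with certainty
    for peak, p in [(60, 0.2), (50, 0.2), (50, 0.5)]:
        ok, table = admit(table, peak, p, capacity=100, max_overflow=0.05)
        print(peak, p, "admitted" if ok else "rejected")
```

Summing peak rates within an application is the conservative reading of "strongly correlated", while convolving across applications uses the independence assumption to recover some statistical multiplexing gain.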
Given the large communication overheads characteristic of modern parallel machines, optimizations that eliminate, hide or parallelize communication may improve the performance of parallel computations. This paper describes our experience automatically applying communication optimizations in the context of Jade, a portable, implicitly parallel programming language designed for exploiting task-level concurrency. Jade programmers start with a program written in a standard serial, imperative language, then use Jade constructs to declare how parts of the program access data. The Jade implementation uses this data access information to automatically extract the concurrency and apply communication optimizations. Jade implementations exist for both shared memory and message passing machines; each Jade implementation applies communication optimizations appropriate for the machine on which it runs. We present performance results for several Jade applications running on both a shared memory machine (the Stanford DASH machine) and a message passing machine (the Intel iPSC/860). We use these results to characterize the overall performance impact of the communication optimizations. For our application set, replicating data for concurrent read access and improving the locality of the computation by placing tasks close to the data that they access are the most important optimizations. Broadcasting widely accessed data has a significant performance impact on one application; other optimizations such as concurrently fetching remote data and overlapping computation with communication have no effect.
{"title":"Communication Optimizations for Parallel Computing Using Data Access Information","authors":"M. Rinard","doi":"10.1145/224170.224413","DOIUrl":"https://doi.org/10.1145/224170.224413","url":null,"abstract":"Given the large communication overheads characteristic of modern parallel machines, optimizations that eliminate, hide or parallelize communication may improve the performance of parallel computations. This paper describes our experience automatically applying communication optimizations in the context of Jade, a portable, implicitly parallel programming language designed for exploiting task-level concurrency. Jade programmers start with a program written in a standard serial, imperative language, then use Jade constructs to declare how parts of the program access data. The Jade implementation uses this data access information to automatically extract the concurrency and apply communication optimizations. Jade implementations exist for both shared memory and message passing machines; each Jade implementation applies communication optimizations appropriate for the machine on which it runs. We present performance results for several Jade applications running on both a shared memory machine (the Stanford DASH machine) and a message passing machine (the Intel iPSC/860). We use these results to characterize the overall performance impact of the communication optimizations. For our application set replicating data for concurrent read access and improving the locality of the computation by placing tasks close to the data that they access are the most important optimizations. 
Broadcasting widely accessed data has a significant performance impact on one application; other optimizations such as concurrently fetching remote data and overlapping computation with communication have no effect.","PeriodicalId":269909,"journal":{"name":"Proceedings of the IEEE/ACM SC95 Conference","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124989920","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Rapid increases in computing and communication performance are exacerbating the long-standing problem of performance-limited input/output. Indeed, for many otherwise scalable parallel applications, input/output is emerging as a major performance bottleneck. The design of scalable input/output systems depends critically on the input/output requirements and access patterns for this emerging class of large-scale parallel applications. However, hard data on the behavior of such applications is only now becoming available. In this paper, we describe the input/output requirements of three scalable parallel applications (electron scattering, terrain rendering, and quantum chemistry) on the Intel Paragon XP/S. As part of an ongoing parallel input/output characterization effort, we used instrumented versions of the application codes to capture and analyze input/output volume, request size distributions, and temporal request structure. Because complete traces of individual application input/output requests were captured, in-depth, off-line analyses were possible. In addition, we conducted informal interviews of the application developers to understand the relation between the codes' current and desired input/output structure. The results of our studies show a wide variety of temporal and spatial access patterns, including highly read-intensive and write-intensive phases, extremely large and extremely small request sizes, and both sequential and highly irregular access patterns. We conclude with a discussion of the broad spectrum of access patterns and their profound implications for parallel file caching and prefetching schemes.
{"title":"Input/Output Characteristics of Scalable Parallel Applications","authors":"Phyllis E. Crandall, R. Aydt, A. Chien, D. Reed","doi":"10.1145/224170.224396","DOIUrl":"https://doi.org/10.1145/224170.224396","url":null,"abstract":"Rapid increases in computing and communication performance are exacerbating the long-standing problem of performance-limited input/output. Indeed, for many otherwise scalable parallel applications. input/output is emerging as a major performance bottleneck. The design of scalable input/output systems depends critically on the input/output requirements and access patterns for this emerging class of large-scale parallel applications. However, hard data on the behavior of such applications is only now becoming available. In this paper, we describe the input-output requirements of three scalable parallel applications (electron scattering, terrain rendering, and quantum chemistry, on the Intel Paragon XP/S. As part of an ongoing parallel input/output characterization effort, we used instrumented versions of the application codes to capture and analyze input/output volume, request size distributions, and temporal request structure. Because complete traces of individual application input/output requests were captured, in-depth, off-line analyses were possible. In addition, we conducted informal interviews of the application developers to understand the relation between the codes' current and desired input/output structure. The results of our studies show a wide variety of temporal and spatial access patterns, including highly read-intensive and write-intensive phases, extremely large and extremely small request sizes, and both sequential and highly irregular access patterns. 
We conclude with a discussion of the broad spectrum of access patterns and their profound implications for parallel file caching and prefetching schemes.","PeriodicalId":269909,"journal":{"name":"Proceedings of the IEEE/ACM SC95 Conference","volume":"110 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115133499","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Vikram S. Adve, J. Mellor-Crummey, Mark Anderson, K. Kennedy, Jhy-Chun Wang, D. Reed
Supporting source-level performance analysis of programs written in data-parallel languages requires a unique degree of integration between compilers and performance analysis tools. Compilers for languages such as High Performance Fortran infer parallelism and communication from data distribution directives; thus, performance tools cannot meaningfully relate measurements about these key aspects of execution performance to source-level constructs without substantial compiler support. This paper describes an integrated system for performance analysis of data-parallel programs based on the Rice Fortran 77D compiler and the Illinois Pablo performance analysis toolkit. During code generation, the Fortran D compiler records mapping information and semantic analysis results describing the relationship between performance instrumentation and the original source program. An integrated performance analysis system based on the Pablo toolkit uses this information to correlate the program's dynamic behavior with the data-parallel source code. The integrated system provides detailed source-level performance feedback to programmers via a pair of graphical interfaces. Our strategy serves as a model for integration of data-parallel compilers and performance tools.
{"title":"An Integrated Compilation and Performance Analysis Environment for Data Parallel Programs","authors":"Vikram S. Adve, J. Mellor-Crummey, Mark Anderson, K. Kennedy, Jhy-Chun Wang, D. Reed","doi":"10.1145/224170.224340","DOIUrl":"https://doi.org/10.1145/224170.224340","url":null,"abstract":"Supporting source-level performance analysis of programs written in data-parallel languages requires a unique degree of integration between compilers and performance analysis tools. Compilers for languages such as High Performance Fortran infer parallelism and communication from data distribution directives, thus, performance tools cannot meaningfully relate measurements about these key aspects of execution performance to source-level constructs without substantial compiler support. This paper describes an integrated system for performance analysis of data-parallel programs based on the Rice Fortran 77D compiler and the Illinois Pablo performance analysis toolkit. During code generation, the Fortran D compiler records mapping information and semantic analysis results describing the relationship between performance instrumentation and the original source program. An integrated performance analysis system based on the Pablo toolkit uses this information to correlate the program's dynamic behavior with the data parallel source code. The integrated system provides detailed source-level performance feedback to programmers via a pair of graphical interfaces. 
Our strategy serves as a model for integration of data-parallel compilers and performance tools.","PeriodicalId":269909,"journal":{"name":"Proceedings of the IEEE/ACM SC95 Conference","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127310014","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
U. Ramachandran, Gautam Shah, A. Sivasubramaniam, A. Singla, I. Yanasak
The goal of this work is to explore architectural mechanisms for supporting explicit communication in cache-coherent shared memory multiprocessors. The motivation stems from the observation that applications display wide diversity in terms of sharing characteristics and hence impose different communication requirements on the system. Explicit communication mechanisms would allow tailoring the coherence management under software control to match these differing needs and strive to provide a close approximation to a zero-overhead machine from the application perspective. Toward achieving these goals, we first analyze the characteristics of sharing observed in certain specific applications. We then use these characteristics to synthesize explicit communication primitives. The proposed primitives allow selectively updating a set of processors, or requesting a stream of data ahead of its intended use. These primitives are essentially generalizations of prefetch and poststore, with the ability to specify the sharer set for poststore either statically or dynamically. The proposed primitives are to be used in conjunction with an underlying invalidation-based protocol. Used in this manner, the resulting memory system can dynamically adapt itself to performing either invalidations or updates to match the communication needs. Through an application-driven performance study, we show the utility of these mechanisms in reducing and tolerating communication latencies.
{"title":"Architectural Mechanisms for Explicit Communication in Shared Memory Multiprocessors","authors":"U. Ramachandran, Gautam Shah, A. Sivasubramaniam, A. Singla, I. Yanasak","doi":"10.1145/224170.224399","DOIUrl":"https://doi.org/10.1145/224170.224399","url":null,"abstract":"The goal of this work is to explore architectural mechanisms for supporting explicit communication in cache-coherent shared memory multiprocessors. The motivation stems from the observation that applications display wide diversity in terms of sharing characteristics and hence impose different communication requirements on the system. Explicit communication mechanisms would allow tailoring the coherence management under software control to match these differing needs and strive to provide a close approximation to a zero overhead machine from the application perspective. Toward achieving these goals, we first analyze the characteristics of sharing observed in certain specific applications. We then use these characteristics to synthesize explicit communication primitives. The proposed primitives allow selectively updating a set of processors, or requesting a stream of data ahead of its intended use. These primitives are essentially generalizations of prefetch and poststore, with the ability to specify the sharer set for poststore either statically or dynamically. The proposed primitives are to be used in conjunction with an underlying invalidation based protocol. Used in this manner, the resulting memory system can dynamically adapt itself to performing either invalidations or updates to match the communication needs. 
Through application driven performance study we show the utility of these mechanisms in being able to reduce and tolerate communication latencies.","PeriodicalId":269909,"journal":{"name":"Proceedings of the IEEE/ACM SC95 Conference","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121474251","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}