Proceedings of the IEEE/ACM SC95 Conference: Latest Publications

Parallel Retrograde Analysis on a Distributed System
Pub Date : 1995-12-08 DOI: 10.1145/224170.224470
H. Bal, L. Allis
Retrograde Analysis (RA) is an AI search technique used to compute endgame databases, which contain optimal solutions for part of the search space of a game. RA has been applied successfully to several games, but its usefulness is restricted by the huge amount of CPU time and internal memory it requires. We present a parallel distributed algorithm for RA that addresses these problems. RA is hard to parallelize efficiently because the communication overhead is potentially enormous. We show that the overhead can be reduced drastically using message combining. We implemented the algorithm on an Ethernet-based distributed system. For one example game (awari), we computed a large database in 50 minutes on 64 processors, whereas one machine took 40 hours (a speedup of 48). An even larger database (computed in 20 hours) would have required over 600 MByte of internal memory on a uniprocessor and many weeks of computation.
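The propagation step and the message-combining idea can be sketched on a toy subtraction game (an illustrative stand-in; the paper's target game is awari). Positions are partitioned across simulated workers by `n % P`, and updates to predecessor positions are buffered per destination worker and flushed in batches instead of being sent one at a time. All names and the `BATCH` threshold are assumptions for the sketch, not the paper's implementation.

```python
from collections import defaultdict, deque

P = 4                      # simulated workers; positions partitioned by n % P
N = 30                     # toy subtraction game: take 1 or 2 stones from a pile of n
MOVES = (1, 2)

succ_count = {n: sum(1 for m in MOVES if n - m >= 0) for n in range(N + 1)}
value = {0: "LOSS"}        # no stones left: the player to move has lost
frontier = deque([0])
outbox = defaultdict(list) # message combining: one batch of updates per destination
BATCH = 8                  # flush threshold (an illustrative tuning knob)

def flush(dest):
    for pos, child_value in outbox.pop(dest, ()):
        if pos in value:
            continue
        if child_value == "LOSS":      # some move reaches a lost position: win
            value[pos] = "WIN"
            frontier.append(pos)
        else:
            succ_count[pos] -= 1
            if succ_count[pos] == 0:   # every move reaches a won position: loss
                value[pos] = "LOSS"
                frontier.append(pos)

def send(pos, child_value):
    dest = pos % P
    outbox[dest].append((pos, child_value))
    if len(outbox[dest]) >= BATCH:     # combined messages cut per-update overhead
        flush(dest)

while frontier or outbox:
    while frontier:
        pos = frontier.popleft()
        for m in MOVES:                # predecessors pos + m can move into pos
            if pos + m <= N:
                send(pos + m, value[pos])
    for dest in list(outbox):          # drain the remaining partial batches
        flush(dest)

# In the (1, 2)-subtraction game, the multiples of 3 are the losing positions.
assert all((value[n] == "LOSS") == (n % 3 == 0) for n in range(N + 1))
```

On a real distributed machine each `outbox` batch would be one network message; the point of combining is that the per-message fixed cost is paid once per batch rather than once per position update.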
{"title":"Parallel Retrograde Analysis on a Distributed System","authors":"H. Bal, L. Allis","doi":"10.1145/224170.224470","DOIUrl":"https://doi.org/10.1145/224170.224470","url":null,"abstract":"Retrograde Analysis (RA) is an AI search technique used to compute endgame databases, which contain optimal solutions for part of the search space of a game. RA has been applied successfully to several games, but its usefulness is restricted by the huge amount of CPU time and internal memory it requires. We present a parallel distributed algorithm for RA that addresses these problems. RA is hard to parallelize efficiently, because the communication overhead potentially is enormous. We show that the overhead can be reduced drastically using message combining. We implemented the algorithm on an Ethernet-based distributed system. For one example game (awari), we have computed a large database in 50 minutes on 64 processors, whereas one machine took 40 hours (a speedup of 48). An even larger database (computed in 20 hours) would have required over 600 MByte of internal memory on a uniprocessor and would compute for many weeks.","PeriodicalId":269909,"journal":{"name":"Proceedings of the IEEE/ACM SC95 Conference","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130664042","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 32
Efficient Support of Location Transparency in Concurrent Object-Oriented Programming Languages
Pub Date : 1995-12-08 DOI: 10.1145/224170.224297
Wooyoung Kim, G. Agha
We describe the design of a runtime system for a fine-grained concurrent object-oriented (actor) language and its performance. The runtime system provides considerable flexibility to users; specifically, it supports location transparency, actor creation and dynamic placement, and migration. The runtime system includes an efficient distributed name server, a latency hiding scheme for remote actor creation, and a compiler-controlled intra-node scheduling mechanism for local messages and dynamic load balancing. Our preliminary evaluation results suggest that the efficiency that is lost by the greater flexibility of actors can be restored by an efficient runtime system which provides an open interface that can be used by a compiler to allow optimizations. On several standard algorithms, the performance results for our system are comparable to efficient C implementations.
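The name-server idea behind location transparency can be sketched as follows. Senders address actors by name; the runtime resolves the name to a node, so creation, placement, and migration are invisible to the sender. The class and method names (`Runtime`, `Node`, `NameServer`) are illustrative assumptions, not the paper's API, and the name server here is a single dictionary rather than the paper's distributed one.

```python
class NameServer:
    """Maps actor names to nodes (distributed in the paper; a dict here)."""
    def __init__(self):
        self.location = {}            # actor name -> node id
    def register(self, name, node):
        self.location[name] = node
    def lookup(self, name):
        return self.location[name]

class Node:
    def __init__(self, node_id, name_server):
        self.id = node_id
        self.ns = name_server
        self.actors = {}              # local actors: name -> behavior (a callable)
        self.mailbox = []
    def create(self, name, behavior):
        self.actors[name] = behavior
        self.ns.register(name, self.id)
    def deliver(self, name, msg):
        self.mailbox.append((name, msg))
    def run(self):                    # process all pending messages locally
        log = []
        while self.mailbox:
            name, msg = self.mailbox.pop(0)
            log.append(self.actors[name](msg))
        return log

class Runtime:
    def __init__(self, n_nodes):
        self.ns = NameServer()
        self.nodes = [Node(i, self.ns) for i in range(n_nodes)]
    def send(self, name, msg):
        # location transparency: the sender names the actor, not its node
        self.nodes[self.ns.lookup(name)].deliver(name, msg)
    def migrate(self, name, new_node):
        # simplifying assumption: the actor's mailbox is empty when it moves
        old = self.nodes[self.ns.lookup(name)]
        self.nodes[new_node].actors[name] = old.actors.pop(name)
        self.ns.register(name, new_node)

rt = Runtime(2)
rt.nodes[0].create("echo", lambda msg: "echo:" + msg)
rt.send("echo", "hi")                  # resolved via the name server to node 0
assert rt.nodes[0].run() == ["echo:hi"]
rt.migrate("echo", 1)
rt.send("echo", "there")               # same call site, transparently rerouted
assert rt.nodes[1].run() == ["echo:there"]
```

The send path is identical before and after migration, which is the property the runtime's latency-hiding and placement machinery preserves.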
{"title":"Efficient Support of Location Transparency in Concurrent Object-Oriented Programming Languages","authors":"Wooyoung Kim, G. Agha","doi":"10.1145/224170.224297","DOIUrl":"https://doi.org/10.1145/224170.224297","url":null,"abstract":"We describe the design of a runtime system for a fine-grained concurrent object-oriented (actor) language and its performance. The runtime system provides considerable flexibility to users; specifically, it supports location transparency, actor creation and dynamic placement, and migration. The runtime system includes an efficient distributed name server, a latency hiding scheme for remote actor creation, and a compiler-controlled intra-node scheduling mechanism for local messages and dynamic load balancing. Our preliminary evaluation results suggest that the efficiency that is lost by the greater flexibility of actors can be restored by an efficient runtime system which provides an open interface that can be used by a compiler to allow optimizations. On several standard algorithms, the performance results for our system are comparable to efficient C implementations.","PeriodicalId":269909,"journal":{"name":"Proceedings of the IEEE/ACM SC95 Conference","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125950004","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 50
Parallel Processing of Spaceborne Imaging Radar Data
Pub Date : 1995-12-08 DOI: 10.1145/224170.224281
C. Miller, D. G. Payne, T. Phung, H. Siegel, Roy D. Williams
We discuss the results of a collaborative project on parallel processing of Synthetic Aperture Radar (SAR) data, carried out between the NASA/Jet Propulsion Laboratory (JPL), the California Institute of Technology (Caltech) and Intel Scalable Systems Division (SSD). Through this collaborative effort, we have successfully parallelized the most compute-intensive SAR correlator phase of the Spaceborne Shuttle Imaging Radar-C/X-Band SAR (SIR-C/X-SAR) code, for the Intel Paragon. We describe the data decomposition, the scalable high-performance I/O model, and the node-level optimizations which enable us to obtain efficient processing throughput. In particular, we point out an interesting double level of parallelization arising in the data decomposition which increases substantially our ability to support "high volume" SAR. Results are presented from this code running in parallel on the Intel Paragon. A representative set of SAR data, of size 800 Megabytes, which was collected by the SIR-C/X-SAR instrument aboard NASA's Space Shuttle in 15 seconds, is processed in 55 seconds on the Concurrent Supercomputing Consortium's Paragon XP/S 35+. This compares well with a time of 12 minutes for the current SIR-C/X-SAR processing system at JPL. For the first time, a commercial system can process SIR-C/X-SAR data at a rate which is approaching the rate at which the SIR-C/X-SAR instrument can collect the data. This work has successfully demonstrated the viability of the Intel Paragon supercomputer for processing "high volume" Synthetic Aperture Radar data in near real-time.
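The abstract does not spell out the decomposition, but the basic building block for schemes like this is a contiguous block partition of the data (for instance, azimuth lines across nodes); the paper's two levels of parallelism would apply such a split at two granularities. That interpretation, and the helper below, are assumptions for illustration only.

```python
def block_partition(n_items, n_procs):
    """Contiguous block decomposition; the remainder is spread over leading ranks."""
    base, extra = divmod(n_items, n_procs)
    bounds, start = [], 0
    for rank in range(n_procs):
        size = base + (1 if rank < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds

# e.g. 10 000 azimuth lines over 64 nodes: 16 ranks get 157 lines, 48 get 156
parts = block_partition(10_000, 64)
assert sum(hi - lo for lo, hi in parts) == 10_000
```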
{"title":"Parallel Processing of Spaceborne Imaging Radar Data","authors":"C. Miller, D. G. Payne, T. Phung, H. Siegel, Roy D. Williams","doi":"10.1145/224170.224281","DOIUrl":"https://doi.org/10.1145/224170.224281","url":null,"abstract":"We discuss the results of a collaborative project on parallel processing of Synthetic Aperture Radar (SAR) data, carried out between the NASA/Jet Propulsion Laboratory (JPL), the California Institute of Technology (Caltech) and Intel Scalable Systems Division (SSD). Through this collaborative effort, we have successfully parallelized the most compute-intensive SAR correlator phase of the Spaceborne Shuttle Imaging Radar-C/X-Band SAR (SIR-C/X-SAR) code, for the Intel Paragon. We describe the data decomposition, the scalable high-performance I/O model, and the node-level optimizations which enable us to obtain efficient processing throughput. In particular, we point out an interesting double level of parallelization arising in the data decomposition which increases substantially our ability to support ''high volume'' SAR. Results are presented from this code running in parallel on the Intel Paragon. A representative set of SAR data, of size 800 Megabytes, which was collected by the SIR-C/X-SAR instrument aboard NASA's Space Shuttle in 15 seconds, is processed in 55 seconds on the Concurrent Supercomputing Consortium's Paragon XP/S 35+. This compares well with a time of 12 minutes for the current SIR-C/X-SAR processing system at JPL. For the first time, a commercial system can process SIR-C/X-SAR data at a rate which is approaching the rate at which the SIR-C/X-SAR instrument can collect the data. 
This work has successfully demonstrated the viability of the Intel Paragon supercomputer for processing ''high volume\" Synthetic Aperture Radar data in near real-time.","PeriodicalId":269909,"journal":{"name":"Proceedings of the IEEE/ACM SC95 Conference","volume":"89 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125007360","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 19
Computational Methods for Intelligent Information Access
Pub Date : 1995-12-08 DOI: 10.1145/224170.285569
M. Berry, S. Dumais, Todd A. Letsche
Currently, most approaches to retrieving textual materials from scientific databases depend on a lexical match between words in users’ requests and those in or assigned to documents in a database. Because of the tremendous diversity in the words people use to describe the same document, lexical methods are necessarily incomplete and imprecise. Using the singular value decomposition (SVD), one can take advantage of the implicit higher-order structure in the association of terms with documents by determining the SVD of large, sparse term-by-document matrices. Terms and documents represented by 200-300 of the largest singular vectors are then matched against user queries. We call this retrieval method Latent Semantic Indexing (LSI) because the subspace represents important associative relationships between terms and documents that are not evident in individual documents. LSI is a completely automatic yet intelligent indexing method, widely applicable, and a promising way to improve users’ access to many kinds of textual materials, or to documents and services for which textual descriptions are available. A survey of the computational requirements for managing LSI-encoded databases as well as current and future applications of LSI is presented.
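The LSI idea can be demonstrated end-to-end on a toy corpus. The sketch below keeps only 2 singular vectors (the paper uses 200-300) and computes them with pure-Python power iteration plus deflation rather than a production sparse SVD; the corpus, the 6 terms, and all function names are invented for illustration. The point it shows is the one made above: a query on "car" matches a document containing only "auto truck", which no lexical match could find.

```python
import math, random

terms = ["car", "auto", "truck", "flower", "petal", "stem"]
docs = [["car", "auto"], ["auto", "truck"],
        ["flower", "petal"], ["petal", "stem"]]
A = [[1.0 if t in d else 0.0 for d in docs] for t in terms]  # term-by-document

def top_left_singular_vectors(A, k, iters=200):
    """Power iteration with deflation on A A^T; adequate for a toy matrix."""
    m = len(A)
    AAt = [[sum(A[i][j] * A[l][j] for j in range(len(A[0]))) for l in range(m)]
           for i in range(m)]
    us = []
    rng = random.Random(0)
    for _ in range(k):
        v = [rng.random() for _ in range(m)]
        for _ in range(iters):
            v = [sum(AAt[i][j] * v[j] for j in range(m)) for i in range(m)]
            for u in us:                       # deflate directions already found
                dot = sum(a * b for a, b in zip(v, u))
                v = [a - dot * b for a, b in zip(v, u)]
            norm = math.sqrt(sum(a * a for a in v))
            v = [a / norm for a in v]
        us.append(v)
    return us

U = top_left_singular_vectors(A, k=2)          # 2 concepts here; 200-300 in the paper

def concept(vec):                              # project a term-space vector into LSI space
    return [sum(u[i] * vec[i] for i in range(len(vec))) for u in U]

def cosine(a, b):
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

query = [1.0 if t == "car" else 0.0 for t in terms]
doc_vecs = [concept([A[i][j] for i in range(len(terms))]) for j in range(len(docs))]
sims = [cosine(concept(query), dv) for dv in doc_vecs]
# "car" never occurs in doc 1 ("auto truck"), yet LSI ranks it above the flower docs
assert sims[1] > sims[2] and sims[1] > sims[3]
```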
{"title":"Computational Methods for Intelligent Information Access","authors":"M. Berry, S. Dumais, Todd A. Letsche","doi":"10.1145/224170.285569","DOIUrl":"https://doi.org/10.1145/224170.285569","url":null,"abstract":"Currently, most approaches to retrieving textual materials from scientific databases depend on a lexical match between words in users’ requests and those in or assigned to documents in a database. Because of the tremendous diversity in the words people use to describe the same document, lexical methods are necessarily incomplete and imprecise. Using the singular value decomposition (SVD), one can take advantage of the implicit higher-order structure in the association of terms with documents by determining the SVD of large sparse term by document matrices. Terms and documents represented by 200-300 of the largest singular vectors are then matched against user queries. We call this retrieval method Latent Semantic Indexing (LSI) because the subspace represents important associative relationships between terms and documents that are not evident in individual documents. LSI is a completely automatic yet intelligent indexing method, widely applicable, and a promising way to improve users’ access to many kinds of textual materials, or to documents and services for which textual descriptions are available. 
A survey of the computational requirements for managing LSI-encoded databases as well as current and future applications of LSI is presented.","PeriodicalId":269909,"journal":{"name":"Proceedings of the IEEE/ACM SC95 Conference","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121619459","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 166
Automated Performance Prediction of Message-Passing Parallel Programs
Pub Date : 1995-12-08 DOI: 10.1145/224170.224273
R. Block, S. Sarukkai, P. Mehra
The increasing use of massively parallel supercomputers to solve large-scale scientific problems has generated a need for tools that can predict scalability trends of applications written for these machines. Much work has been done to create simple models that represent important characteristics of parallel programs, such as latency, network contention, and communication volume. But many of these methods still require substantial manual effort to represent an application in the model's format. The MK toolkit described in this paper is the result of an ongoing effort to automate the formation of analytic expressions of program execution time, with a minimum of programmer assistance. In this paper we demonstrate the feasibility of our approach, by extending previous work to detect and model communication patterns automatically, with and without overlapped computations. The predictions derived from these models agree, within reasonable limits, with execution times of programs measured on the Intel iPSC/860 and Paragon. Further, we demonstrate the use of MK in selecting optimal computational grain size and studying various scalability metrics.
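An analytic expression of the kind such tools derive is easy to evaluate once the communication pattern is known. The sketch below models a program as perfectly divisible compute plus a fixed per-step neighbour exchange; every parameter value is an illustrative assumption, not a measurement from the paper, and the model form is a generic latency-plus-bandwidth cost, not MK's actual output.

```python
# All parameters below are illustrative assumptions, not the paper's measurements.
T_SEQ = 40.0          # sequential compute time (seconds)
LATENCY = 100e-6      # per-message latency (seconds)
BANDWIDTH = 80e6      # link bandwidth (bytes/second)
MSG_BYTES = 8192      # bytes exchanged with each neighbour per step
STEPS = 1000          # iterations, each with a two-neighbour exchange

def predicted_time(p):
    """Analytic model: divisible compute plus per-step communication cost."""
    compute = T_SEQ / p
    comm = STEPS * 2 * (LATENCY + MSG_BYTES / BANDWIDTH) if p > 1 else 0.0
    return compute + comm

speedups = {p: T_SEQ / predicted_time(p) for p in (1, 4, 16, 64)}
# Communication overhead flattens the curve well below linear speedup
assert speedups[1] == 1.0
assert speedups[16] < speedups[64] < 64
```

Sweeping such a model over processor counts or grain sizes is exactly how one reads off scalability trends without running the machine.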
{"title":"Automated Performance Prediction of Message-Passing Parallel Programs","authors":"R. Block, S. Sarukkai, P. Mehra","doi":"10.1145/224170.224273","DOIUrl":"https://doi.org/10.1145/224170.224273","url":null,"abstract":"The increasing use of massively parallel supercomputers to solve large-scale scientific problems has generated a need for tools that can predict scalability trends of applications written for these machines. Much work has been done to create simple models that represent important characteristics of parallel programs, such as latency, network contention, and communication volume. But many of these methods still require substantial manual effort to represent an application in the model's format. The MK toolkit described in this paper is the result of an on-going effort to automate the formation of analytic expressions of program execution time, with a minimum of programmer assistance. In this paper we demonstrate the feasibility of our approach, by extending previous work to detect and model communication patterns automatically, with and without overlapped computations. The predictions derived from these models agree, within reasonable limits, with execution times of programs measured on the Intel iPSC/860 and Paragon. Further, we demonstrate the use of MK in selecting optimal computational grain size and studying various scalability metrics.","PeriodicalId":269909,"journal":{"name":"Proceedings of the IEEE/ACM SC95 Conference","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121639635","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 15
Model and Call Admission Control for Distributed Applications with Correlated Bursty Traffic
Pub Date : 1995-12-08 DOI: 10.1145/224170.224190
Jose Roberto Fernandez, M. Mutka
As network capacities increase, wide-area distributed parallel computing may become feasible. This paper addresses one of the issues involved in using an asynchronous transfer mode (ATM) network for such a purpose — that of developing an appropriate call admission control (CAC) procedure for such applications given the special nature of their traffic. In this proposal, connections belonging to the same application and sharing the same link are allowed to utilize the link bandwidth in a strongly correlated manner. However, connections belonging to different applications are still assumed to be independent. This allows the development of a tabular approach for keeping track of the aggregate bandwidth demand of the applications sharing the same link. The proposed approach is compared with two related approaches (one more conservative and another more aggressive) and is shown to strike a balance between utilization and loss rate.
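The correlated-within, independent-across structure can be made concrete with a small admission check. The sketch below assumes on-off connections with a peak rate and an activity probability, adds peaks deterministically within an application's table entry, and applies a Gaussian bound across applications; the on-off model, the `k` safety factor, and the class interface are all assumptions for illustration, not the paper's exact procedure.

```python
import math

class Link:
    """One link's admission table: aggregate demand tracked per application.

    Illustrative assumptions: each connection is on-off with a peak rate and
    an activity probability; connections of one application burst together
    (fully correlated), so their peaks add within that application's entry;
    different applications are independent, so a Gaussian bound is applied
    across entries.
    """
    def __init__(self, capacity, k=3.0):
        self.capacity = capacity
        self.k = k                    # safety factor standing in for a loss target
        self.apps = {}                # app id -> (aggregate peak, activity prob)

    def admit(self, app, peak, activity):
        trial = dict(self.apps)
        agg, act = trial.get(app, (0.0, activity))
        trial[app] = (agg + peak, act)          # correlated peaks add within an app
        mean = sum(p * a for p, a in trial.values())
        var = sum(a * (1.0 - a) * p * p for p, a in trial.values())
        if mean + self.k * math.sqrt(var) > self.capacity:
            return False                        # admission would violate the bound
        self.apps = trial
        return True

link = Link(capacity=100.0)
assert link.admit("A", peak=30.0, activity=0.5)
assert not link.admit("A", peak=30.0, activity=0.5)  # same app: peaks coincide
assert link.admit("B", peak=30.0, activity=0.5)      # independent app still fits
```

The last two calls show the payoff of per-application bookkeeping: a second correlated connection from "A" is rejected, while an identical but independent connection from "B" is admitted on the same link.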
{"title":"Model and Call Admission Control for Distributed Applications with Correlated Bursty Traffic","authors":"Jose Roberto Fernandez, M. Mutka","doi":"10.1145/224170.224190","DOIUrl":"https://doi.org/10.1145/224170.224190","url":null,"abstract":"As network capacities increase, wide-area distributed parallel computing may become feasible. This paper addresses one of the issues involved in using an asynchronous transfer mode (ATM) network for such a purpose — that of developing an appropriate call admission control (CAC) procedure for such applications given the special nature of their traffic. In this proposal, connections belonging to the same application and sharing the same link are allowed to utilize the linkbandwidth in a strongly correlated manner. However, connections belonging to different applications are still assumed to be independent. This allows the development of a tabular approach for keeping track of the aggregate bandwidth demand of the applications sharing the same link. The proposed approach is compared with two related approaches (one more conservative and another more aggressive) and is shown to strike a balance between utilization and loss rate.","PeriodicalId":269909,"journal":{"name":"Proceedings of the IEEE/ACM SC95 Conference","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133008198","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
Communication Optimizations for Parallel Computing Using Data Access Information
Pub Date : 1995-12-08 DOI: 10.1145/224170.224413
M. Rinard
Given the large communication overheads characteristic of modern parallel machines, optimizations that eliminate, hide or parallelize communication may improve the performance of parallel computations. This paper describes our experience automatically applying communication optimizations in the context of Jade, a portable, implicitly parallel programming language designed for exploiting task-level concurrency. Jade programmers start with a program written in a standard serial, imperative language, then use Jade constructs to declare how parts of the program access data. The Jade implementation uses this data access information to automatically extract the concurrency and apply communication optimizations. Jade implementations exist for both shared memory and message passing machines; each Jade implementation applies communication optimizations appropriate for the machine on which it runs. We present performance results for several Jade applications running on both a shared memory machine (the Stanford DASH machine) and a message passing machine (the Intel iPSC/860). We use these results to characterize the overall performance impact of the communication optimizations. For our application set, replicating data for concurrent read access and improving the locality of the computation by placing tasks close to the data that they access are the most important optimizations.
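Extracting concurrency from declared access sets can be sketched with a tiny phase scheduler: each task declares what it reads and writes, two tasks conflict when one writes data the other touches, and non-conflicting tasks land in the same parallel phase. This is a simplified stand-in for illustration; Jade's actual implementation tracks dependences dynamically rather than batching tasks into static phases.

```python
def conflicts(a, b):
    """Two tasks conflict if one writes data the other reads or writes."""
    return bool(a["writes"] & (b["reads"] | b["writes"])
                or b["writes"] & a["reads"])

def schedule(tasks):
    """Group tasks (in program order) into phases of mutually independent tasks."""
    phases = []
    for task in tasks:
        last = -1                     # last phase holding a conflicting task
        for i, phase in enumerate(phases):
            if any(conflicts(task, t) for t in phase):
                last = i
        if last + 1 == len(phases):
            phases.append([])
        phases[last + 1].append(task) # earliest phase that preserves ordering
    return phases

tasks = [
    {"name": "init_a",  "reads": set(),      "writes": {"a"}},
    {"name": "init_b",  "reads": set(),      "writes": {"b"}},
    {"name": "combine", "reads": {"a", "b"}, "writes": {"c"}},
]
plan = [[t["name"] for t in ph] for ph in schedule(tasks)]
assert plan == [["init_a", "init_b"], ["combine"]]
```

The same declared read/write sets are what make the communication optimizations possible: a datum only in read sets can be replicated, and a task can be placed on the node holding most of its declared data.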
{"title":"Communication Optimizations for Parallel Computing Using Data Access Information","authors":"M. Rinard","doi":"10.1145/224170.224413","DOIUrl":"https://doi.org/10.1145/224170.224413","url":null,"abstract":"Given the large communication overheads characteristic of modern parallel machines, optimizations that eliminate, hide or parallelize communication may improve the performance of parallel computations. This paper describes our experience automatically applying communication optimizations in the context of Jade, a portable, implicitly parallel programming language designed for exploiting task-level concurrency. Jade programmers start with a program written in a standard serial, imperative language, then use Jade constructs to declare how parts of the program access data. The Jade implementation uses this data access information to automatically extract the concurrency and apply communication optimizations. Jade implementations exist for both shared memory and message passing machines; each Jade implementation applies communication optimizations appropriate for the machine on which it runs. We present performance results for several Jade applications running on both a shared memory machine (the Stanford DASH machine) and a message passing machine (the Intel iPSC/860). We use these results to characterize the overall performance impact of the communication optimizations. For our application set replicating data for concurrent read access and improving the locality of the computation by placing tasks close to the data that they access are the most important optimizations. 
Broadcasting widely accessed data has a significant performance impact on one application; other optimizations such as concurrently fetching remote data and overlapping computation with communication have no effect.","PeriodicalId":269909,"journal":{"name":"Proceedings of the IEEE/ACM SC95 Conference","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124989920","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 7
Input/Output Characteristics of Scalable Parallel Applications
Pub Date : 1995-12-08 DOI: 10.1145/224170.224396
Phyllis E. Crandall, R. Aydt, A. Chien, D. Reed
Rapid increases in computing and communication performance are exacerbating the long-standing problem of performance-limited input/output. Indeed, for many otherwise scalable parallel applications, input/output is emerging as a major performance bottleneck. The design of scalable input/output systems depends critically on the input/output requirements and access patterns for this emerging class of large-scale parallel applications. However, hard data on the behavior of such applications is only now becoming available. In this paper, we describe the input/output requirements of three scalable parallel applications (electron scattering, terrain rendering, and quantum chemistry) on the Intel Paragon XP/S. As part of an ongoing parallel input/output characterization effort, we used instrumented versions of the application codes to capture and analyze input/output volume, request size distributions, and temporal request structure. Because complete traces of individual application input/output requests were captured, in-depth, off-line analyses were possible. In addition, we conducted informal interviews of the application developers to understand the relation between the codes' current and desired input/output structure. The results of our studies show a wide variety of temporal and spatial access patterns, including highly read-intensive and write-intensive phases, extremely large and extremely small request sizes, and both sequential and highly irregular access patterns. We conclude with a discussion of the broad spectrum of access patterns and their profound implications for parallel file caching and prefetching schemes.
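The kind of off-line analysis described here reduces to summarizing a request trace. The sketch below fabricates a synthetic trace of (time, op, bytes) records purely for illustration (it is not the papers' data), then computes a request-size distribution and labels fixed windows by their dominant operation to expose read- and write-intensive phases.

```python
from collections import Counter

# Synthetic trace of (time, op, bytes) records; a stand-in for the instrumented
# application traces described above, not measured data.
trace = ([(t, "read", 4096) for t in range(0, 50)] +
         [(t, "write", 1 << 20) for t in range(50, 75)] +
         [(t, "read", 128) for t in range(75, 100)])

size_hist = Counter(nbytes for _, _, nbytes in trace)   # request-size distribution
read_bytes = sum(b for _, op, b in trace if op == "read")
write_bytes = sum(b for _, op, b in trace if op == "write")

def phases(trace, window=25):
    """Label each fixed-size window of the trace by its dominant operation."""
    labels = []
    for start in range(0, len(trace), window):
        ops = [op for _, op, _ in trace[start:start + window]]
        labels.append(max(set(ops), key=ops.count))
    return labels

# Both extremes appear: 128-byte and 1 MB requests, plus a write-intensive phase
assert phases(trace) == ["read", "read", "write", "read"]
assert min(size_hist) == 128 and max(size_hist) == 1 << 20
```

Summaries like these are exactly what a file system designer needs to size caches and decide when sequential prefetching will pay off.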
{"title":"Input/Output Characteristics of Scalable Parallel Applications","authors":"Phyllis E. Crandall, R. Aydt, A. Chien, D. Reed","doi":"10.1145/224170.224396","DOIUrl":"https://doi.org/10.1145/224170.224396","url":null,"abstract":"Rapid increases in computing and communication performance are exacerbating the long-standing problem of performance-limited input/output. Indeed, for many otherwise scalable parallel applications. input/output is emerging as a major performance bottleneck. The design of scalable input/output systems depends critically on the input/output requirements and access patterns for this emerging class of large-scale parallel applications. However, hard data on the behavior of such applications is only now becoming available. In this paper, we describe the input-output requirements of three scalable parallel applications (electron scattering, terrain rendering, and quantum chemistry, on the Intel Paragon XP/S. As part of an ongoing parallel input/output characterization effort, we used instrumented versions of the application codes to capture and analyze input/output volume, request size distributions, and temporal request structure. Because complete traces of individual application input/output requests were captured, in-depth, off-line analyses were possible. In addition, we conducted informal interviews of the application developers to understand the relation between the codes' current and desired input/output structure. The results of our studies show a wide variety of temporal and spatial access patterns, including highly read-intensive and write-intensive phases, extremely large and extremely small request sizes, and both sequential and highly irregular access patterns. 
We conclude with a discussion of the broad spectrum of access patterns and their profound implications for parallel file caching and prefetching schemes.","PeriodicalId":269909,"journal":{"name":"Proceedings of the IEEE/ACM SC95 Conference","volume":"110 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115133499","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 210
An Integrated Compilation and Performance Analysis Environment for Data Parallel Programs
Pub Date : 1995-12-08 DOI: 10.1145/224170.224340
Vikram S. Adve, J. Mellor-Crummey, Mark Anderson, K. Kennedy, Jhy-Chun Wang, D. Reed
Supporting source-level performance analysis of programs written in data-parallel languages requires a unique degree of integration between compilers and performance analysis tools. Compilers for languages such as High Performance Fortran infer parallelism and communication from data distribution directives; thus, performance tools cannot meaningfully relate measurements about these key aspects of execution performance to source-level constructs without substantial compiler support. This paper describes an integrated system for performance analysis of data-parallel programs based on the Rice Fortran 77D compiler and the Illinois Pablo performance analysis toolkit. During code generation, the Fortran D compiler records mapping information and semantic analysis results describing the relationship between performance instrumentation and the original source program. An integrated performance analysis system based on the Pablo toolkit uses this information to correlate the program's dynamic behavior with the data parallel source code. The integrated system provides detailed source-level performance feedback to programmers via a pair of graphical interfaces. Our strategy serves as a model for integration of data-parallel compilers and performance tools.
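The essence of the compiler-recorded mapping is a table from instrumentation points back to source constructs, which lets a tool attribute measured costs to source lines. The point names, line numbers, and timings below are wholly hypothetical; this sketches only the attribution step, not the Fortran D or Pablo formats.

```python
from collections import defaultdict

# Hypothetical mapping of the kind the compiler records at code generation:
# each instrumentation point is tagged with the source line of the
# data-parallel construct it was generated from.
mapping = {"send_17": 12, "recv_18": 12, "loop_body_3": 15}

# Measured samples: (instrumentation point, seconds) -- invented numbers
samples = [("send_17", 0.5), ("loop_body_3", 2.0), ("recv_18", 0.5)]

by_source_line = defaultdict(float)
for point, seconds in samples:
    by_source_line[mapping[point]] += seconds  # attribute cost to source constructs

assert by_source_line[12] == 1.0   # communication generated from source line 12
assert by_source_line[15] == 2.0   # computation generated from source line 15
```

Without the mapping, the tool could only report costs against compiler-generated sends and receives the programmer never wrote.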
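The correlation step this abstract describes can be sketched as a join between a compiler-emitted mapping table and run-time trace records. Everything below is hypothetical: the file name, instrumentation-point IDs, and record layout are invented for illustration and are not the Fortran D/Pablo formats.

```python
from collections import defaultdict

# Hypothetical mapping table a compiler might emit alongside the
# instrumented code: each instrumentation point is tied back to the
# source-level construct whose cost it measures.
mapping = {
    0: {"file": "jacobi.f", "line": 12, "construct": "FORALL loop"},
    1: {"file": "jacobi.f", "line": 12, "construct": "shift communication"},
}

# Raw trace records produced at run time: (point id, elapsed seconds).
trace = [(0, 0.42), (1, 0.17), (0, 0.40), (1, 0.19)]

# The analysis tool aggregates measurements per source construct,
# so feedback is phrased in terms of the program the user wrote.
totals = defaultdict(float)
for point, secs in trace:
    m = mapping[point]
    totals[(m["file"], m["line"], m["construct"])] += secs

for (fname, line, construct), secs in sorted(totals.items()):
    print(f"{fname}:{line} {construct}: {secs:.2f}s")
```

The point of the join is that raw measurements alone name only instrumentation points in generated code; only the compiler knows which source construct each point belongs to.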
Citations: 136
Architectural Mechanisms for Explicit Communication in Shared Memory Multiprocessors
Pub Date : 1995-12-08 DOI: 10.1145/224170.224399
U. Ramachandran, Gautam Shah, A. Sivasubramaniam, A. Singla, I. Yanasak
The goal of this work is to explore architectural mechanisms for supporting explicit communication in cache-coherent shared memory multiprocessors. The motivation stems from the observation that applications display wide diversity in their sharing characteristics and hence impose different communication requirements on the system. Explicit communication mechanisms would allow tailoring the coherence management under software control to match these differing needs, striving to provide a close approximation to a zero-overhead machine from the application perspective. Toward achieving these goals, we first analyze the sharing characteristics observed in certain specific applications. We then use these characteristics to synthesize explicit communication primitives. The proposed primitives allow selectively updating a set of processors, or requesting a stream of data ahead of its intended use. These primitives are essentially generalizations of prefetch and poststore, with the ability to specify the sharer set for poststore either statically or dynamically. The proposed primitives are to be used in conjunction with an underlying invalidation-based protocol. Used in this manner, the resulting memory system can dynamically adapt itself to performing either invalidations or updates to match the communication needs. Through an application-driven performance study we show the utility of these mechanisms in reducing and tolerating communication latencies.
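The contrast between the default invalidation protocol and a poststore with an explicit sharer set can be sketched with a toy directory model. This is a simulation under stated assumptions, not the paper's hardware design; the class and method names are illustrative.

```python
class Directory:
    """Toy directory-based coherence: a write normally invalidates all
    other cached copies, while poststore pushes the new value to an
    explicitly named sharer set instead, so those sharers' next reads
    hit locally without a coherence miss."""

    def __init__(self, n_procs):
        self.caches = [dict() for _ in range(n_procs)]
        self.memory = {}

    def read(self, p, addr):
        if addr not in self.caches[p]:           # miss: fetch from memory
            self.caches[p][addr] = self.memory.get(addr, 0)
        return self.caches[p][addr]

    def write(self, p, addr, value):
        self.memory[addr] = value
        for q, cache in enumerate(self.caches):  # invalidate other copies
            if q != p:
                cache.pop(addr, None)
        self.caches[p][addr] = value

    def poststore(self, p, addr, value, sharers):
        self.memory[addr] = value
        self.caches[p][addr] = value
        for q, cache in enumerate(self.caches):
            if q == p:
                continue
            if q in sharers:
                cache[addr] = value              # push the update
            else:
                cache.pop(addr, None)            # others invalidated as usual

d = Directory(4)
d.read(1, 0x10)                      # proc 1 caches the line
d.poststore(0, 0x10, 7, sharers={1}) # proc 0 writes, updates proc 1 in place
```

The sharer set is the knob the abstract emphasizes: chosen well (statically or dynamically), it converts consumers' coherence misses into local hits; chosen badly, it wastes bandwidth pushing updates nobody reads.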
Citations: 37