
Proceedings 11th International Parallel Processing Symposium: Latest Publications

External adjustment of runtime parameters in Time Warp synchronized parallel simulators
Pub Date : 1997-04-01 DOI: 10.1109/IPPS.1997.580905
R. Radhakrishnan, L. Moore, P. Wilsey
Several optimizations to the Time Warp synchronization protocol for parallel discrete event simulation have been proposed and studied. Many of these optimizations have included some form of dynamic adjustment (or control) of the operating parameters of the simulation (e.g. checkpoint interval, cancellation strategy). Traditionally dynamic parameter adjustment has been performed at the simulation object level; each simulation object collects measures of its operating behaviors (e.g. rollback frequency, rollback length, etc.) and uses them to adjust its operating parameters. The performance data collection functions and parameter adjustment are overhead costs that are incurred in the expectation of higher throughput. The paper presents a method of eliminating some of these overheads through the use of an external object to adjust the control parameters. That is, instead of inserting code for adjusting simulation parameters in the simulation object, an external control object is defined to periodically analyze each simulation object's performance data and revise that object's operating parameters. An implementation of an external control object in the WARPED Time Warp simulation kernel has been completed. The simulation parameters updated by the implemented control system are: checkpoint interval, and cancellation strategy (lazy or aggressive). A comparative analysis of three test cases shows that the external control mechanism provides speedups between 5%-17% over the best performing embedded dynamic adjustment algorithms.
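As a rough illustration of the external-control idea described above, the following Python sketch (hypothetical, not the WARPED implementation; all class names, fields, and thresholds are invented for illustration) shows a control object that periodically inspects a simulation object's rollback statistics and revises its checkpoint interval and cancellation strategy:

```python
# Minimal sketch (not the WARPED implementation) of an external control object
# that periodically revises a simulation object's runtime parameters from the
# performance data it reports. All names and thresholds are hypothetical.
from dataclasses import dataclass

@dataclass
class PerfStats:
    rollback_frequency: float    # rollbacks per committed event
    mean_rollback_length: float  # events undone per rollback

@dataclass
class SimParams:
    checkpoint_interval: int     # events between state saves
    lazy_cancellation: bool      # True = lazy, False = aggressive

class ExternalController:
    """Lives outside the simulation objects; invoked periodically for each one."""

    def revise(self, stats: PerfStats, params: SimParams) -> SimParams:
        # Illustrative heuristic only: frequent rollbacks favour cheaper
        # recovery, i.e. shorter checkpoint intervals and aggressive cancellation.
        if stats.rollback_frequency > 0.2:
            interval = max(1, params.checkpoint_interval - 1)
            lazy = False
        else:
            interval = params.checkpoint_interval + 1
            lazy = stats.mean_rollback_length < 4.0
        return SimParams(checkpoint_interval=interval, lazy_cancellation=lazy)

controller = ExternalController()
params = SimParams(checkpoint_interval=4, lazy_cancellation=True)
params = controller.revise(PerfStats(rollback_frequency=0.3, mean_rollback_length=2.5), params)
print(params)   # SimParams(checkpoint_interval=3, lazy_cancellation=False)
```

Keeping the analysis in a separate controller mirrors the abstract's point: the simulation objects themselves carry less instrumentation and adjustment code on their event-processing path.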
Citations: 8
Lower bounds on systolic gossip
Pub Date : 1997-04-01 DOI: 10.1109/IPPS.1997.580949
M. Flammini, S. Pérennes
Gossiping is an information dissemination process in which each processor has a distinct item of information and has to collect all the items possessed by the other processors. We derive lower bounds on the gossiping time of systolic protocols, i.e. protocols constituted by a periodic repetition of simple communication steps. In particular, if we denote by n the number of processors in the network, then for directed networks and for undirected networks in the half-duplex mode any s-systolic gossip protocol takes at least g(s) log₂ n time steps, where g(4)=1.8133, g(6)=1.5310 and g(8)=1.4721. For the case s=4 this result is improved to 2.0218 log₂ n for directed butterflies of degree 2, and we show that the 2.0218 log₂ n and 1.8133 log₂ n lower bounds hold also, respectively, for undirected butterfly and de Bruijn networks of degree 2 in the full-duplex case. Our results are obtained by means of a new technique relying on two novel concepts in the field: the notion of the delay digraph of a systolic protocol and the use of matrix norm methods.
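The quoted bounds are easy to evaluate numerically; the short script below is a purely numerical illustration that plugs the stated constants into g(s) log₂ n for a sample network size, and is not part of the proof:

```python
# Plugging the constants quoted in the abstract into g(s) * log2(n); a purely
# numerical illustration of the stated bounds, not part of the proof.
import math

G = {4: 1.8133, 6: 1.5310, 8: 1.4721}   # directed / half-duplex, s-systolic
G_BUTTERFLY_S4 = 2.0218                  # s = 4, directed butterflies of degree 2

def gossip_lower_bound(n: int, s: int) -> float:
    """Minimum number of time steps of any s-systolic gossip protocol."""
    return G[s] * math.log2(n)

n = 1024
for s in sorted(G):
    print(f"s = {s}: at least {gossip_lower_bound(n, s):.1f} steps for n = {n}")
print(f"s = 4, degree-2 butterfly: at least {G_BUTTERFLY_S4 * math.log2(n):.1f} steps")
```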
Citations: 7
Parallel simulated annealing: an adaptive approach
Pub Date : 1997-04-01 DOI: 10.1109/IPPS.1997.580950
J. Knopman, J. S. Aude
This paper analyses alternatives for the parallelization of the Simulated Annealing algorithm when applied to the placement of modules in a VLSI circuit, considering the use of PVM on an Ethernet cluster of workstations. It is shown that different parallelization approaches have to be used for high and low temperature values of the annealing process. The algorithm used for low temperatures is an adaptive version of the speculative algorithm proposed in the literature. Within this adaptive algorithm, the number of processors allocated to the solution of the placement problem and the number of moves evaluated per processor between synchronization points change with the temperature. At high temperatures, an algorithm based on the parallel evaluation of independent chains of moves has been adopted. It is shown that results with the same quality as those produced by the serial version can be obtained when shorter chains are used in the parallel implementation.
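The following Python sketch illustrates the kind of temperature-dependent strategy switch the abstract describes: independent chains at high temperature, and a speculative scheme with adaptive processor count and chain length at low temperature. The acceptance-rate threshold and the sizing formulas are invented for illustration and are not the authors' actual tuning rules:

```python
# Hedged sketch of the temperature-dependent choice the abstract describes:
# independent chains at high temperature, an adaptive speculative scheme at
# low temperature. The acceptance-rate threshold and the sizing formulas are
# invented for illustration and are not the authors' actual tuning rules.
def choose_strategy(acceptance_rate: float, max_processors: int) -> dict:
    HIGH_TEMP_ACCEPTANCE = 0.5   # assumed boundary between the two regimes
    if acceptance_rate > HIGH_TEMP_ACCEPTANCE:
        # High temperature: most moves are accepted, so evaluate independent
        # chains in parallel and synchronize after short chains.
        return {"strategy": "independent_chains",
                "processors": max_processors,
                "moves_per_sync": 8}
    # Low temperature: rejections dominate, so speculatively evaluating the
    # likely chain of rejections pays off; the useful speculation depth grows
    # roughly like the expected run of rejections, 1 / acceptance_rate.
    depth = round(1 / max(acceptance_rate, 0.01))
    return {"strategy": "speculative",
            "processors": min(max_processors, depth),
            "moves_per_sync": depth}

print(choose_strategy(acceptance_rate=0.8, max_processors=16))   # hot regime
print(choose_strategy(acceptance_rate=0.05, max_processors=16))  # cold regime
```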
Citations: 11
The impact of timing on linearizability in counting networks
Pub Date : 1997-04-01 DOI: 10.1109/IPPS.1997.580978
M. Mavronicolas, M. Papatriantafilou, P. Tsigas
Counting networks form a new class of distributed, low-contention data structures made up of interconnected balancers, and are suitable for solving a variety of multiprocessor synchronization problems that can be expressed as counting problems. A linearizable counting network guarantees that the order of the values it returns respects the real-time order they were requested. Linearizability significantly raises the capabilities of the network, but at a possible price in network size or synchronization support. In this paper, we further pursue the systematic study of the impact of timing on linearizability for counting networks, along a research line initiated by Lynch et al. (1996). We consider two basic timing models: the instantaneous balancer model, in which the transition of a token from an input to an output port of a balancer is modeled as an instantaneous event, and the periodic balancer model, where balancers send out tokens at a fixed rate. We also consider lower and upper bounds on the delays incurred by wires connecting the balancers. We present necessary and sufficient conditions for linearizability in the form of precise inequalities that involve timing parameters and identify structural parameters of the counting network, which may be of more general interest. Our results significantly extend and strengthen previous impossibility and possibility results on linearizability in counting networks (Herlihy et al., 1990; Lynch et al., 1996).
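A balancer, the building block these networks are composed of, can be sketched in a few lines. The toy Python class below is illustrative only and models none of the paper's timing assumptions; it simply routes each arriving token alternately to its two output wires, which is the property counting networks compose to spread tokens evenly:

```python
# Minimal sketch of a balancer, the building block of counting networks: each
# arriving token leaves alternately on the top and bottom output wire, so the
# two outputs never differ by more than one. Timing (instantaneous vs.
# periodic balancers, wire delays) is deliberately not modelled here.
import itertools
import threading

class Balancer:
    def __init__(self):
        self._toggle = itertools.cycle([0, 1])
        self._lock = threading.Lock()

    def traverse(self) -> int:
        """Route one token; returns the output wire taken (0 = top, 1 = bottom)."""
        with self._lock:
            return next(self._toggle)

b = Balancer()
outputs = [b.traverse() for _ in range(7)]
print(outputs)                              # alternates: [0, 1, 0, 1, 0, 1, 0]
print(outputs.count(0), outputs.count(1))   # per-wire counts differ by at most one
```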
Citations: 19
Implementation and results of hypothesis testing from the C³I parallel benchmark suite
Pub Date : 1997-04-01 DOI: 10.1109/IPPS.1997.580886
B. V. Voorst, Luiz Pires, R. Jha, Mustafa Muhammad
This paper describes the implementation of the hypothesis testing benchmark, one of ten kernels from the C³I (Command, Control, Communications and Intelligence) Parallel Benchmark Suite (C³IPBS). The benchmark was implemented and executed on a variety of parallel environments. This paper details the run times obtained with these implementations, and offers an analysis of the results.
Citations: 7
Performance prediction for complex parallel applications
Pub Date : 1997-04-01 DOI: 10.1109/IPPS.1997.580884
J. Brehm, P. Worley
Today's massively parallel machines are typically message-passing systems consisting of hundreds or thousands of processors. Implementing parallel applications efficiently in this environment is a challenging task, and poor parallel design decisions can be expensive to correct. Tools and techniques that allow the fast and accurate evaluation of different parallelization strategies would significantly improve the productivity of application developers and increase throughput on parallel architectures. This paper investigates one of the major issues in building tools to compare parallelization strategies: determining what type of performance models of the application code and of the computer system are sufficient for a fast and accurate comparison of different strategies. The paper is built around a case study employing the Performance Prediction Tool (PerPreT) to predict performance of the Parallel Spectral Transform Shallow Water Model code (PSTSWM) on the Intel Paragon.
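The kind of analytical model such a tool evaluates can be illustrated with a simple computation-plus-communication formula. The formula and every coefficient below are assumptions made for illustration; they are not PerPreT's actual model of PSTSWM on the Intel Paragon:

```python
# Illustrative analytical model of the kind such a tool evaluates: predicted
# time = computation + communication as a function of the processor count.
# The formula and all coefficients are assumptions for illustration only,
# not PerPreT's actual model of PSTSWM on the Intel Paragon.
def predicted_time(ops: float, p: int,
                   t_op: float = 1e-7,       # seconds per operation (assumed)
                   latency: float = 1e-4,    # per-message latency in s (assumed)
                   bandwidth: float = 50e6,  # bytes per second per link (assumed)
                   msg_bytes: float = 8e4,   # bytes per message (assumed)
                   msgs: int = 10) -> float: # messages per processor per step (assumed)
    compute = ops * t_op / p                 # perfectly divisible work
    communicate = msgs * (latency + msg_bytes / bandwidth)
    return compute + communicate

# Comparing candidate parallelization strategies then reduces to comparing
# such predictions across processor counts (or across different message terms):
for p in (16, 64, 256):
    print(p, f"{predicted_time(1e9, p):.3f} s")
```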
Citations: 8
Coherent block data transfer in the FLASH multiprocessor
Pub Date : 1997-04-01 DOI: 10.1109/IPPS.1997.580836
J. Heinlein, K. Gharachorloo, Robert P. Bosch, M. Rosenblum, Anoop Gupta
A key goal of the Stanford FLASH project is to explore the integration of multiple communication protocols in a single multiprocessor architecture. To achieve this goal, FLASH includes a programmable node controller called MAGIC, which contains an embedded protocol processor capable of implementing multiple protocols. In this paper we present a specialized protocol for block data transfer integrated with a conventional cache coherence protocol. Block transfer forms the basis for message passing implementations on top of shared memory, occurs in important workloads such as databases, and is frequently used by the operating system. We discuss the issues that arise in designing a fully integrated protocol and its interactions with cache coherence. Using microbenchmarks, MPI communication primitives, and an application running on the operating system, we compare our protocol with standard bcopy and bcopy augmented with prefetches. Our results show that integrated block transfer can accelerate communication between nodes while off-loading the task from the main processor, utilizing the network more efficiently, and reducing the associated cache pollution. Given the aggressive support for prefetching in FLASH, prefetched bcopy is able to achieve competitive performance in many cases but lacks the other three advantages of our protocol.
Citations: 17
Multiple templates access of trees in parallel memory systems
Pub Date : 1997-04-01 DOI: 10.1109/IPPS.1997.580980
V. Auletta, A. D. Vivo, V. Scarano
Studies the problem of mapping the N nodes of a data structure onto M memory modules so that they can be accessed in parallel by templates, i.e. distinct sets of nodes. In the literature, several algorithms are available for arrays (accessed by rows, columns, diagonals and subarrays) and trees (accessed by subtrees, root-to-leaf paths, etc.). Although some mapping algorithms for arrays allow conflict-free access to several templates at once (e.g. rows and columns), no mapping algorithm is known for efficiently accessing both subtree and root-to-leaf path templates in complete binary trees. We prove that any mapping algorithm that is conflict-free for one of these two templates has Ω(M/log M) conflicts on the other. Therefore, no mapping algorithm can be found that is conflict-free on both templates. We give an algorithm for mapping complete binary trees with N = 2^M - 1 nodes on M memory modules in such a way that: (a) the number of conflicts for accessing a subtree template or a root-to-leaf path template is O(√(M/log M)), (b) the load (i.e. the ratio between the maximum and minimum number of data items mapped on each module) is 1+o(1), and (c) the time complexity for retrieving the module where a given data item is stored is O(1) if a preprocessing phase of space and time complexity O(log N) is executed, or O(log log N) if no preprocessing is allowed.
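To make the notion of a conflict concrete, the sketch below maps the nodes of a complete binary tree onto M modules with a naive modulo rule (deliberately not the paper's mapping algorithm) and counts how many nodes of a path or subtree template collide on the same module:

```python
# Illustration of the conflict notion for tree templates: map the nodes of a
# complete binary tree onto M modules and count collisions for a template.
# The modulo mapping below is a naive baseline, NOT the paper's algorithm;
# it only makes the conflict measure in the stated bounds concrete.
from collections import Counter

def conflicts(template_nodes, module_of):
    """One simple conflict count: accesses beyond one per module (0 = conflict-free)."""
    counts = Counter(module_of(v) for v in template_nodes)
    return sum(c - 1 for c in counts.values())

M = 15                       # number of memory modules
N = 2 ** M - 1               # nodes of the complete binary tree, numbered 1..N
module_of = lambda v: v % M  # naive mapping for illustration

# Root-to-leaf path template: follow left children from the root to a leaf.
path = []
v = 1
while v <= N:
    path.append(v)
    v = 2 * v
print("path conflicts:", conflicts(path, module_of))

# Subtree template: node 3 plus three levels of its descendants (15 nodes).
subtree = [3]
frontier = [3]
for _ in range(3):
    frontier = [c for v in frontier for c in (2 * v, 2 * v + 1)]
    subtree.extend(frontier)
print("subtree conflicts:", conflicts(subtree, module_of))
```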
Citations: 11
Fast parallel computation of the polynomial shift
Pub Date : 1997-04-01 DOI: 10.1109/IPPS.1997.580933
E. Zima
Given an n-degree polynomial f(x) over an arbitrary ring, the shift of f(x) by c is the operation which computes the coefficients of the polynomial f(x+c). In this paper, we consider the case when the shift by the given constant c has to be performed several times (repeatedly). We propose a parallel algorithm that is suited to an SIMD architecture to perform the shift in O(1) time if we have O(n²) processor elements available. The proposed algorithm is easy to generalize to multivariate polynomial shifts. The possibility of applying this algorithm to polynomials with coefficients from non-commutative rings is discussed, as well as the bit-wise complexity of the algorithm.
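A sequential reference point for the shift operation is the classical O(n²) Taylor shift by repeated synthetic division, sketched below. This is only a serial baseline for comparison, not the paper's O(1)-time SIMD algorithm on O(n²) processor elements:

```python
# Sequential reference for the shift operation defined in the abstract:
# given the coefficients of f(x), compute the coefficients of f(x + c) by
# repeated synthetic division (Taylor shift). This is the O(n^2) serial
# baseline, not the paper's O(1)-time parallel SIMD algorithm.
def poly_shift(coeffs, c):
    """coeffs[i] is the coefficient of x**i; returns the coefficients of f(x + c)."""
    a = list(coeffs)
    n = len(a)
    for i in range(n - 1):
        for j in range(n - 2, i - 1, -1):
            a[j] += c * a[j + 1]
    return a

# (x + 1)**2 = x**2 + 2x + 1
print(poly_shift([0, 0, 1], 1))     # [1, 2, 1]
# f(x) = 2x**3 - x + 5 shifted by c = 2 gives the coefficients of f(x + 2)
print(poly_shift([5, -1, 0, 2], 2))
```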
Citations: 5
Parallel 'Go with the winners' algorithms in the LogP model
Pub Date : 1997-04-01 DOI: 10.1109/IPPS.1997.580972
Marcus Peinado, Thomas Lengauer
The authors parallelize the 'Go with the winners' algorithm of Aldous and Vazirani (1994) and analyze the resulting parallel algorithm in the LogP-model. The main issues in the analysis are load imbalances and communication delays. The result of the analysis is a practical algorithm which, under reasonable assumptions, achieves linear speedup. Finally, they analyze the algorithm for a concrete application: generating models of amorphous chemical structures.
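For readers unfamiliar with the underlying scheme, the toy Python sketch below shows the sequential 'Go with the winners' resampling step on an invented survive-with-probability-p process; it is neither the amorphous-structure application nor the parallel LogP version analyzed in the paper:

```python
# Toy sketch of the sequential 'Go with the winners' scheme that the paper
# parallelizes: a population of particles advances in stages, and particles
# that die are replaced by clones of randomly chosen survivors. The
# survive-with-probability-p process is invented purely for illustration.
import random

def go_with_the_winners(num_particles=64, stages=30, p_survive=0.5, seed=1):
    """Returns how many stages the population survived (at most `stages`)."""
    rng = random.Random(seed)
    particles = list(range(num_particles))    # particle identities only
    for stage in range(stages):
        survivors = [p for p in particles if rng.random() < p_survive]
        if not survivors:
            return stage                      # the whole population died out
        # Resampling step: dead particles are replaced by copies of winners,
        # restoring the population to its original size.
        clones = [rng.choice(survivors) for _ in range(num_particles - len(survivors))]
        particles = survivors + clones
    return stages

# A single particle with p = 0.5 survives only about one stage on average,
# whereas the resampled population of 64 almost always completes all 30 stages.
print(go_with_the_winners())
```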
Citations: 8