Pub Date: 1997-04-01 | DOI: 10.1109/IPPS.1997.580905
R. Radhakrishnan, L. Moore, P. Wilsey
Several optimizations to the Time Warp synchronization protocol for parallel discrete event simulation have been proposed and studied. Many of these optimizations have included some form of dynamic adjustment (or control) of the operating parameters of the simulation (e.g. checkpoint interval, cancellation strategy). Traditionally, dynamic parameter adjustment has been performed at the simulation object level; each simulation object collects measures of its operating behaviors (e.g. rollback frequency, rollback length, etc.) and uses them to adjust its operating parameters. The performance data collection functions and parameter adjustment are overhead costs that are incurred in the expectation of higher throughput. The paper presents a method of eliminating some of these overheads through the use of an external object to adjust the control parameters. That is, instead of inserting code for adjusting simulation parameters in the simulation object, an external control object is defined to periodically analyze each simulation object's performance data and revise that object's operating parameters. An implementation of an external control object in the WARPED Time Warp simulation kernel has been completed. The simulation parameters updated by the implemented control system are the checkpoint interval and the cancellation strategy (lazy or aggressive). A comparative analysis of three test cases shows that the external control mechanism provides speedups of 5% to 17% over the best performing embedded dynamic adjustment algorithms.
Title: External adjustment of runtime parameters in Time Warp synchronized parallel simulators
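The external-control idea above can be sketched as a controller that polls per-object counters and revises parameters between windows. This is a minimal illustration only: the class names, counters, thresholds, and adjustment heuristics below are our assumptions, not WARPED's actual interfaces or policies.

```python
from dataclasses import dataclass

@dataclass
class SimObject:
    # Hypothetical performance counters a Time Warp simulation object exposes.
    rollbacks: int = 0
    events_committed: int = 0
    checkpoint_interval: int = 4
    lazy_cancellation: bool = True

class ExternalController:
    """Periodically inspects each object's counters and revises its parameters,
    so the simulation objects themselves carry no adjustment code."""
    def __init__(self, objects):
        self.objects = objects

    def adjust(self):
        for obj in self.objects:
            total = obj.rollbacks + obj.events_committed
            rate = obj.rollbacks / total if total else 0.0
            # Illustrative heuristic: frequent rollbacks favor shorter
            # checkpoint intervals and aggressive cancellation; rare
            # rollbacks favor the opposite. Thresholds are arbitrary.
            if rate > 0.2:
                obj.checkpoint_interval = max(1, obj.checkpoint_interval - 1)
                obj.lazy_cancellation = False
            else:
                obj.checkpoint_interval = min(16, obj.checkpoint_interval + 1)
                obj.lazy_cancellation = True
            obj.rollbacks = obj.events_committed = 0  # start a fresh window

objs = [SimObject(rollbacks=30, events_committed=70),
        SimObject(rollbacks=1, events_committed=99)]
ExternalController(objs).adjust()
```

After one adjustment pass, the rollback-heavy object moves toward aggressive cancellation and a shorter checkpoint interval, while the stable one relaxes in the other direction.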
Pub Date: 1997-04-01 | DOI: 10.1109/IPPS.1997.580949
M. Flammini, S. Pérennes
Gossiping is an information dissemination process in which each processor has a distinct item of information and has to collect all the items possessed by the other processors. We derive lower bounds on the gossiping time of systolic protocols, i.e. protocols constituted by a periodic repetition of simple communication steps. In particular, if we denote by n the number of processors in the network, then for directed networks, and for undirected networks in the half-duplex mode, any s-systolic gossip protocol takes at least g(s) log_2 n time steps, where g(4)=1.8133, g(6)=1.5310 and g(8)=1.4721. For the case s=4 this result is improved to 2.0218 log_2 n for directed butterflies of degree 2, and we show that the 2.0218 log_2 n and 1.8133 log_2 n lower bounds also hold, respectively, for undirected butterfly and de Bruijn networks of degree 2 in the full-duplex case. Our results are obtained by means of a new technique relying on two novel concepts in the field: the notion of the delay digraph of a systolic protocol and the use of matrix norm methods.
Title: Lower bounds on systolic gossip
Pub Date: 1997-04-01 | DOI: 10.1109/IPPS.1997.580950
J. Knopman, J. S. Aude
This paper analyses alternatives for the parallelization of the Simulated Annealing algorithm when applied to the placement of modules in a VLSI circuit, using PVM on an Ethernet cluster of workstations. It is shown that different parallelization approaches have to be used for high and low temperature values of the annealing process. The algorithm used for low temperatures is an adaptive version of the speculative algorithm proposed in the literature. Within this adaptive algorithm, the number of processors allocated to the solution of the placement problem and the number of moves evaluated per processor between synchronization points change with the temperature. At high temperatures, an algorithm based on the parallel evaluation of independent chains of moves has been adopted. It is shown that results with the same quality as those produced by the serial version can be obtained when shorter chains are used in the parallel implementation.
Title: Parallel simulated annealing: an adaptive approach
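The temperature-dependent adaptation described above can be illustrated by a schedule that shrinks the chain length (moves evaluated between synchronization points) as the temperature drops. The formula and constants below are our illustrative choices, not the paper's actual adaptation rule.

```python
def chain_length(temperature, t_max, max_len=256, min_len=4):
    """Illustrative schedule: long independent chains at high temperature,
    short speculative chains near the end of the annealing process."""
    frac = max(0.0, min(1.0, temperature / t_max))
    return max(min_len, int(max_len * frac))
```

A cooling run would then call `chain_length(T, T0)` before each synchronization round, so processors do more independent work early and synchronize more often late.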
Pub Date: 1997-04-01 | DOI: 10.1109/IPPS.1997.580978
M. Mavronicolas, M. Papatriantafilou, P. Tsigas
Counting networks form a new class of distributed, low-contention data structures made up of interconnected balancers, and are suitable for solving a variety of multiprocessor synchronization problems that can be expressed as counting problems. A linearizable counting network guarantees that the order of the values it returns respects the real-time order they were requested. Linearizability significantly raises the capabilities of the network, but at a possible price in network size or synchronization support. In this paper, we further pursue the systematic study of the impact of timing on linearizability for counting networks, along a research line initiated by Lynch et al. (1996). We consider two basic timing models: the instantaneous balancer model, in which the transition of a token from an input to an output port of a balancer is modeled as an instantaneous event, and the periodic balancer model, where balancers send out tokens at a fixed rate. We also consider lower and upper bounds on the delays incurred by wires connecting the balancers. We present necessary and sufficient conditions for linearizability in the form of precise inequalities that involve timing parameters and identify structural parameters of the counting network, which may be of more general interest. Our results significantly extend and strengthen previous impossibility and possibility results on linearizability in counting networks (Herlihy et al., 1990; Lynch et al., 1996).
Title: The impact of timing on linearizability in counting networks
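The balancer primitive underlying counting networks is simple enough to sketch: a toggle that routes incoming tokens alternately to its two output wires. A single balancer is already a width-2 counting network satisfying the step property; this sketch illustrates only that combinatorial behavior, not the timing models or linearizability conditions the paper analyzes.

```python
class Balancer:
    """Forwards incoming tokens alternately to its top (0) and bottom (1)
    output wires, balancing the token counts on its outputs."""
    def __init__(self):
        self.toggle = 0

    def traverse(self):
        out = self.toggle
        self.toggle ^= 1
        return out

# Step property: after any number of tokens, the two output counts differ
# by at most one, with any excess on the top wire.
b = Balancer()
counts = [0, 0]
for _ in range(7):
    counts[b.traverse()] += 1
```

With 7 tokens the outputs receive 4 and 3 tokens respectively; larger counting networks wire many balancers together to preserve this property across more output wires.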
Pub Date: 1997-04-01 | DOI: 10.1109/IPPS.1997.580886
B. V. Voorst, Luiz Pires, R. Jha, Mustafa Muhammad
This paper describes the implementation of the hypothesis testing benchmark, one of ten kernels from the C^3I (Command, Control, Communications and Intelligence) Parallel Benchmark Suite (C^3IPBS). The benchmark was implemented and executed on a variety of parallel environments. This paper details the run times obtained with these implementations, and offers an analysis of the results.
Title: Implementation and results of hypothesis testing from the C^3I parallel benchmark suite
Pub Date: 1997-04-01 | DOI: 10.1109/IPPS.1997.580884
J. Brehm, P. Worley
Today's massively parallel machines are typically message-passing systems consisting of hundreds or thousands of processors. Implementing parallel applications efficiently in this environment is a challenging task, and poor parallel design decisions can be expensive to correct. Tools and techniques that allow the fast and accurate evaluation of different parallelization strategies would significantly improve the productivity of application developers and increase throughput on parallel architectures. This paper investigates one of the major issues in building tools to compare parallelization strategies: determining what type of performance models of the application code and of the computer system are sufficient for a fast and accurate comparison of different strategies. The paper is built around a case study employing the Performance Prediction Tool (PerPreT) to predict performance of the Parallel Spectral Transform Shallow Water Model code (PSTSWM) on the Intel Paragon.
Title: Performance prediction for complex parallel applications
Pub Date: 1997-04-01 | DOI: 10.1109/IPPS.1997.580836
J. Heinlein, K. Gharachorloo, Robert P. Bosch, M. Rosenblum, Anoop Gupta
A key goal of the Stanford FLASH project is to explore the integration of multiple communication protocols in a single multiprocessor architecture. To achieve this goal, FLASH includes a programmable node controller called MAGIC, which contains an embedded protocol processor capable of implementing multiple protocols. In this paper we present a specialized protocol for block data transfer integrated with a conventional cache coherence protocol. Block transfer forms the basis for message passing implementations on top of shared memory, occurs in important workloads such as databases, and is frequently used by the operating system. We discuss the issues that arise in designing a fully integrated protocol and its interactions with cache coherence. Using microbenchmarks, MPI communication primitives, and an application running on the operating system, we compare our protocol with standard bcopy and bcopy augmented with prefetches. Our results show that integrated block transfer can accelerate communication between nodes while off-loading the task from the main processor, utilizing the network more efficiently, and reducing the associated cache pollution. Given the aggressive support for prefetching in FLASH, prefetched bcopy is able to achieve competitive performance in many cases but lacks the other three advantages of our protocol.
Title: Coherent block data transfer in the FLASH multiprocessor
Pub Date: 1997-04-01 | DOI: 10.1109/IPPS.1997.580980
V. Auletta, A. D. Vivo, V. Scarano
Studies the problem of mapping the N nodes of a data structure onto M memory modules so that they can be accessed in parallel by templates, i.e. distinct sets of nodes. In the literature, several algorithms are available for arrays (accessed by rows, columns, diagonals and subarrays) and trees (accessed by subtrees, root-to-leaf paths, etc.). Although some mapping algorithms for arrays allow conflict-free access to several templates at once (e.g. rows and columns), no mapping algorithm is known for efficiently accessing both subtree and root-to-leaf path templates in complete binary trees. We prove that any mapping algorithm that is conflict-free for one of these two templates has Ω(M/log M) conflicts on the other. Therefore, no mapping algorithm can be found that is conflict-free on both templates. We give an algorithm for mapping complete binary trees with N = 2^M - 1 nodes on M memory modules in such a way that: (a) the number of conflicts for accessing a subtree template or a root-to-leaf path template is O(√(M/log M)), (b) the load (i.e. the ratio between the maximum and minimum number of data items mapped on each module) is 1+o(1), and (c) the time complexity for retrieving the module where a given data item is stored is O(1) if a preprocessing phase of space and time complexity O(log N) is executed, or O(log log N) if no preprocessing is allowed.
Title: Multiple templates access of trees in parallel memory systems
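The conflict measure the abstract bounds can be made concrete with a naive modular mapping (node i goes to module i mod M) on a heap-numbered complete binary tree. This only illustrates the metric; it is not the paper's mapping algorithm, and the function names are ours.

```python
from collections import Counter

def conflicts(nodes, M):
    """Conflicts when one template is accessed in parallel: every access
    beyond the first to the same memory module counts as a conflict."""
    tally = Counter(n % M for n in nodes)
    return sum(v - 1 for v in tally.values())

def root_to_leaf_path(leaf):
    # Complete binary tree in heap numbering: node 1 is the root and the
    # parent of node i is i // 2; the path walks from the leaf up to the root.
    path = []
    while leaf >= 1:
        path.append(leaf)
        leaf //= 2
    return path
```

For M = 7 modules and the tree with N = 2^7 - 1 = 127 nodes, the path from leaf 127 to the root visits 7 nodes but only 3 distinct modules under the modular mapping, so several accesses collide, which is exactly the kind of imbalance the paper's mapping is designed to keep at O(√(M/log M)).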
Pub Date: 1997-04-01 | DOI: 10.1109/IPPS.1997.580933
E. Zima
Given an n-degree polynomial f(x) over an arbitrary ring, the shift of f(x) by c is the operation which computes the coefficients of the polynomial f(x+c). In this paper, we consider the case when the shift by the given constant c has to be performed several times (repeatedly). We propose a parallel algorithm that is suited to an SIMD architecture to perform the shift in O(1) time if we have O(n^2) processor elements available. The proposed algorithm is easy to generalize to multivariate polynomial shifts. The possibility of applying this algorithm to polynomials with coefficients from non-commutative rings is discussed, as well as the bit-wise complexity of the algorithm.
Title: Fast parallel computation of the polynomial shift
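For reference, the shift itself is a classical sequential computation: repeated synthetic division (Horner's rule) produces the coefficients of f(x+c) in O(n^2) ring operations. This sketch shows the baseline operation the paper parallelizes to O(1) SIMD time with O(n^2) processors; it is not the paper's parallel algorithm.

```python
def shift(coeffs, c):
    """Return the coefficients of f(x + c) given the coefficients of f(x),
    both in ascending order (coeffs[i] is the coefficient of x**i).
    Sequential O(n^2) Taylor shift via repeated synthetic division."""
    a = list(coeffs)
    n = len(a)
    for i in range(n - 1):
        for j in range(n - 2, i - 1, -1):
            a[j] += c * a[j + 1]
    return a
```

For example, shifting x^2 by 1 yields (x+1)^2 = x^2 + 2x + 1, i.e. coefficients [1, 2, 1]; because only ring additions and multiplications are used, the same loop works over any commutative coefficient ring.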
Pub Date: 1997-04-01 | DOI: 10.1109/IPPS.1997.580972
Marcus Peinado, Thomas Lengauer
The authors parallelize the 'Go with the winners' algorithm of Aldous and Vazirani (1994) and analyze the resulting parallel algorithm in the LogP-model. The main issues in the analysis are load imbalances and communication delays. The result of the analysis is a practical algorithm which, under reasonable assumptions, achieves linear speedup. Finally, they analyze the algorithm for a concrete application: generating models of amorphous chemical structures.
Title: Parallel 'Go with the winners' algorithms in the LogP model
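The underlying 'Go with the winners' scheme of Aldous and Vazirani can be sketched sequentially: a fixed-size population of particles takes random steps, and particles that die are replaced by clones of survivors. The function names and cloning rule below are our illustrative choices; the paper's LogP-model parallelization, load balancing, and communication costs are not modeled here.

```python
import random

def go_with_the_winners(step, alive, start, depth, population=8, seed=0):
    """Sequential sketch: run `population` particles for `depth` stages;
    at each stage, dead particles are replaced by uniformly chosen clones
    of the survivors, keeping the population size fixed."""
    rng = random.Random(seed)
    particles = [start] * population
    for _ in range(depth):
        moved = [step(p, rng) for p in particles]
        survivors = [p for p in moved if alive(p)]
        if not survivors:
            return []  # the whole population died out
        particles = [rng.choice(survivors) for _ in range(population)]
    return particles

# Usage: a random walk on the non-negative integers that dies below zero.
walk = go_with_the_winners(step=lambda p, rng: p + rng.choice((-1, 1)),
                           alive=lambda p: p >= 0, start=0, depth=10)
```

Every particle returned by the walk respects the survival condition, since dead particles are always replaced by clones of live ones.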