Proceedings of the 2015 International Workshop on Parallel Symbolic Computation: Latest Publications

Cache oblivious sparse polynomial factoring using the funnel heap
Pub Date : 2015-07-10 DOI: 10.1145/2790282.2790283
Fatima K. Abu Salem, Khalil El-Harake, Karl Gemayel
In [2] we demonstrated that overlapping sums of products arising in the Hensel lifting phase of the polytope factoring method using a Max priority queue reduces expression swell and achieves asymptotic reductions in the Hensel lifting phase. In this paper, we propose to implement the priority queue as a Funnel Heap, when polynomials are in sparse distributed representation. Funnel Heap is a cache-oblivious priority queue with optimal cache complexity, and we additionally tailor several of its features to the polynomial arithmetic required. Funnel Heap is able to identify equal order monomials "for free" whilst it re-organises itself over sufficiently many updates. We adopt a batched mode for chaining equal order monomials that gets overlapped with Funnel Heap's mechanism for emptying its in-core components. We also develop a customised analysis of performance that captures the overhead due to chaining in terms of the fraction of reduction and replication observed in the queue, and find that batched chaining is sensitive to the number of distinct monomials residing in the queue, as opposed to the number of replicas chained. For sufficiently large input size with respect to the cache-line length, batched chaining that is "search free" leads to an implementation of Hensel lifting that exhibits optimal cache complexity in the number of replicas found in the queue. Additionally, we obtain an order of magnitude reduction in space, as well as a reduction in the logarithmic factor in work and cache complexity, when comparing our adaptation against [2]. Also, the resulting Hensel lifting process is cache-oblivious. Our benchmarks of the polytope method using Funnel Heap with chaining demonstrate dramatic improvements over the regular binary heap as well as MAGMA, where the latter fails to process sufficiently high-degree but sparse polynomial factorisations.
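The core operation described above, combining ("chaining") equal-order monomials as they surface at the top of a priority queue during sparse polynomial multiplication, can be sketched with an ordinary binary heap standing in for the paper's Funnel Heap. The representation and names below are illustrative assumptions, not the authors' code:

```python
import heapq

def sparse_mul(f, g):
    """Multiply two sparse univariate polynomials given as
    {degree: coeff} dicts, using a max-priority queue keyed on
    degree. Products of equal degree are combined as they are
    popped -- a binary-heap stand-in for the Funnel Heap."""
    fs = sorted(f.items(), reverse=True)   # terms of f, high degree first
    gs = sorted(g.items(), reverse=True)   # terms of g, high degree first
    # Seed the heap with f_i * g_0 for every term of f.
    # heapq is a min-heap, so degrees are negated to pop the max first.
    heap = [(-(df + gs[0][0]), i, 0) for i, (df, _) in enumerate(fs)]
    heapq.heapify(heap)
    result = {}
    while heap:
        negdeg, i, j = heapq.heappop(heap)
        deg = -negdeg
        result[deg] = result.get(deg, 0) + fs[i][1] * gs[j][1]
        if j + 1 < len(gs):                # advance f_i to the next term of g
            heapq.heappush(heap, (-(fs[i][0] + gs[j + 1][0]), i, j + 1))
    return {d: c for d, c in result.items() if c}

# (x^2 + 1) * (x + 1) = x^3 + x^2 + x + 1
print(sparse_mul({2: 1, 0: 1}, {1: 1, 0: 1}))
```

The heap holds one candidate product per term of f, so its size is bounded by the number of distinct monomials, which is the quantity the paper's analysis shows batched chaining is sensitive to.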
Citations: 1
A compact parallel implementation of F4
Pub Date : 2015-07-10 DOI: 10.1145/2790282.2790293
M. Monagan, Roman Pearce
We present a compact and parallel C implementation of the F4 algorithm for computing Gröbner bases, using Cilk. We give an easy way to parallelize the sparse linear algebra, which is the main cost in practice. To obtain more speedup we attempted to parallelize the generation of the sparse matrices as well. We present timings to assess the effectiveness of our approach and to compare our implementation to others.
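The sparse linear algebra that dominates F4's cost is, at its core, row reduction over GF(p). The following is a sequential dense sketch, not the authors' Cilk C implementation; the row-update loop marked below is the part that parallelizes naturally, since each row update is independent:

```python
def row_reduce_mod_p(rows, p):
    """Gaussian elimination over GF(p) on dense rows (lists of ints).
    In F4 the analogous sparse reduction dominates the runtime; a
    Cilk implementation would spawn the independent row updates in
    parallel (simplified illustration, not the authors' code)."""
    rows = [r[:] for r in rows]
    pivot_row = 0
    for col in range(len(rows[0])):
        # find a row with a nonzero entry in this column
        pivot = next((i for i in range(pivot_row, len(rows))
                      if rows[i][col] % p), None)
        if pivot is None:
            continue
        rows[pivot_row], rows[pivot] = rows[pivot], rows[pivot_row]
        inv = pow(rows[pivot_row][col], -1, p)  # normalize the pivot to 1
        rows[pivot_row] = [c * inv % p for c in rows[pivot_row]]
        for i in range(len(rows)):              # independent row updates
            if i != pivot_row and rows[i][col] % p:
                m = rows[i][col]
                rows[i] = [(a - m * b) % p
                           for a, b in zip(rows[i], rows[pivot_row])]
        pivot_row += 1
        if pivot_row == len(rows):
            break
    return rows
```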
Citations: 5
Direct solution of the (11,9,8)-MinRank problem by the block Wiedemann algorithm in magma with a tesla GPU
Pub Date : 2015-07-10 DOI: 10.1145/2790282.2791392
A. Steel
We show how some very large multivariate polynomial systems over finite fields can be solved by Gröbner basis techniques coupled with the Block Wiedemann algorithm, thus extending the Wiedemann-based 'Sparse FGLM' approach of Faugère and Mou. The main components of our approach are a dense variant of the Faugère F4 Gröbner basis algorithm and the Block Wiedemann algorithm, which have been implemented within the Magma Computer Algebra System (released in version V2.20 in late 2014). A major feature of the algorithms is that they map much of the computation to dense matrix multiplication, and this allows dramatic speedups to be achieved for large examples when an Nvidia Tesla GPU is available. As a result, the Magma implementation can directly solve a 16-bit random instance of the Courtois (11,9,8)-MinRank Challenge C in about 15.1 hours with a single Intel Sandybridge CPU core coupled with an Nvidia Tesla K40 GPU.
Citations: 5
Parallel algebraic linear algebra dedicated interface
Pub Date : 2015-07-10 DOI: 10.1145/2790282.2790286
T. Gautier, Jean-Louis Roch, Ziad Sultan, Bastien Vialla
This work deals with parallelism in linear algebra routines. We propose a domain specific language based on C/C++ macros, PALADIn (Parallel Algebraic Linear Algebra Dedicated Interface). This domain specific language allows the user to write C++ code and benefit from sequential and parallel executions on shared memory architectures. With a unique syntax, the user can switch between different parallel runtime systems such as OpenMP, TBB and xKaapi. This interface provides data and task parallelism. Depending on the runtime system, task parallelism can use explicit synchronizations or data-dependency based synchronizations. Also, this language provides different matrix cutting strategies according to one or two dimensions. Moreover, block algorithms, such as block iterative and recursive matrix multiplication, can involve splitting according to three dimensions. The latter is also a feature that is provided to the user. The PALADIn interface can be used in any C++ library for linear algebra computation and gets the best performance from the three supported parallel runtime systems.
Citations: 0
A hybrid symbolic-numeric approach to exceptional sets of generically zero-dimensional systems
Pub Date : 2015-07-10 DOI: 10.1145/2790282.2790288
J. Hauenstein, Alan C. Liddell
Exceptional sets are the sets where the dimension of the fiber of a map is larger than the generic fiber dimension, which we assume is zero. Such situations naturally arise in kinematics, for example, when designing a mechanism that moves when the generic case is rigid. In 2008, Sommese and Wampler showed that one can use fiber products to promote such sets to become irreducible components. We propose an alternative approach using rank constraints on Macaulay matrices. Symbolic computations are used to construct the proper Macaulay matrices, while numerical computations are used to solve the rank-constraint problem. Various exceptional sets are computed, including exceptional RR dyads, lines on surfaces in C3, and exceptional planar pentads.
Citations: 2
A parallel implementation for polynomial multiplication modulo a prime
Pub Date : 2015-07-10 DOI: 10.1145/2790282.2790291
M. Law, M. Monagan
We present a parallel implementation in Cilk C of a modular algorithm for multiplying two polynomials in Zq[x] for integer q > 1, for multi-core computers. Our algorithm uses Chinese remaindering. It multiplies modulo primes p1, p2, ... in parallel and uses a parallel FFT for each prime. Our software multiplies two polynomials of degree 10^9 modulo a 32-bit integer q in 83 seconds on a 20-core computer.
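The algorithm's structure, independent multiplications modulo several primes followed by Chinese remaindering, can be sketched as follows. A naive convolution stands in for the parallel FFT the paper uses, and all helper names are illustrative:

```python
def mul_mod(f, g, p):
    """Naive convolution of coefficient lists modulo prime p
    (the paper uses a parallel FFT here; this is only a sketch)."""
    h = [0] * (len(f) + len(g) - 1)
    for i, a in enumerate(f):
        for j, b in enumerate(g):
            h[i + j] = (h[i + j] + a * b) % p
    return h

def crt_pair(r1, p1, r2, p2):
    """Combine residues r1 mod p1 and r2 mod p2 (Chinese remaindering)."""
    inv = pow(p1, -1, p2)                  # p1^{-1} mod p2
    return (r1 + p1 * ((r2 - r1) * inv % p2)) % (p1 * p2)

def mul_crt(f, g, q, primes):
    """Multiply f, g in Z_q[x]: multiply modulo each prime
    independently (done in parallel in the paper), then CRT the
    coefficient images and reduce mod q. The prime product must
    exceed the largest integer product coefficient."""
    images = [mul_mod(f, g, p) for p in primes]
    combined, m = images[0], primes[0]
    for img, p in zip(images[1:], primes[1:]):
        combined = [crt_pair(c, m, r, p) for c, r in zip(combined, img)]
        m *= p
    return [c % q for c in combined]
```

For example, (1 + 2x)(3 + 4x) mod 11 with primes 5 and 7 recovers the integer product 3 + 10x + 8x^2 before the final reduction.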
Citations: 5
Parallel sparse multivariate polynomial division
Pub Date : 2015-07-10 DOI: 10.1145/2790282.2790285
M. Gastineau, J. Laskar
We present a scalable algorithm for dividing two sparse multivariate polynomials represented in a distributed format on shared memory multicore computers. The scalability on the large number of cores is ensured by the lack of synchronizations during the main parallel step. The merge and sorting operations are based on binary heap or tree data structures.
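The heap-based merge the abstract refers to can be illustrated as a k-way merge of term streams sorted by decreasing degree, combining coefficients of equal-degree terms on the way out. This is a sequential sketch of the data-structure idea only, not the authors' shared-memory implementation:

```python
import heapq

def merge_terms(streams):
    """k-way merge of term streams, each a list of (degree, coeff)
    pairs sorted by decreasing degree. heapq.merge performs the
    heap-based merge; equal-degree terms are combined as they
    emerge, and zero terms are dropped."""
    merged = heapq.merge(*streams, key=lambda t: -t[0])
    out = []
    for deg, coeff in merged:
        if out and out[-1][0] == deg:
            out[-1] = (deg, out[-1][1] + coeff)  # combine equal degrees
        else:
            out.append((deg, coeff))
    return [(d, c) for d, c in out if c != 0]
```

Merging the streams for x^3 + 2x and 2x^3 + 5, for instance, yields the terms of 3x^3 + 2x + 5.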
Citations: 10
High performance implementation of the inverse TFT
Pub Date : 2015-07-10 DOI: 10.1145/2790282.2790292
Lingchuan Meng, Jeremy R. Johnson
The inverse truncated Fourier transform (ITFT) is a key component in the fast polynomial and large integer algorithms introduced by van der Hoeven. This paper reports a high performance implementation of the ITFT, which poses additional challenges compared to that of the forward transform. A general-radix variant of the ITFT algorithm is developed to allow the implementation to automatically adapt to the memory hierarchy. Then a parallel ITFT algorithm is developed that trades off a small arithmetic cost for full vectorization and improved multi-threaded parallelism. The algorithms are automatically generated and tuned to produce an arbitrary-size ITFT library. The new algorithms and implementation smooth out the staircase performance associated with power-of-two modular FFT implementations, and provide significant performance improvement over zero-padding approaches even when high-performance FFT libraries are used.
Citations: 5
Optimizing and parallelizing the modular GCD algorithm
Pub Date : 2015-07-10 DOI: 10.1145/2790282.2790287
Matthew Gibson, M. Monagan
Our goal is to design and implement a high performance modular GCD algorithm for polynomial GCD computation in Zp[x1, x2, ..., xn] for multi-core computers, which will be used to compute the GCD of polynomials over Z. For n = 2 we have designed and implemented in C a highly optimized serial code for primes p < 2^63. For n > 2 we parallelized, in Cilk C, Brown's dense modular GCD algorithm using our serial bivariate code at the base. For n = 3, we obtain good parallel speedup on multi-core computers with 16 and 20 cores. We also compare our code with the GCD codes in Maple and Magma.
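The base case of any such modular GCD algorithm is a GCD computation in Zp[x]. A simplified univariate sketch using the monic Euclidean algorithm follows; the authors' optimized serial C code handles the bivariate case, so this only illustrates the modular building block:

```python
def poly_gcd_mod_p(f, g, p):
    """Monic Euclidean GCD in Zp[x]; polynomials are coefficient
    lists, lowest degree first. A modular GCD algorithm computes
    such images for many primes p and reconstructs the integer GCD
    by Chinese remaindering (illustrative sketch only)."""
    def trim(a):
        while a and a[-1] % p == 0:
            a.pop()
        return a
    f, g = trim([c % p for c in f]), trim([c % p for c in g])
    while g:
        inv = pow(g[-1], -1, p)            # make the divisor monic
        g = [c * inv % p for c in g]
        r = f[:]                           # remainder of f divided by g
        for i in range(len(r) - len(g), -1, -1):
            q = r[i + len(g) - 1]
            if q:
                for j, c in enumerate(g):
                    r[i + j] = (r[i + j] - q * c) % p
        f, g = g, trim(r)
    inv = pow(f[-1], -1, p)                # return the monic GCD
    return [c * inv % p for c in f]
```

For instance, gcd(x^2 + x - 2, x^2 + 2x - 3) mod 7 is the image of x - 1, returned as the monic polynomial x + 6.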
Citations: 1
GPU-acceleration of optimal permutation-puzzle solving
Pub Date : 2015-07-10 DOI: 10.1145/2790282.2790289
Hayakawa Hiroki, Ishida Naoaki, M. Hirokazu
We first investigate parallelization of an optimal solver for Rubik's cube, especially its acceleration by GPU. To examine its efficacy, we implement a simple solver based on Korf's algorithm, in which the CPU and GPU collaborate in the IDA* algorithm and a large number of GPU cores are used for speedup instead of the huge distance table used for pruning. Empirical studies succeeded in attaining sufficient speedup through GPU-acceleration. There are many other similar puzzles, so-called permutation puzzles. Solving such a puzzle, i.e., restoring the ordered state from a scrambled one, is equivalent to path-finding in the Cayley graph of the permutation group. We generalize the method used for Rubik's cube to much smaller problems and examine its efficacy. The focus of our research interest is how efficient parallel path-finding can be, and whether the use of a large number of cores can, in general, substitute for a large distance table used for pruning.
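The IDA* search at the heart of Korf's algorithm can be sketched on a toy permutation puzzle whose moves are adjacent transpositions. The heuristic here (misplaced elements halved, admissible because one swap fixes at most two) is a stand-in for the distance tables or GPU-evaluated bounds discussed above; everything below is an illustrative assumption, not the paper's solver:

```python
from math import ceil

def ida_star(start, goal):
    """Minimal IDA*: iteratively deepen the f = g + h bound until
    an optimal path is found. States are tuples; moves swap
    adjacent positions."""
    def h(s):
        return ceil(sum(a != b for a, b in zip(s, goal)) / 2)

    def neighbors(s):
        for i in range(len(s) - 1):          # adjacent transpositions
            t = list(s)
            t[i], t[i + 1] = t[i + 1], t[i]
            yield tuple(t)

    def search(path, g, bound):
        s = path[-1]
        f = g + h(s)
        if f > bound:
            return f                         # exceeded bound: report it
        if s == goal:
            return True
        minimum = float("inf")
        for t in neighbors(s):
            if t in path:                    # avoid trivial cycles
                continue
            path.append(t)
            r = search(path, g + 1, bound)
            if r is True:
                return True
            minimum = min(minimum, r)
            path.pop()
        return minimum

    bound, path = h(start), [start]
    while True:
        r = search(path, 0, bound)
        if r is True:
            return len(path) - 1             # optimal move count
        bound = r                            # next-smallest exceeded f

print(ida_star((2, 1, 0), (0, 1, 2)))        # (2,1,0) needs 3 adjacent swaps
```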
Citations: 3