2015 IEEE International Parallel and Distributed Processing Symposium Workshop: Latest Articles


On the Impact of Execution Models: A Case Study in Computational Chemistry

2015 IEEE International Parallel and Distributed Processing Symposium Workshop

Pub Date : 2015-05-25 DOI: 10.1109/IPDPSW.2015.111

D. Chavarría-Miranda, M. Halappanavar, S. Krishnamoorthy, J. Manzano, Abhinav Vishnu, A. Hoisie

Efficient utilization of high-performance computing (HPC) platforms is an important and complex problem. Execution models, abstract descriptions of the dynamic runtime behavior of the execution stack, have a significant impact on the utilization of HPC systems. Using a computational chemistry kernel as a case study and a wide variety of execution models combined with load balancing techniques, we explore the impact of execution models on the utilization of an HPC system. We demonstrate a 50 percent improvement in performance by using work stealing relative to a more traditional static scheduling approach. We also use a novel semi-matching technique for load balancing whose performance is comparable to that of a traditional hypergraph-based partitioning implementation, which is computationally expensive. From this study, we found that execution model design choices and assumptions can limit critical optimizations such as global, dynamic load balancing and finding the correct balance between available work units and different system and runtime overheads. With the emergence of multi- and many-core architectures and the consequent growth in the complexity of HPC platforms, we believe that these lessons will be beneficial to researchers tuning diverse applications on modern HPC platforms, especially on emerging dynamic platforms with energy-induced performance variability.


Citations: 1

Combining Backward and Forward Recovery to Cope with Silent Errors in Iterative Solvers

2015 IEEE International Parallel and Distributed Processing Symposium Workshop

Pub Date : 2015-05-25 DOI: 10.1109/IPDPSW.2015.22

M. Fasi, Y. Robert, B. Uçar

Several recent papers have introduced a periodic verification mechanism to detect silent errors in iterative solvers. Chen [PPoPP'13, pp. 167-176] has shown how to combine such a verification mechanism (a stability test checking the orthogonality of two vectors and recomputing the residual) with checkpointing: the idea is to verify every d iterations, and to checkpoint every c × d iterations. When a silent error is detected by the verification mechanism, one can roll back to, and re-execute from, the last checkpoint. In this paper, we also propose to combine checkpointing and verification, but we use ABFT (algorithm-based fault tolerance) rather than stability tests. ABFT can be used for error detection alone, but also for error detection and correction, allowing forward recovery (with neither rollback nor re-execution) when a single error is detected. We introduce an abstract performance model to compute the performance of all schemes, and we instantiate it using the Conjugate Gradient algorithm. Finally, we validate our new approach through a set of simulations.
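The verify-every-d, checkpoint-every-c × d schedule with rollback can be illustrated with a small sketch. This is a hypothetical illustration only, not the authors' implementation: the function `run_with_recovery` and its `step` and `verify` callbacks are assumed names, and the sketch covers only backward recovery (rollback), not ABFT-based forward correction.

```python
import copy

def run_with_recovery(state, step, verify, n_iters, d, c):
    """Run `step` for n_iters iterations (hypothetical driver, not from the paper).

    Verifies the state every d iterations and checkpoints every c*d iterations.
    When verification fails, rolls back to the last checkpoint and re-executes.
    Returns the final state and the number of rollbacks performed.
    """
    checkpoint = (0, copy.deepcopy(state))  # iteration index and saved state
    rollbacks = 0
    i = 0
    while i < n_iters:
        state = step(state, i)
        i += 1
        if i % d == 0:
            if not verify(state):
                # Silent error detected: restore the last verified checkpoint.
                i, state = checkpoint[0], copy.deepcopy(checkpoint[1])
                rollbacks += 1
                continue
            if i % (c * d) == 0:
                # Checkpoint only states that have just passed verification.
                checkpoint = (i, copy.deepcopy(state))
    return state, rollbacks
```

Note that if `n_iters` is not a multiple of `d`, the last few iterations are returned unverified; a real solver would add a final verification.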


Citations: 6

Scalable Task-Parallel SGD on Matrix Factorization in Multicore Architectures

2015 IEEE International Parallel and Distributed Processing Symposium Workshop

Pub Date : 2015-05-25 DOI: 10.1109/IPDPSW.2015.135

Yusuke Nishioka, K. Taura

Recommendation is an indispensable technique, especially in e-commerce services such as Amazon or Netflix, for offering users the items they are most likely to prefer. Matrix factorization is a well-known recommendation algorithm that estimates affinities between users and items based solely on ratings explicitly given by users. To handle large amounts of data, stochastic gradient descent (SGD), an online loss minimization algorithm, can be applied to matrix factorization. SGD is effective in terms of both convergence speed and memory consumption, but it is difficult to parallelize because it is inherently sequential. FPSGD by Zhuang et al. is an existing parallel SGD method for matrix factorization that divides the rating matrix into many small blocks. Threads work on blocks in such a way that they do not update the same rows or columns of the factor matrices. Because of this technique, FPSGD achieves higher convergence speed than other existing methods. Still, as we demonstrate in this paper, FPSGD does not scale beyond 32 cores with the 1.4GB Netflix dataset because assigning non-conflicting blocks to threads requires a lock operation. In this work, we propose an alternative approach to SGD for matrix factorization using a task-parallel programming model. As a result, we have successfully overcome the bottleneck of FPSGD and achieved higher scalability with 64 cores.
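To make the block-conflict idea concrete, here is a minimal sketch of a single SGD step for matrix factorization and of the rule for when two blocks can be processed concurrently. The names `sgd_update` and `blocks_conflict`, the learning rate, and the L2 regularization term are assumptions chosen for illustration; they are not taken from FPSGD or from the paper.

```python
def sgd_update(P, Q, u, v, r, lr=0.05, reg=0.02):
    """One SGD step for rating r given by user u to item v (illustrative only).

    P and Q are lists of factor vectors (lists of floats). The step reads and
    writes only row P[u] and row Q[v], which is why ratings in blocks that
    share no user rows and no item columns can be updated independently.
    Returns the prediction error measured before the update.
    """
    pu, qv = P[u], Q[v]
    err = r - sum(a * b for a, b in zip(pu, qv))
    # Both new vectors are computed from the old pu and qv.
    P[u] = [a + lr * (err * b - reg * a) for a, b in zip(pu, qv)]
    Q[v] = [b + lr * (err * a - reg * b) for a, b in zip(pu, qv)]
    return err

def blocks_conflict(block_a, block_b):
    """Blocks are (row_band, col_band) pairs; they conflict if they share a band."""
    return block_a[0] == block_b[0] or block_a[1] == block_b[1]
```

A scheduler, whether lock-based or task-based, only has to guarantee that no two concurrently running blocks satisfy `blocks_conflict`.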


Citations: 8

Storm Pub-Sub: High Performance, Scalable Content Based Event Matching System Using Storm

2015 IEEE International Parallel and Distributed Processing Symposium Workshop

Pub Date : 2015-05-25 DOI: 10.1109/IPDPSW.2015.95

M. Shah, D. Kulkarni

Storm pub-sub is a novel high-performance publish-subscribe system designed to match events against subscriptions efficiently and with high throughput. We move a content-based pub-sub system first to a local cluster and then to a distributed cluster framework to obtain high performance and scalability. We depart from the use of broker overlays, where each server must support the whole range of operations of a pub-sub service, as well as overlay management and routing functionality. In this system, the different operations involved in pub-sub are separated to leverage their natural potential for parallelization using Storm bolts. Storm pub-sub is compared with Siena, a traditional pub-sub system with a broker-based architecture. Through experimentation on a local cluster as well as on a distributed cluster, we show that our approach of designing a publish-subscribe system on Storm scales well for high volumes of data. The Storm pub-sub system achieves approximately 2200 events/s on a distributed cluster. In this paper we describe the design and implementation of Storm pub-sub and evaluate it in terms of scalability and throughput.


Citations: 6

A Branch-and-Estimate Heuristic Procedure for Solving Nonconvex Integer Optimization Problems

2015 IEEE International Parallel and Distributed Processing Symposium Workshop

Pub Date : 2015-05-25 DOI: 10.1109/IPDPSW.2015.43

Prashant Palkar, Ashutosh Mahajan

We present a method for solving nonconvex mixed-integer nonlinear programs using a branch-and-bound framework. At each node in the search tree, we solve the continuous nonlinear relaxation multiple times using an existing non-linear solver. Since the relaxation we create is in general not convex, this method may not find an optimal solution. In order to mitigate this difficulty, we solve the relaxation multiple times in parallel starting from different initial points. Our preliminary computational experiments show that this approach gives optimal or near-optimal solutions on benchmark problems, and that the method benefits well from parallelism.
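The multi-start step described above, solving one nonconvex relaxation from several initial points in parallel and keeping the best local solution, might look roughly like the following. Everything here is an assumption made for illustration: the one-dimensional objective `f`, the plain gradient descent used as the local solver, and the function names. The paper uses an existing nonlinear solver inside a branch-and-bound tree, which this sketch does not attempt to reproduce.

```python
from concurrent.futures import ThreadPoolExecutor

def f(x):
    """Toy nonconvex objective with two local minima (hypothetical example)."""
    return x**4 - 3 * x**2 + x

def grad_f(x):
    return 4 * x**3 - 6 * x + 1

def local_descent(x0, lr=0.01, steps=2000):
    """Stand-in local solver: fixed-step gradient descent from x0."""
    x = x0
    for _ in range(steps):
        x -= lr * grad_f(x)
    return x, f(x)

def multistart_minimize(starts):
    """Run the local solver from every start concurrently; keep the best result."""
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(local_descent, starts))
    return min(results, key=lambda xf: xf[1])

best_x, best_f = multistart_minimize([-2.0, -0.5, 0.5, 2.0])
```

Starts on the right of the local maximum settle in the shallower minimum near x ≈ 1.13, while starts on the left reach the deeper one near x ≈ -1.3, so a single start could easily miss the better solution. In CPython, threads do not run pure-Python arithmetic in parallel; a process pool or a native solver that releases the interpreter lock would be needed for real speedup.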


Citations: 0

Implementing Uniform Reliable Broadcast in Anonymous Distributed Systems with Fair Lossy Channels

2015 IEEE International Parallel and Distributed Processing Symposium Workshop

Pub Date : 2015-05-25 DOI: 10.1109/IPDPSW.2015.23

Jian Tang, M. Larrea, S. Arévalo, Ernesto Jiménez

Uniform Reliable Broadcast (URB) is an important abstraction in distributed systems, offering delivery guarantees when spreading messages among processes. Informally, URB guarantees that if a process (correct or not) delivers a message m, then all correct processes deliver m. This abstraction has been extensively investigated in distributed systems where all processes have different identifiers. Furthermore, the majority of papers in the literature assume that the communication channels of the system are reliable, which is not always the case in real systems. In this paper, the URB abstraction is investigated in anonymous asynchronous message-passing systems with fair lossy communication channels. First, a simple algorithm is given that solves URB in this system model assuming a majority of correct processes. Then a new failure detector class AT is proposed. With AT, URB can be implemented with any number of correct processes. Because fair lossy communication channels can lose messages, every correct process in this first algorithm has to broadcast all URB-delivered messages forever, which makes the algorithm non-quiescent. In order to obtain a quiescent URB algorithm in anonymous asynchronous systems, a perfect anonymous failure detector AP* is proposed. Finally, a quiescent URB algorithm using AT and AP* is given.


Citations: 1

Towards Detecting Patterns in Failure Logs of Large-Scale Distributed Systems

2015 IEEE International Parallel and Distributed Processing Symposium Workshop

Pub Date : 2015-05-25 DOI: 10.1109/IPDPSW.2015.109

Nentawe Gurumdimma, A. Jhumka, Maria Liakata, Edward Chuah, J. Browne

The ability to automatically detect faults or fault patterns to enhance system reliability is important for system administrators in reducing system failures. To achieve this objective, the message logs from a cluster system are augmented with failure information, i.e., the raw log data is labelled. However, tagging or labelling raw log data is very costly. In this paper, our objective is to detect failure patterns in the message logs using unlabelled data. To achieve our aim, we propose a methodology in which a pre-processing step first removes redundant data. A clustering algorithm is then executed on the resulting logs, and we further develop an unsupervised algorithm to detect failure patterns in the clustered log by harnessing the characteristics of these sequences. We evaluated our methodology on large production data, and the results show that, on average, an f-measure of 78% can be obtained without data labels. The implication of our methodology is that a system administrator with little knowledge of the system can detect failure runs with reasonably high accuracy.


Citations: 13

Bulk GCD Computation Using a GPU to Break Weak RSA Keys

2015 IEEE International Parallel and Distributed Processing Symposium Workshop

Pub Date : 2015-05-25 DOI: 10.1109/IPDPSW.2015.54

Toru Fujita, K. Nakano, Yasuaki Ito

RSA is one of the most well-known public-key cryptosystems and is widely used for secure data transfer. An RSA encryption key includes a modulus n which is the product of two large prime numbers p and q. If an RSA modulus n can be decomposed into p and q, the corresponding decryption key can be computed easily from them and the original message can be obtained using it. The RSA cryptosystem relies on the hardness of factoring the RSA modulus. Suppose that we have many encryption keys collected from the Web. If some of them are inappropriately generated so that they share the same prime number, then they can be decomposed by computing their GCD (Greatest Common Divisor). In fact, a previously published investigation showed that a certain fraction of the RSA moduli in encryption keys on the Web share prime numbers. We may find such weak RSA moduli n by computing the GCD of many pairs of RSA moduli. The main contribution of this paper is to present a new Euclidean algorithm for computing the GCD of all pairs of encryption moduli. The idea of our new algorithm, which we call the Approximate Euclidean algorithm, is to compute an approximation of the quotient using just one 64-bit division and to use it to reduce the number of iterations of the Euclidean algorithm. We also present an implementation of the Approximate Euclidean algorithm optimized for CUDA-enabled GPUs. The experimental results show that our implementation for 1024-bit GCD on a GeForce GTX 780Ti runs more than 80 times faster than the Intel Xeon CPU implementation. Further, our GPU implementation is more than 9 times faster than the best known published GCD computation using a GPU of the same generation.
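The shared-prime weakness can be demonstrated with a short sketch on toy numbers. It uses Python's standard `math.gcd` (an ordinary GCD, not the Approximate Euclidean algorithm proposed in the paper) over all pairs of moduli; the function name `find_shared_primes` and the tiny example moduli are assumptions for illustration, and real RSA moduli are 1024 bits or more.

```python
from itertools import combinations
from math import gcd

def find_shared_primes(moduli):
    """Return {(i, j): p} for each pair of distinct moduli sharing a factor p.

    A GCD strictly between 1 and the modulus reveals a prime common to both
    keys, after which each modulus factors as p * (n // p).
    """
    shared = {}
    for (i, n_i), (j, n_j) in combinations(enumerate(moduli), 2):
        g = gcd(n_i, n_j)
        if 1 < g < n_i:  # excludes coprime pairs and identical moduli
            shared[(i, j)] = g
    return shared

# Toy moduli: the first two were (badly) generated with the same prime 101.
moduli = [101 * 103, 101 * 107, 109 * 113]
weak = find_shared_primes(moduli)  # {(0, 1): 101}
```

Once the shared prime is known, the cofactors follow by division: `moduli[0] // 101` is 103 and `moduli[1] // 101` is 107. The number of pairs grows quadratically with the number of keys, which is why a bulk computation benefits from a fast per-pair GCD.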


Citations: 10

Machine Learning Based Auto-Tuning for Enhanced OpenCL Performance Portability

2015 IEEE International Parallel and Distributed Processing Symposium Workshop

Pub Date : 2015-05-25 DOI: 10.1109/IPDPSW.2015.85

Thomas L. Falch, A. Elster

Heterogeneous computing, which combines devices with different architectures, is rising in popularity, and promises increased performance combined with reduced energy consumption. OpenCL has been proposed as a standard for programming such systems, and offers functional portability. It does, however, suffer from poor performance portability: code tuned for one device must be re-tuned to achieve good performance on another device. In this paper, we use machine learning-based auto-tuning to address this problem. Benchmarks are run on a random subset of the entire tuning parameter configuration space, and the results are used to build a model based on an artificial neural network. The model can then be used to find interesting parts of the parameter space for further search. We evaluate our method with different benchmarks on several devices, including an Intel i7 3770 CPU, an Nvidia K40 GPU and an AMD Radeon HD 7970 GPU. Our model achieves a mean relative error as low as 6.1%, and is able to find configurations as little as 1.3% worse than the global minimum.


Citations: 39

Fine-Grained Acceleration of HMMER 3.0 via Architecture-Aware Optimization on Massively Parallel Processors

2015 IEEE International Parallel and Distributed Processing Symposium Workshop

Pub Date : 2015-05-25 DOI: 10.1109/IPDPSW.2015.107

Hanyu Jiang, N. Ganesan

HMMER search, a probabilistic method based on profile hidden Markov models that is used for protein motif finding, is one of the popular tools for protein homology sequence search. The current version of HMMER (version 3.0) is highly optimized for performance on multi-core and SSE-supported systems while maintaining accuracy. The computational workhorses of the HMMER 3.0 task pipeline, the MSV and P7Viterbi stages, together consume about 95% of the execution time. These two stages can prove to be a significant bottleneck for the current implementation, and can be accelerated via architecture-aware reformulation of the algorithm, along with hybrid task- and data-level parallelism. In this work we target the core segments of the HMMER3 hmmsearch tool, viz. the MSV and P7Viterbi stages, and present a fine-grained parallelization scheme designed and implemented on Graphics Processing Units (GPUs). This three-tiered approach parallelizes the scoring of a sequence across each warp, multiple sequences within each block, and multiple blocks within the device. At the fine-grained level, this technique naturally takes advantage of the concurrency of threads within a warp, and completely eliminates the overhead of synchronization. The HMMs used for the MSV and P7Viterbi segments share several core features, with few differences. Hence the techniques developed for acceleration of the MSV segment can also be readily applied to the P7Viterbi segment. However, the presence of additional D-D transitions in the HMM for P7Viterbi induces sequential dependencies. This is handled by implementing the Lazy-F procedure as in HMMER 3.0, but for SIMT architectures in a warp-synchronous fashion. Finally, we also study scalability across multiple devices of the early Fermi architecture. Compared to the core segments (MSV and P7Viterbi) of the optimized HMMER3 task pipeline, our implementation achieves up to 5.4-fold speedup for MSV, 2.9-fold speedup for P7Viterbi, and 3.8-fold speedup for the combined pipeline on a single Kepler GPU while preserving the sensitivity and accuracy of HMMER 3.0. A multi-GPU implementation on the Fermi architecture yields up to 7.8× speedup.


Citations: 4
