
Latest Publications: 2017 IEEE High Performance Extreme Computing Conference (HPEC)

Computing structural controllability of linearly-coupled complex networks
Pub Date : 2017-09-01 DOI: 10.1109/HPEC.2017.8091064
R. Rajaei, A. Ramezani, B. Shafai
This work addresses structural controllability, a generic structure-based property that determines whether a complex network can be driven to a desired configuration. Using a measure derived from robust control theory, the paper studies the structural controllability of a class of weighted networks of networks (NetoNets) with linear couplings between their constituent networks and clusters. Unlike structural-controllability degrees rooted in graph theory, this approach exploits uncertain-systems theory to define structural controllability in a more direct and less computationally complex way. The spectrum of required control energy is also discussed. Finally, results for the proposed measure on scale-free networks are given to show that it provides an efficient and effective guarantee of full controllability of NetoNets in the presence of cluster- and network-dependency connections. The proposed measure is optimal with respect to energy-related structural control of the NetoNet, and the upper bound on the required energy is shown to be an efficient indicator of structural controllability for this class of NetoNets. Arbitrary connectivity from weakly connected vertices to their more strongly connected counterparts within clusters yields effective controllability. Consistent with seminal work on structural controllability of complex networks, which avoids driving highly connected nodes, the larger the cluster/network connectivity degree, the weaker the guarantee of full controllability of the NetoNet.
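The notions above can be made concrete with a classical (non-robust) controllability check. The Python sketch below applies the Kalman rank test and a controllability-Gramian energy indicator to a toy linearly-coupled network; the adjacency matrix, driver node, and time horizon are illustrative assumptions, and this is not the robust measure proposed in the paper.

```python
# A minimal sketch, not the paper's robust-control measure: Kalman rank test and
# Gramian-based energy indicator for a small linearly-coupled network, assuming
# dynamics x' = Ax + Bu with A taken from an illustrative adjacency matrix and
# control injected at a single node.
import numpy as np
from scipy.linalg import expm

def controllability_matrix(A, B):
    blocks, n = [B], A.shape[0]
    for _ in range(n - 1):
        blocks.append(A @ blocks[-1])
    return np.hstack(blocks)

# Toy 4-node "network of networks": two 2-node clusters joined by one coupling edge.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
B = np.array([[1.0], [0.0], [0.0], [0.0]])     # single driver node

C = controllability_matrix(A, B)
print("Kalman rank:", np.linalg.matrix_rank(C), "of", A.shape[0])

# Finite-horizon controllability Gramian W(T); its smallest eigenvalue flags the
# most energy-expensive direction to reach (small eigenvalue = high energy).
dt, T = 0.01, 2.0
W = sum(expm(A * t) @ B @ B.T @ expm(A * t).T * dt for t in np.arange(0.0, T, dt))
print("smallest Gramian eigenvalue:", float(np.linalg.eigvalsh(W).min()))
```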
Citations: 0
Lossy compression on IoT big data by exploiting spatiotemporal correlation
Pub Date : 2017-09-01 DOI: 10.1109/HPEC.2017.8091030
Aekyeung Moon, Jaeyoung Kim, Jialing Zhang, S. Son
As the volume of data generated by deployed IoT devices grows, storing and processing IoT big data becomes a major challenge. While compression, especially lossy compression, can drastically reduce data volume, finding an optimal balance between volume reduction and information loss is not easy, because data collected by diverse sensors exhibit different characteristics. Motivated by this, we present a feasibility analysis of lossy compression on agricultural sensor data, comparing the fidelity of data reconstructed by several signal-processing algorithms and by temporal difference encoding. Specifically, we evaluate five real-world sensor datasets from weather stations, one of the major IoT applications. Our experimental results indicate that the Discrete Cosine Transform (DCT) and the Fast Walsh-Hadamard Transform (FWHT) achieve higher compression ratios than the other methods, while in terms of information loss, Lossy Delta Encoding (LDE) significantly outperforms them. We also observe that, as the compression factor increases, the error rates of all compression algorithms increase. However, the impact of the introduced error is much more severe for DCT and FWHT, whereas LDE maintains a relatively lower error rate than the other methods.
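As a concrete illustration of the trade-off discussed above, the following Python sketch compresses a synthetic sensor-like series with a truncated DCT and with quantized temporal delta encoding, then compares reconstruction error. The signal, the number of retained coefficients, and the quantization step are illustrative assumptions, not the paper's datasets or settings.

```python
# Minimal sketch: DCT-based lossy compression (keep the K largest-magnitude
# coefficients) versus simple quantized temporal delta encoding on a synthetic
# temperature-like series.
import numpy as np
from scipy.fft import dct, idct

rng = np.random.default_rng(0)
t = np.linspace(0, 10, 1024)
signal = 20 + 5 * np.sin(2 * np.pi * 0.2 * t) + rng.normal(0, 0.3, t.size)

# DCT: keep only the K strongest coefficients (lossy truncation).
K = 64
coeffs = dct(signal, norm="ortho")
kept = np.zeros_like(coeffs)
top = np.argsort(np.abs(coeffs))[-K:]
kept[top] = coeffs[top]
recon_dct = idct(kept, norm="ortho")

# Delta encoding: store the first sample plus quantized differences (lossy via rounding).
deltas = np.round(np.diff(signal), 1)          # 0.1-unit quantization step
recon_delta = np.concatenate(([signal[0]], signal[0] + np.cumsum(deltas)))

def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

print("DCT  : kept", K, "of", signal.size, "coefficients, RMSE =", rmse(signal, recon_dct))
print("Delta: RMSE =", rmse(signal, recon_delta))
```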
Citations: 23
Static graph challenge on GPU
Pub Date : 2017-09-01 DOI: 10.1109/HPEC.2017.8091034
M. Bisson, M. Fatica
This paper presents the details of a CUDA implementation of the Subgraph Isomorphism Graph Challenge, a new effort aimed at driving progress in the graph-analytics field. The challenge consists of two graph analytics: triangle counting and k-truss. We present our CUDA implementations of the graph triangle-counting operation and of the k-truss subgraph decomposition. Both implementations share the same codebase and take advantage of a set-intersection operation implemented via bitmaps. The analytics are implemented in four kernels optimized for different types of graphs. At runtime, lightweight heuristics select which kernel to run based on the specific input graph.
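The bitmap set-intersection primitive mentioned above can be illustrated in a few lines of plain Python (using arbitrary-precision integers as bitmaps) rather than CUDA; the small example graph is an assumption for demonstration only.

```python
# Minimal sketch of triangle counting via bitmap set intersection.
def triangle_count(edges, n):
    # Adjacency bitmaps: bit j of adj[i] is set if edge (i, j) exists.
    adj = [0] * n
    for u, v in edges:
        adj[u] |= 1 << v
        adj[v] |= 1 << u
    count = 0
    for u, v in edges:
        # Common neighbours of u and v close a triangle over edge (u, v).
        count += bin(adj[u] & adj[v]).count("1")
    return count // 3  # each triangle is counted once per each of its 3 edges

edges = [(0, 1), (0, 2), (1, 2), (2, 3), (1, 3)]
print(triangle_count(edges, 4))  # two triangles: {0,1,2} and {1,2,3}
```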
Citations: 32
Quickly finding a truss in a haystack
Pub Date : 2017-09-01 DOI: 10.1109/HPEC.2017.8091038
Oded Green, James Fox, Euna Kim, F. Busato, N. Bombieri, Kartik Lakhotia, Shijie Zhou, Shreyas G. Singapura, Hanqing Zeng, R. Kannan, V. Prasanna, David A. Bader
The k-truss of a graph is a subgraph in which each edge is tightly connected to the remaining elements of the k-truss; it can also represent an important community in the graph. Finding the k-truss of a graph can be done in polynomial time, in contrast to finding other subgraphs such as cliques. While there are numerous formulations and algorithms for finding the maximal k-truss of a graph, many of them tend to be computationally expensive and do not scale well. Many algorithms are iterative and rerun static-graph triangle counting in each iteration. In this work we present a novel algorithm that finds both the k-truss of a graph (for a given k) and the maximal k-truss, using a dynamic-graph formulation. Our algorithm has two main benefits. 1) Unlike many algorithms that rerun static triangle counting after removing non-conforming edges, we use a new dynamic-graph formulation that only updates the edges affected by the removal; because these updates are local, we do only a fraction of the work of other algorithms. 2) Our algorithm is extremely scalable and can detect deleted triangles concurrently, in contrast to past sequential approaches. While the algorithm is architecture-independent, we show a CUDA-based implementation for NVIDIA GPUs. In numerous instances, our new algorithm is anywhere from 100X to 10000X faster than the Graph Challenge benchmark. Furthermore, it shows significant speedups, in some cases over 70X, over a recently developed sequential and highly optimized algorithm.
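To make the k-truss definition above concrete, here is a minimal static k-truss sketch in Python that repeatedly peels edges whose triangle support is too low; it does not attempt the paper's dynamic, GPU-parallel formulation.

```python
# Minimal static k-truss by repeated peeling: an edge stays only if it
# participates in at least k-2 triangles of the remaining subgraph.
def k_truss(edges, k):
    E = {tuple(sorted(e)) for e in edges}
    changed = True
    while changed:
        changed = False
        adj = {}
        for u, v in E:
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
        for u, v in list(E):
            support = len(adj[u] & adj[v])   # triangles through edge (u, v)
            if support < k - 2:
                E.discard((u, v))
                changed = True
    return E

edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4)]
print(sorted(k_truss(edges, 3)))  # edge (3, 4) lies in no triangle and is peeled away
```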
Citations: 24
Efficient and accurate Word2Vec implementations in GPU and shared-memory multicore architectures
Pub Date : 2017-09-01 DOI: 10.1109/HPEC.2017.8091076
Trevor M. Simonton, G. Alaghband
Word2Vec is a popular set of machine learning algorithms that use a neural network to generate dense vector representations of words. These vectors have proven useful in a variety of machine learning tasks. In this work, we propose new methods to increase the speed of the Word2Vec skip-gram with hierarchical softmax architecture on multi-core shared-memory CPU systems and on modern NVIDIA GPUs with CUDA. On multi-core CPUs we accomplish this by batching training operations to increase thread locality and reduce accesses to shared memory. We then propose new heterogeneous NVIDIA GPU CUDA implementations of both the skip-gram hierarchical-softmax and negative-sampling techniques that utilize shared memory registers and in-warp shuffle operations for maximum performance. Our GPU skip-gram with negative sampling produces higher-quality word vectors than previous GPU implementations, and our flexible skip-gram with hierarchical softmax achieves a factor-of-10 speedup over existing methods.
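For readers unfamiliar with the training step being accelerated, the following NumPy sketch shows a single skip-gram negative-sampling update in the style of the original word2vec reference code; vocabulary size, dimensions, and learning rate are toy values, and none of the CPU batching or GPU optimizations described above are reproduced.

```python
# Minimal skip-gram negative-sampling (SGNS) update sketch in NumPy.
import numpy as np

rng = np.random.default_rng(0)
V, D, lr = 1000, 50, 0.025             # vocab size, embedding dim, learning rate
W_in = rng.normal(0, 0.01, (V, D))     # input ("center word") vectors
W_out = np.zeros((V, D))               # output ("context word") vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_update(center, context, negatives):
    """One SGD step for a (center, context) pair plus sampled negative words."""
    h = W_in[center]
    grad_h = np.zeros_like(h)
    for word, label in [(context, 1.0)] + [(int(n), 0.0) for n in negatives]:
        score = sigmoid(h @ W_out[word])
        g = lr * (label - score)       # gradient of the log-sigmoid objective
        grad_h += g * W_out[word]
        W_out[word] += g * h
    W_in[center] += grad_h             # apply accumulated gradient to the center vector

sgns_update(center=3, context=17, negatives=rng.integers(0, V, size=5))
print(W_in[3, :5])
```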
Citations: 8
OpenCL for HPC with FPGAs: Case study in molecular electrostatics
Pub Date : 2017-09-01 DOI: 10.1109/HPEC.2017.8091078
Chen Yang, Jiayi Sheng, Rushi Patel, A. Sanaullah, Vipin Sachdeva, M. Herbordt
FPGAs have emerged as a cost-effective accelerator alternative in clouds and clusters. Programmability remains a challenge, however, with OpenCL generally recognized as a likely part of the solution. In this work we seek to advance the use of OpenCL for HPC on FPGAs in two ways. The first is by examining a core HPC application, Molecular Dynamics. The second is by examining a fundamental design pattern that we believe has not yet been described for OpenCL: passing data from a set of producer datapaths to a set of consumer datapaths, in particular where the producers generate data non-uniformly. We evaluate several designs: single-level versions in Verilog and in OpenCL, a two-level Verilog version with an optimized arbiter, and several two-level OpenCL versions with different arbitration and handshaking mechanisms, including one with an embedded Verilog module. For the Verilog designs, we find that FPGAs retain their high efficiency, with a 50× to 80× performance benefit over a single core. We also find that OpenCL may be competitive with HDLs for the straight-line versions of the code, but that for designs with more complex arbitration and handshaking, relative performance is substantially diminished.
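The producer/consumer pattern described above can be sketched behaviourally: several producers emit items at non-uniform rates and a round-robin arbiter feeds a smaller set of consumers. The Python model below is only an illustration of the arbitration idea, not an OpenCL or Verilog channel implementation.

```python
# Behavioural sketch of non-uniform producers feeding consumers through a
# round-robin arbiter.
from collections import deque
import random

random.seed(1)
producers = [deque() for _ in range(4)]   # per-producer FIFOs
consumers = [[] for _ in range(2)]        # items each consumer has received

# Non-uniform production: each producer emits a random-sized burst per cycle.
for cycle in range(8):
    for i, q in enumerate(producers):
        for _ in range(random.randint(0, i + 1)):   # later producers are busier
            q.append((i, cycle))

# Round-robin arbitration: each consumer pulls one item from the next
# non-empty producer in turn.
rr = 0
while any(producers):
    for c in consumers:
        for _ in range(len(producers)):
            q = producers[rr]
            rr = (rr + 1) % len(producers)
            if q:
                c.append(q.popleft())
                break

print([len(c) for c in consumers])  # roughly balanced despite uneven producers
```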
Citations: 21
Evaluating critical bits in arithmetic operations due to timing violations
Pub Date : 2017-09-01 DOI: 10.1109/HPEC.2017.8091090
Sungseob Whang, Tymani Rachford, Dimitra Papagiannopoulou, T. Moreshet, R. I. Bahar
Various error models are used in simulations of voltage-scaled arithmetic units to examine application-level tolerance of timing violations. The selection of an error model needs careful consideration, as differences between error models drastically affect application performance. In particular, floating-point arithmetic units (FPUs) have architectural characteristics that shape their error behavior. We examine the architecture of FPUs and design a new error model, which we call Critical Bit. We run selected benchmark applications with Critical Bit and with other widely used error-injection models to demonstrate the differences.
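To illustrate why some bits are more "critical" than others, the sketch below flips individual bits of an IEEE-754 double and reports the resulting error; this demonstrates the intuition behind position-dependent error models but is not the paper's Critical Bit model.

```python
# Flip single bits of a binary64 value: sign/exponent flips perturb the value
# far more than low mantissa flips.
import struct

def flip_bit(x: float, bit: int) -> float:
    """Return x with the given bit of its IEEE-754 binary64 encoding flipped."""
    (as_int,) = struct.unpack("<Q", struct.pack("<d", x))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped

x = 3.141592653589793
# bits 0-51: mantissa, bits 52-62: exponent, bit 63: sign
for bit in (0, 26, 51, 52, 62, 63):
    y = flip_bit(x, bit)
    print(f"bit {bit:2d}: {y!r:<25} abs error {abs(y - x):.3e}")
```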
Citations: 6
Hybrid flash arrays for HPC storage systems: An alternative to burst buffers
Pub Date : 2017-09-01 DOI: 10.1109/HPEC.2017.8091092
T. Petersen, John Bent
Cloud and high-performance computing storage systems are composed of thousands of physical storage devices and use software that organizes them into multiple data tiers based on access frequency. The characteristics of these devices lend themselves well to such tiers, as devices have differing ratios of performance to capacity. For this reason, these systems have for the past several years incorporated a mix of flash devices and mechanical spinning hard disk drives. Although a single media type would be ideal, the economic reality is that a hybrid system must use flash for performance and disk for capacity. Within the high-performance computing community, flash has been used to create a new tier called burst buffers, which are typically software-managed, user-visible, tied to a particular file system, and buffer all IO traffic before subsequent migration to disk. In this paper, we propose an alternative architecture that is hardware-managed, user-transparent, and file-system agnostic, and that buffers only small IO while allowing large sequential IO to access the disks directly. Our evaluation of this alternative architecture finds that it achieves results comparable to reported burst-buffer numbers and improves on systems composed solely of disks by several orders of magnitude at a fraction of the cost.
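The routing policy at the heart of this architecture can be sketched in a few lines: requests below a size threshold are absorbed by the flash tier, while large sequential requests go straight to disk. The threshold and request mix below are illustrative assumptions only.

```python
# Behavioural sketch of the small-IO/large-IO routing decision.
SMALL_IO_THRESHOLD = 128 * 1024   # bytes; route smaller requests to flash (assumed value)

def route(request_size: int) -> str:
    """Small IO is buffered in flash; large sequential IO bypasses to disk."""
    return "flash" if request_size < SMALL_IO_THRESHOLD else "disk"

requests = [512, 4096, 64 * 1024, 1 * 1024 * 1024, 8 * 1024 * 1024]
for size in requests:
    print(f"{size:>9d} B -> {route(size)}")
```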
Citations: 9
GPU accelerated gigabit level BCH and LDPC concatenated coding system
Pub Date : 2017-09-01 DOI: 10.1109/HPEC.2017.8091021
Selcuk Keskin, T. Koçak
Increasing data traffic and multimedia services in recent years have paved the way for the development of optical transmission methods for high-bandwidth communication systems. To meet very high throughput requirements, dedicated application-specific integrated circuit and field-programmable gate array solutions for low-density parity-check (LDPC) decoding have been proposed in recent years. Software solutions, by contrast, are less expensive, scalable, and flexible, and have shorter development cycles. A natural way to lower the error floor is to concatenate the LDPC code with an algebraic outer code that cleans up the residual errors. In this paper, we present the design and parallel software implementation of LDPC decoding on general-purpose graphics processing units as the inner code and BCH decoding as the outer code, achieving excellent error-correcting performance. Experimental results show that the proposed GPU-based concatenated decoder achieves a maximum decoding throughput of 1.82 Gbps at 10 iterations with a low bit-error rate (BER).
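The inner/outer structure of a concatenated code can be illustrated with toy component codes. The sketch below uses a repetition-3 inner code and a Hamming(7,4) outer code purely to show the pipeline (the inner decoder cleans most channel errors, the outer decoder removes the residue); it does not implement LDPC or BCH, and the channel error rate is an illustrative assumption.

```python
# Toy concatenated-coding pipeline: outer Hamming(7,4), inner repetition-3,
# binary symmetric channel.
import numpy as np

rng = np.random.default_rng(2)
G = np.hstack([np.eye(4, dtype=int),
               np.array([[1, 1, 0], [1, 0, 1], [0, 1, 1], [1, 1, 1]])])   # Hamming(7,4) generator
H = np.hstack([G[:, 4:].T, np.eye(3, dtype=int)])                         # parity-check matrix

def outer_encode(bits4):
    return bits4 @ G % 2

def outer_decode(word7):
    s = H @ word7 % 2
    if s.any():                                     # single-error correction via syndrome lookup
        col = int(np.flatnonzero((H.T == s).all(axis=1))[0])
        word7 = word7.copy()
        word7[col] ^= 1
    return word7[:4]

def inner_encode(bits):
    return np.repeat(bits, 3)                       # repetition-3

def inner_decode(bits):
    return (bits.reshape(-1, 3).sum(axis=1) > 1).astype(int)   # majority vote

data = rng.integers(0, 2, 4)
tx = inner_encode(outer_encode(data))               # outer code first, inner code last
rx = tx ^ (rng.random(tx.size) < 0.05).astype(int)  # noisy channel
decoded = outer_decode(inner_decode(rx))            # inner cleans, outer mops up residue
print(data, decoded, bool((data == decoded).all()))
```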
Citations: 5
High-performance low-energy implementation of cryptographic algorithms on a programmable SoC for IoT devices
Pub Date : 2017-09-01 DOI: 10.1109/HPEC.2017.8091062
Boyou Zhou, Manuel Egele, A. Joshi
Due to the severe power and timing constraints of the "things" in the Internet of Things (IoT), cryptography is expensive for these devices. Custom hardware provides a viable solution. However, implementations of cryptographic algorithms in these devices need to be upgraded frequently compared to the longevity of the "things" themselves. Therefore, there is a critical need for reconfigurable, low-power, high-performance cryptography implementations for IoT devices. In this paper, we propose using an FPGA as the reconfigurable substrate for cryptographic operations. We demonstrate the proposed approach on a Zedboard, which has two ARM cores and a Zynq FPGA. The implemented cryptographic algorithms include symmetric cryptography, asymmetric cryptography, and secure hash functions. We also integrate our cryptographic engines with the OpenSSL library to inherit the library's support for block cipher modes. Our approach shows that the FPGA-based reconfigurable cryptographic components consume between 1.8× and 4033× less energy and run between 1.6× and 2983× faster than the software implementation. At the same time, the FPGA implementation of cryptographic operations is more flexible than custom hardware implementations of cryptographic components.
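For context on the kind of software baseline such comparisons use, the sketch below times SHA-256 (Python's hashlib) and AES-256-CTR throughput. The AES part assumes the third-party `cryptography` package (an OpenSSL binding) is installed; the numbers are indicative only and unrelated to the paper's measurements.

```python
# Rough software-baseline throughput measurement for a hash and a block cipher mode.
import hashlib, os, time
from cryptography.hazmat.backends import default_backend
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

data = os.urandom(64 * 1024 * 1024)          # 64 MiB test buffer

t0 = time.perf_counter()
hashlib.sha256(data).digest()
t1 = time.perf_counter()
print(f"SHA-256 : {len(data) / (t1 - t0) / 1e6:.1f} MB/s")

key, nonce = os.urandom(32), os.urandom(16)
enc = Cipher(algorithms.AES(key), modes.CTR(nonce), backend=default_backend()).encryptor()
t0 = time.perf_counter()
enc.update(data)
enc.finalize()
t1 = time.perf_counter()
print(f"AES-CTR : {len(data) / (t1 - t0) / 1e6:.1f} MB/s")
```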
Citations: 14