
Latest publications from IEEE Transactions on Multi-Scale Computing Systems

A Monolithic 3D Hybrid Architecture for Energy-Efficient Computation
Pub Date: 2018-11-20 DOI: 10.1109/TMSCS.2018.2882433
Ye Yu;Niraj K. Jha
The exponentially increasing performance of chip multiprocessors (CMPs) predicted by Moore's Law no longer comes from raising the clock rate of a single CPU core, but from growing core counts in the CMP. More transistors are integrated within the same footprint as the technology node shrinks, delivering higher performance. However, this is accompanied by higher power dissipation that usually exceeds the capability of inexpensive cooling techniques. This Power Wall prevents the chip from running at full speed with all devices powered on, a limitation known as the dark silicon problem. Another major bottleneck in CMP development is the imbalance between CPU clock rate and memory access speed: this Memory Wall keeps the CPU from fully utilizing its compute power. To address both the Power and Memory Walls, we propose a monolithic 3D hybrid architecture that consists of a multi-core CPU tier, a fine-grain dynamically reconfigurable (FDR) field-programmable gate array (FPGA) tier, and multiple resistive RAM (RRAM) tiers. The FDR tier serves as an accelerator; it uses temporal logic folding to localize on-chip communication. The RRAM tiers are connected to the CPU and FDR tiers through an efficient memory interface that exploits the tremendous bandwidth of monolithic inter-tier vias and hides the latency of large data transfers. We evaluate the architecture on two types of benchmarks, compute-intensive and memory-intensive, and show that it significantly reduces both power and energy while improving performance for both. Compared to the baseline, our architecture achieves an average speedup of 43.1× on compute-intensive and 2.5× on memory-intensive benchmarks. Power and energy consumption are reduced by 5.0× and 40.5×, respectively, for compute-intensive applications, and by 2.0× and 4.2× for memory-intensive applications. This translates to a 1745.3× energy-delay product (EDP) improvement for compute-intensive applications and 10.5× for memory-intensive applications.
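Since EDP is energy times delay, the reported improvement factors should be approximately the product of the energy-reduction and speedup factors. A quick arithmetic check, for illustration (the small gap in the compute-intensive case comes from rounding of the reported averages):

```python
def edp_improvement(speedup, energy_reduction):
    # EDP = energy x delay, so the improvement factor is the product of
    # the energy-reduction factor and the speedup (delay-reduction) factor
    return energy_reduction * speedup

# compute-intensive: 40.5x energy x 43.1x speedup ~ the reported 1745.3x
print(edp_improvement(43.1, 40.5))  # ~1745.6
# memory-intensive: 4.2x energy x 2.5x speedup = the reported 10.5x
print(edp_improvement(2.5, 4.2))    # 10.5
```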
IEEE Transactions on Multi-Scale Computing Systems, vol. 4, no. 4, pp. 533-547, 2018.
Citations: 7
H$^2$OEIN: A Hierarchical Hybrid Optical/Electrical Interconnection Network for Exascale Computing Systems
Pub Date: 2018-11-18 DOI: 10.1109/TMSCS.2018.2881715
Yunfeng Lu;Huaxi Gu;Krishnendu Chakrabarty;Yintang Yang
The performance of high-performance computing (HPC) systems is largely determined by the interconnection network. The rising demand for computing capability leads to an expansion of the interconnection network and a corresponding increase in system cost and power consumption. The growing use of optical interconnects not only reduces network cost and power consumption, but also meets the bandwidth demands of system scaling. However, unlike an electrical switch, an optical switch lacks buffers, which makes it hard to operate an all-optical network at packet-level granularity. In this paper, we propose a hierarchical hybrid optical/electrical interconnection network (H$^2$OEIN) based on low-radix switches and arrayed waveguide grating routers (AWGRs). In the lower layers, the use of low-radix switches results in lower cost and power consumption, and the modular structure they form facilitates expansion of the network. At the higher layers, high bandwidth and fast switching are achieved using AWGR-based optical interconnects; because these layers are passive, power consumption is reduced substantially. Network simulation results show that H$^2$OEIN reduces cost by 25 percent and power consumption by 45 percent compared to a dragonfly network in configurations with over 300,000 nodes.
IEEE Transactions on Multi-Scale Computing Systems, vol. 4, no. 4, pp. 722-733, 2018.
Citations: 0
A Novel, Simulator for Heterogeneous Cloud Systems that Incorporate Custom Hardware Accelerators
Pub Date: 2018-11-04 DOI: 10.1109/TMSCS.2018.2879601
Nikolaos Tampouratzis;Ioannis Papaefstathiou
The growing use of hardware accelerators in both embedded (e.g., automotive) and high-end systems (e.g., Cloud infrastructure) triggers an urgent demand for simulation frameworks that can simulate, in an integrated manner, all the components (i.e., CPUs, memories, networks, and hardware accelerators) of a system-under-design (SuD). By utilizing such a simulator, software design can proceed in parallel with hardware development, reducing the all-important time-to-market. The main problem, however, is that there is currently a shortage of such simulation frameworks; most simulators used for modelling user applications (i.e., full-system CPU/memory/peripheral simulators) lack any support for tailor-made hardware accelerators. The presented ACSIM framework is the first known open-source, high-performance simulator that can holistically handle systems-of-systems including processors, peripherals, accelerators, and networks; such an approach is, for example, very appealing for the design of Cloud servers that incorporate FPGAs as PCI-connected accelerators. ACSIM is an extension of the COSSIM simulation framework; it integrates, in a novel and efficient way, a combined system and network simulator with a SystemC simulator, in a manner transparent to the end user. ACSIM has been evaluated on several real-world use cases; the results demonstrate that the presented approach has up to 99 percent accuracy in the reported SuD aspects (when compared with the corresponding characteristics measured on the real systems), while the overall simulation time scales almost linearly with the number of CPUs utilized by the simulator. More importantly, the presented interconnection scheme between the processing and SystemC simulators is orders of magnitude faster than existing solutions, and ACSIM can efficiently simulate up to several hundred processing nodes with hardware accelerators interconnected together, in a fully distributed manner.
IEEE Transactions on Multi-Scale Computing Systems, vol. 4, no. 4, pp. 565-576, 2018.
Citations: 2
Enforcing End-to-End I/O Policies for Scientific Workflows Using Software-Defined Storage Resource Enclaves
Pub Date: 2018-11-01 DOI: 10.1109/TMSCS.2018.2879096
Suman Karki;Bao Nguyen;Joshua Feener;Kei Davis;Xuechen Zhang
Data-intensive knowledge discovery requires scientific applications to run concurrently with analytics and visualization codes executing in situ for timely output inspection and knowledge extraction. Consequently, the I/O pipelines of scientific workflows can be long and complex, because they comprise many stages of analytics across different layers of the I/O stack of high-performance computing systems. Performance limitations at any I/O layer or stage can cause an I/O bottleneck, resulting in greater than expected end-to-end I/O latency. In this paper, we present the design and implementation of a novel system-level data management infrastructure called Software-Defined Storage Resource Enclaves (SIREN) to enforce end-to-end policies that dictate an I/O pipeline's performance. SIREN provides an I/O performance interface for users to specify the desired storage resources in the context of in-situ analytics. If suboptimal analytics performance is caused by an I/O bottleneck when data are transferred between simulations and analytics, schedulers in different layers of the I/O stack automatically provide guaranteed lower bounds on I/O throughput. Our experimental results demonstrate that SIREN provides performance isolation among scientific workflows sharing multiple storage servers across two I/O layers (burst buffer and parallel file systems) while maintaining high system scalability and resource utilization.
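As a rough illustration of what a lower-bound throughput guarantee can look like, the sketch below implements a generic reservation-then-share allocation policy. This is a toy model, not SIREN's actual schedulers (which span the burst-buffer and parallel-file-system layers); the function name and the even-split policy for leftover capacity are invented for illustration.

```python
def allocate_bandwidth(total, reservations, demands):
    """Reservation-then-share allocation (illustrative): each workflow first
    receives min(demand, reservation), so its reserved lower bound is honored
    whenever it has that much demand; leftover capacity is then split evenly
    among workflows whose demand is still unmet."""
    alloc = [min(d, r) for d, r in zip(demands, reservations)]
    leftover = total - sum(alloc)
    while leftover > 1e-9:
        unmet = [i for i, d in enumerate(demands) if alloc[i] < d]
        if not unmet:
            break  # all demands satisfied; remaining capacity stays idle
        share = leftover / len(unmet)
        for i in unmet:
            grant = min(share, demands[i] - alloc[i])
            alloc[i] += grant
            leftover -= grant
    return alloc

# two workflows reserve 30 and 20 units of a 100-unit server
print(allocate_bandwidth(100, [30, 20], [80, 25]))  # [75.0, 25]
```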
IEEE Transactions on Multi-Scale Computing Systems, vol. 4, no. 4, pp. 662-675, 2018.
Citations: 2
Low Register-Complexity Systolic Digit-Serial Multiplier Over $GF(2^m)$ Based on Trinomials
Pub Date: 2018-10-28 DOI: 10.1109/TMSCS.2018.2878437
Jiafeng Xie;Pramod Kumar Meher;Xiaojun Zhou;Chiou-Yng Lee
Digit-serial systolic multipliers over $GF(2^m)$ based on the trinomials recommended by the National Institute of Standards and Technology (NIST) play a critical role in the real-time operation of cryptosystems. Systolic multipliers over $GF(2^m)$ involve a large number of registers, of size $O(m^2)$, which results in a significant increase in area complexity. In this paper, we propose a novel low register-complexity digit-serial trinomial-based finite field multiplier. The proposed architecture is derived through two coherent, interdependent stages: (i) derivation of an efficient hardware-oriented algorithm based on a novel input-operand feeding scheme, and (ii) design of a novel low register-complexity systolic structure based on the proposed algorithm. The extension of the proposed design to a Karatsuba algorithm (KA)-based structure is also presented. The proposed design is synthesized for FPGA implementation, and the design based on the regular multiplication process is shown to achieve more than 12.1 percent saving in area-delay product and nearly 2.8 percent saving in power-delay product. To the best of the authors' knowledge, the register-complexity of the proposed structure is so far the least among competing designs for trinomial-based systolic multipliers (for the same type of multiplication algorithm).
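For readers unfamiliar with trinomial-based $GF(2^m)$ arithmetic, the sketch below gives a minimal bit-serial software model of multiplication modulo $x^m + x^k + 1$, with defaults chosen for the NIST B-233 trinomial $x^{233} + x^{74} + 1$. It illustrates only the underlying field arithmetic, not the paper's digit-serial systolic architecture.

```python
def gf2m_mul(a, b, m=233, k=74):
    """Multiply two elements of GF(2^m), each encoded as an integer whose
    bit i is the coefficient of x^i, modulo the trinomial x^m + x^k + 1."""
    # carry-less (polynomial) multiplication over GF(2)
    p = 0
    while b:
        if b & 1:
            p ^= a
        a <<= 1
        b >>= 1
    # reduce: for every set bit x^i with i >= m, substitute
    # x^i = x^(i-m) * x^m = x^(i-m+k) + x^(i-m)   (since x^m = x^k + 1)
    for i in range(2 * m - 2, m - 1, -1):
        if (p >> i) & 1:
            p ^= (1 << i) | (1 << (i - m + k)) | (1 << (i - m))
    return p
```

In the small field GF(2^3) with modulus $x^3 + x + 1$, for example, `gf2m_mul(0b100, 0b100, m=3, k=1)` computes $x^2 \cdot x^2 = x^4 \equiv x^2 + x$, i.e., `0b110`.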
IEEE Transactions on Multi-Scale Computing Systems, vol. 4, no. 4, pp. 773-783, 2018.
Citations: 7
Frequency Offset-Based Ring Oscillator Physical Unclonable Function
Pub Date: 2018-10-24 DOI: 10.1109/TMSCS.2018.2877737
Jiliang Zhang;Xiao Tan;Yuanjing Zhang;Weizheng Wang;Zheng Qin
A weak Physical Unclonable Function (PUF) is a promising lightweight hardware security primitive used for secret key generation without requiring secure non-volatile electrically erasable programmable read-only memory (EEPROM) or battery-backed static random-access memory (SRAM), which suits resource-limited applications such as the Internet of Things (IoT) and embedded systems. The Ring Oscillator (RO) PUF is one of the most popular weak PUFs; it generates a volatile key by comparing the frequency difference between any two ROs. However, it is difficult for an RO PUF to maintain an absolutely stable response as the operating environment varies. To eliminate the impact of environmental factors, previous RO PUFs incur significant hardware overheads to improve reliability. This paper proposes a frequency offset-based RO PUF structure that exhibits high reliability and low hardware overhead. The key idea is to make the frequency difference larger than a given threshold by offsetting the frequencies of RO pairs, thereby improving reliability. A prototype implementation on Xilinx 65 nm Field-Programmable Gate Arrays (FPGAs) shows the low overhead of the new structure and 100 percent reliability over a temperature range of 45 °C to 95 °C.
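The key idea, offsetting frequencies so that every RO pair's gap exceeds a reliability threshold, can be sketched as follows. This is a conceptual toy model: the function names, and the policy of shifting one frequency while preserving the sign of the comparison, are illustrative assumptions, not the paper's offset circuit.

```python
def puf_bit(f1, f2):
    """Response bit from one RO pair: compare measured frequencies."""
    return 1 if f1 > f2 else 0

def stabilizing_offset(f1, f2, threshold):
    """Offset to add to f1 so that |(f1 + offset) - f2| >= threshold,
    keeping the sign of the comparison (and hence the response bit)
    unchanged while pushing the pair past the reliability threshold."""
    diff = f1 - f2
    if abs(diff) >= threshold:
        return 0.0  # pair is already reliable as-is
    # push f1 further away from f2, in the direction it already leans
    return (threshold - diff) if diff >= 0 else -(threshold + diff)
```

With a 1 MHz threshold, a marginal pair at 100.0 and 100.2 MHz gets f1 shifted down by 0.8 MHz, so environmental noise smaller than the threshold can no longer flip the bit.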
IEEE Transactions on Multi-Scale Computing Systems, vol. 4, no. 4, pp. 711-721, 2018.
Citations: 21
TPR: Traffic Pattern-Based Adaptive Routing for Dragonfly Networks
Pub Date: 2018-10-21 DOI: 10.1109/TMSCS.2018.2877264
Peyman Faizian;Juan Francisco Alfaro;Md Shafayat Rahman;Md Atiqul Mollah;Xin Yuan;Scott Pakin;Michael Lang
The Cray Cascade architecture uses Dragonfly as its interconnect topology and employs a globally adaptive routing scheme called UGAL. UGAL directs traffic based on link loads, but may make inappropriate adaptive routing decisions in various situations, which degrades its performance. In this work, we propose traffic pattern-based adaptive routing (TPR) for Dragonfly, which improves UGAL by incorporating a traffic pattern-based adaptation mechanism. The idea is to explicitly use the link-usage statistics collected in performance counters to infer the traffic pattern, and to take the inferred traffic pattern, in addition to link loads, into consideration when making adaptive routing decisions. Our performance evaluation on a diverse set of traffic conditions indicates that, by incorporating the traffic pattern-based adaptation mechanism, TPR is much more effective at making adaptive routing decisions, achieving significantly lower latency under low load and higher throughput under high load than the underlying UGAL.
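A minimal sketch of the decision rule involved: baseline UGAL compares queue-depth-weighted path costs, and a TPR-style variant additionally biases the choice with a pattern score inferred from link-usage counters. The bias term and the score are invented for illustration; the actual TPR inference mechanism is defined in the paper.

```python
def ugal_choice(q_min, hops_min, q_nonmin, hops_nonmin):
    # baseline UGAL: estimated latency ~ queue occupancy x hop count;
    # route minimally unless the non-minimal (Valiant) path looks cheaper
    return "minimal" if q_min * hops_min <= q_nonmin * hops_nonmin else "nonminimal"

def tpr_choice(q_min, hops_min, q_nonmin, hops_nonmin, adversarial_score=0.0):
    # TPR-flavored variant (illustrative): adversarial_score in [0, 1] stands in
    # for the traffic pattern inferred from link-usage performance counters;
    # a higher score penalizes the minimal path, favoring non-minimal routing
    cost_min = (1.0 + adversarial_score) * q_min * hops_min
    return "minimal" if cost_min <= q_nonmin * hops_nonmin else "nonminimal"
```

For equal estimated costs (e.g., queues 2 and 1 over 2 and 4 hops), plain UGAL stays minimal, whereas a high adversarial score tips the same decision toward the non-minimal path.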
IEEE Transactions on Multi-Scale Computing Systems, vol. 4, no. 4, pp. 931-943, 2018.
引用次数: 9
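The decision logic described in the TPR abstract — UGAL's load-based choice between minimal and non-minimal paths, extended with a pattern inferred from link counters — can be sketched as follows. This is an illustrative sketch only, not the paper's actual algorithm: the function names, the single-threshold inference rule, and the bias values are all assumptions.

```python
def ugal_choice(q_min, hops_min, q_nonmin, hops_nonmin, bias=0):
    # UGAL-style decision: compare queue occupancy weighted by path
    # length; a positive bias favors the minimal path.
    if q_min * hops_min <= q_nonmin * hops_nonmin + bias:
        return "minimal"
    return "nonminimal"

def infer_pattern(global_link_counters, threshold=0.75):
    # Hypothetical inference from performance counters: if one global
    # link carries most of the recorded traffic, treat the pattern as
    # adversarial (a hot inter-group link); otherwise as uniform.
    total = sum(global_link_counters.values())
    if total == 0:
        return "uniform"
    peak = max(global_link_counters.values())
    return "adversarial" if peak / total >= threshold else "uniform"

def tpr_choice(q_min, hops_min, q_nonmin, hops_nonmin, counters):
    # TPR-style decision (sketch): fold the inferred pattern into the
    # UGAL bias, favoring non-minimal (Valiant-style) paths when the
    # counters suggest an adversarial pattern.
    bias = -2 if infer_pattern(counters) == "adversarial" else 2
    return ugal_choice(q_min, hops_min, q_nonmin, hops_nonmin, bias)
```

With the same queue lengths, the inferred pattern alone can flip the decision: under a hot global link the sketch routes non-minimally, while under uniform counters it keeps the minimal path.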
A New Fluid-Chip Co-Design for Digital Microfluidic Biochips Considering Cost Drivers and Design Convergence
Pub Date : 2018-10-05 DOI: 10.1109/TMSCS.2018.2874248
Arpan Chakraborty;Piyali Datta;Rajat Kumar Pal
The design process for digital microfluidic biochips (DMFBs) is becoming more complex due to the growing need for essential bio-protocols. A number of significant fluid- and chip-level synthesis tools have previously been proposed for designing efficient systems. Several important cost drivers, such as bioassay schedule length, total pin count, congestion-free wiring, total wire length, and total layer count, together determine the efficiency of a DMFB. Moreover, existing design gaps among the fluid-level and chip-level sub-tasks make the design process expensive, delaying time-to-market and increasing overall cost. In this context, eliminating design cycles among the sub-tasks is a prerequisite for obtaining a low-cost and efficient platform. Hence, this paper proposes a fluid-chip co-design methodology that accounts for the fluid- and chip-level cost drivers while reducing the design cycles between them. A simulation study on a number of benchmarks is presented to evaluate the performance.
{"title":"A New Fluid-Chip Co-Design for Digital Microfluidic Biochips Considering Cost Drivers and Design Convergence","authors":"Arpan Chakraborty;Piyali Datta;Rajat Kumar Pal","doi":"10.1109/TMSCS.2018.2874248","DOIUrl":"https://doi.org/10.1109/TMSCS.2018.2874248","url":null,"abstract":"The design process for digital microfluidic biochips (DMFBs) is becoming more complex due to the growing need for essential bio-protocols. A number of significant fluid- and chip-level synthesis tools have been offered previously for designing an efficient system. Several important cost drivers like bioassay schedule length, total pin count, congestion-free wiring, total wire length, and total layer count together measure the efficiency of the DMFBs. Besides, existing design gaps among the sub-tasks of the fluid and chip level make the design process expensive delaying the time-to-market and increasing the overall cost. In this context, removal of design cycles among the sub-tasks is a prior need to obtain a low-cost and efficient platform. Hence, this paper aims to propose a fluid-chip co-design methodology in dealing with the consideration of the fluid-chip cost drivers, while reducing the design cycles in between. A simulation study considering a number of benchmarks has been presented to observe the performance.","PeriodicalId":100643,"journal":{"name":"IEEE Transactions on Multi-Scale Computing Systems","volume":"4 4","pages":"548-564"},"PeriodicalIF":0.0,"publicationDate":"2018-10-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TMSCS.2018.2874248","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68023993","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 4
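The co-design trade-off this abstract describes — scoring candidate designs jointly against fluid-level and chip-level cost drivers rather than optimizing each level in isolation — can be illustrated with a simple weighted cost function. The weights, candidate values, and function names below are purely hypothetical; the paper's actual methodology is not a weighted sum.

```python
def dmfb_cost(design, weights=None):
    # Weighted sum over the cost drivers named in the abstract; the
    # weights are hypothetical and would be tuned per technology.
    weights = weights or {"schedule_len": 1.0, "pin_count": 0.5,
                          "wire_len": 0.1, "layer_count": 10.0}
    return sum(weights[k] * design[k] for k in weights)

def pick_design(candidates, weights=None):
    # Co-design selection: a faster schedule (fluid level) may lose to
    # a design with fewer pins, shorter wires, or fewer layers (chip
    # level) once all drivers are scored together.
    return min(candidates, key=lambda d: dmfb_cost(d, weights))

candidates = [
    {"name": "A", "schedule_len": 40, "pin_count": 20,
     "wire_len": 300, "layer_count": 2},
    {"name": "B", "schedule_len": 35, "pin_count": 28,
     "wire_len": 420, "layer_count": 3},
]
```

Here candidate B has the shorter bioassay schedule, yet candidate A wins overall because its chip-level drivers (pins, wire length, layers) are cheaper — the kind of cross-level trade-off a co-design flow is meant to expose.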
SIRIUS: Enabling Progressive Data Exploration for Extreme-Scale Scientific Data
Pub Date : 2018-10-01 DOI: 10.1109/TMSCS.2018.2886851
Zhenbo Qiao;Tao Lu;Huizhang Luo;Qing Liu;Scott Klasky;Norbert Podhorszki;Jinzhen Wang
Scientific simulations on high performance computing (HPC) platforms generate large quantities of data. To bridge the widening gap between compute and I/O, and to enable data to be stored and analyzed more efficiently, simulation outputs need to be refactored, reduced, and appropriately mapped to storage tiers. However, a systematic solution supporting these steps has been lacking in the current HPC software ecosystem. To that end, this paper develops SIRIUS, a progressive, JPEG-like data management scheme for storing and analyzing big scientific data. It co-designs data decimation, compression, and data storage, taking the hardware characteristics of each storage tier into consideration. With reasonably low overhead, our approach refactors simulation data, using either topological or uniform decimation, into a much smaller, reduced-accuracy base dataset and a series of deltas that can be used to augment the accuracy as needed. The base dataset and deltas are compressed and written to multiple storage tiers. Data saved on different tiers can then be selectively retrieved to restore the level of accuracy that satisfies a given analysis. Thus, SIRIUS provides a paradigm shift toward elastic data analytics and enables end users to trade off analysis speed against accuracy on the fly. This paper further develops algorithms that preserve statistics during data decimation, a common requirement when reducing data. We assess the impact of SIRIUS on unstructured triangular meshes, a pervasive data model in scientific simulations. In particular, we evaluate two realistic use cases: blob detection in fusion and high-pressure area extraction in computational fluid dynamics.
{"title":"SIRIUS: Enabling Progressive Data Exploration for Extreme-Scale Scientific Data","authors":"Zhenbo Qiao;Tao Lu;Huizhang Luo;Qing Liu;Scott Klasky;Norbert Podhorszki;Jinzhen Wang","doi":"10.1109/TMSCS.2018.2886851","DOIUrl":"https://doi.org/10.1109/TMSCS.2018.2886851","url":null,"abstract":"Scientific simulations on high performance computing (HPC) platforms generate large quantities of data. To bridge the widening gap between compute and I/O, and enable data to be more efficiently stored and analyzed, simulation outputs need to be refactored, reduced, and appropriately mapped to storage tiers. However, a systematic solution to support these steps has been lacking in the current HPC software ecosystem. To that end, this paper develops SIRIUS, a progressive JPEG-like data management scheme for storing and analyzing big scientific data. It co-designs data decimation, compression, and data storage, taking the hardware characteristics of each storage tier into considerations. With reasonably low overhead, our approach refactors simulation data, using either topological or uniform decimation, into a much smaller, reduced-accuracy base dataset, and a series of deltas that is used to augment the accuracy if needed. The base dataset and deltas are compressed and written to multiple storage tiers. Data saved on different tiers can then be selectively retrieved to restore the level of accuracy that satisfies data analytics. Thus, SIRIUS provides a paradigm shift towards elastic data analytics and enables end users to make trade-offs between analysis speed and accuracy on-the-fly. This paper further develops algorithms to preserve statistics for data decimation, a common requirement for reducing data. We assess the impact of SIRIUS on unstructured triangular meshes, a pervasive data model used in scientific simulations. 
In particular, we evaluate two realistic use cases: the blob detection in fusion and high-pressure area extraction in computational fluid dynamics.","PeriodicalId":100643,"journal":{"name":"IEEE Transactions on Multi-Scale Computing Systems","volume":"4 4","pages":"900-913"},"PeriodicalIF":0.0,"publicationDate":"2018-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TMSCS.2018.2886851","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68025495","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 2
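The base-plus-deltas refactoring at the heart of SIRIUS can be illustrated with uniform decimation on a 1-D array: a coarse base gives a reduced-accuracy view, and adding the stored delta restores full accuracy on demand. This is a minimal sketch under assumed names and a single-delta scheme; the real system uses multiple delta levels, topological decimation for meshes, compression, and tier placement.

```python
import numpy as np

def refactor(data, factor=4):
    # Base dataset: uniform decimation, keeping every `factor`-th sample.
    base = data[::factor].copy()
    # Delta: the residual that linear interpolation from the base
    # misses at full resolution.
    x = np.arange(len(data))
    approx = np.interp(x, x[::factor], base)
    return base, data - approx

def restore(base, delta, n, factor=4, with_delta=False):
    # Base-only restore yields a fast, reduced-accuracy view (e.g. from
    # a nearby tier); adding the delta recovers the original exactly.
    x = np.arange(n)
    approx = np.interp(x, x[::factor], base)
    return approx + delta if with_delta else approx
```

An analysis that only needs a rough view reads the small base from fast storage; when the required accuracy rises, the delta is fetched from a capacity tier and applied — the progressive, JPEG-like retrieval the abstract describes.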
2018 Index IEEE Transactions on Multi-Scale Computing Systems Vol. 4
Pub Date : 2018-10-01 DOI: 10.1109/TMSCS.2019.2902963
Presents the 2018 subject/author index for this publication.
{"title":"2018 Index IEEE Transactions on Multi-Scale Computing Systems Vol. 4","authors":"","doi":"10.1109/TMSCS.2019.2902963","DOIUrl":"https://doi.org/10.1109/TMSCS.2019.2902963","url":null,"abstract":"Presents the 2018 subject/author index for this publication.","PeriodicalId":100643,"journal":{"name":"IEEE Transactions on Multi-Scale Computing Systems","volume":"4 4","pages":"1-14"},"PeriodicalIF":0.0,"publicationDate":"2018-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TMSCS.2019.2902963","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68024193","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0