
Latest publications from the 2019 IEEE 13th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)

Integrating Intra-and Intercellular Simulation of a 2D HL-1 Cardiac Model Based on Embedded GPUs
Baohua Liu, W. Shen, Xin Zhu, Xingyu Wangchen
Simulation of electrophysiological cardiac models enables researchers to investigate the activity of the heart under various circumstances. Fortunately, recent developments in embedded parallel computing architectures have made it possible to efficiently simulate, on embedded computing devices, sophisticated electrophysiological models that match real conditions, a task that in the past typically relied on large-scale CPU or GPU clusters. In this paper, a simultaneous implementation of a 2D Takeuchi HL-1 cardiac model combining unicellular and intercellular solvers is proposed and evaluated on an NVIDIA Jetson Tegra X2 embedded computer. The experimental results demonstrate that our implementation yields a considerable efficiency improvement over non-simultaneous methods, without loss of simulation accuracy. Moreover, the results also show that embedded devices are much more energy-efficient than conventional systems for this simulation.
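The fused intra-/intercellular update described above can be illustrated with a minimal 2D reaction-diffusion sketch. Note that the FitzHugh-Nagumo kinetics, grid size, and parameters below are hypothetical stand-ins chosen for brevity; the actual Takeuchi HL-1 ionic model is far more detailed:

```python
import numpy as np

def step(v, w, dt=0.01, dx=1.0, D=0.1):
    """One fused time step: intercellular diffusion plus intracellular ODEs.

    v: membrane-potential grid, w: recovery-variable grid.
    The FitzHugh-Nagumo kinetics below stand in for the far more
    detailed Takeuchi HL-1 ionic model used in the paper.
    """
    # Intercellular part: 5-point Laplacian with edge-padded (no-flux) boundaries
    vp = np.pad(v, 1, mode="edge")
    lap = (vp[:-2, 1:-1] + vp[2:, 1:-1] +
           vp[1:-1, :-2] + vp[1:-1, 2:] - 4 * v) / dx**2
    # Intracellular part: local ionic kinetics, independent per cell
    dv = v - v**3 / 3 - w
    dw = 0.08 * (v + 0.7 - 0.8 * w)
    return v + dt * (D * lap + dv), w + dt * dw

v = np.zeros((64, 64))
w = np.zeros((64, 64))
v[:8, :8] = 1.0  # stimulate one corner of the tissue sheet
for _ in range(100):
    v, w = step(v, w)
```

In a GPU implementation both parts can be fused into one kernel launch per time step, which is the kind of combined intra-/intercellular solver the abstract refers to.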
Citations: 0
MITRACA: A Next-Gen Heterogeneous Architecture
Riadh Ben Abdelhamid, Y. Yamaguchi, T. Boku
GPUs (Graphics Processing Units) and CPUs (Central Processing Units) offer sufficient and appropriate performance for massively parallel applications such as AI, big data, and materials science. However, their real performance is far lower than their theoretical peaks. The primary reason for this degradation is that they suffer from limited memory bandwidth and an interconnection topology that is not optimized for these types of applications. Thus, from the viewpoint of real computational performance, i.e., computational efficiency, the FPGA (Field-Programmable Gate Array) is becoming an attractive chip for such massively parallel applications. An FPGA can efficiently provide optimized communication and bridge different computing accelerators as customized hardware. In other words, FPGA-based hardware accelerators offer a convenient path to both high performance and high memory bandwidth. One serious concern, however, is usability: FPGA design using a hardware description language is a meticulous task that requires specialized skill sets and a long time to market. An overlay architecture is an appropriate candidate to resolve this issue because it offers a software layer that simplifies FPGA programmability by abstracting the fabric resources. This article therefore proposes an overlay architecture based on a tightly connected many-core CGRA (Coarse-Grained Reconfigurable Architecture), which helps software engineers implement their applications seamlessly. Our final goal is not current fine-grained FPGAs but new middle-to-coarse-grained programmable chips: if an ASIC (Application-Specific Integrated Circuit) implementation were adopted, the performance would be at least ten times higher than the current FPGA implementation because of the working frequency. In this article, the proposed overlay system provides a programmable interface that virtualizes FPGA resources and lets prospective users focus on high-level software programming.
Citations: 3
A Novel SLM-Based Virtual FPGA Overlay Architecture
Theingi Myint, M. Amagasaki, Qian Zhao, M. Iida, M. Kiyama
To implement virtual field-programmable gate array (vFPGA) layers on physical devices, FPGA overlay technologies have been introduced to provide inter-FPGA bitstream compatibility. Conventional LUT-based vFPGA overlay architectures have very large resource overheads because LUT resource requirements grow as O(2^k) with the number of inputs, k. In this paper, we propose a novel SLM-based vFPGA overlay architecture that employs our previously proposed scalable logic module (SLM) as a logic cell. SLMs can cover the most frequently used logic functions with far fewer hardware resources than LUTs. Evaluation results show that a 6-input SLM-based vFPGA can reduce LUT and flip-flop resource usage by up to 21% and 21% on an Artix-7 FPGA, a Kintex-7 FPGA, and a Kintex UltraScale+ FPGA, compared to a LUT-based vFPGA of the same input size. Similarly, a 7-input SLM-based vFPGA can reduce LUT and flip-flop resource usage by up to 32% and 35% on an Artix-7 FPGA, 30% and 35% on a Kintex-7 FPGA, and 30% and 35% on a Kintex UltraScale+ FPGA, respectively, compared to a LUT-based vFPGA of the same input size. The delay of the SLM-based vFPGA overlay architectures is almost the same as that of the LUT-based vFPGA overlay architectures.
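The O(2^k) growth that motivates the SLM follows directly from the size of a LUT's truth table; a quick sketch:

```python
def lut_config_bits(k: int) -> int:
    """A k-input LUT stores one truth-table entry per input combination,
    so its configuration memory grows as 2**k, the O(2^k) scaling that
    motivates replacing plain LUTs with more compact SLM logic cells."""
    return 2 ** k

for k in range(4, 8):
    print(k, lut_config_bits(k))  # 16, 32, 64, 128 bits
```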
Citations: 0
A Low-Latency and Flexible TDM NoC for Strong Isolation in Security-Critical Systems
M. Alonso, J. Flich, M. Turki, D. Bertozzi
Shared security-critical systems are typically organized as a set of domains that must be kept separate. The network-on-chip (NoC) is key to delivering strong domain isolation, since many of its internal resources are shared between packets from different domains; therefore time-division multiplexing (TDM) is often implemented to avoid any form of interference. Prior approaches to TDM-based scheduling of NoCs lose relevance when they are challenged with conflicting requirements of latency optimization, area efficiency, architectural flexibility and fast reconfigurability. In many cases, aggressive latency optimizations are performed at the cost of timing channel protection. In this paper, we propose a new scheduling approach of time slots in 2D-mesh TDM NoCs that follows directly from the properties of the Channel Dependency Graph. As a result, the isolation-performance trade-off is consistently improved with respect to state-of-the-art solutions across the domain configuration space. When combined with a new token-based mechanism to dispatch scheduling directives, our approach enables the effective reconfiguration of the number of domains, unlike the static nature of most previous proposals.
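The slot-based isolation the paper builds on can be sketched as a static TDM slot table in which every slot is owned by exactly one domain. This is a toy illustration of the isolation idea only; the paper derives its actual schedule from the properties of the Channel Dependency Graph, which this sketch does not model:

```python
def build_slot_table(domains, period):
    """Static round-robin TDM slot table: each time slot is owned by
    exactly one domain, so packets from different domains can never
    contend for a shared router resource in the same slot, and hence
    cannot interfere with (or time) one another."""
    if period % len(domains):
        raise ValueError("period must be a multiple of the domain count")
    return [domains[t % len(domains)] for t in range(period)]

# Two isolated domains sharing a router over an 8-slot TDM period
table = build_slot_table(["secure", "normal"], period=8)
print(table)
```

Reconfiguring the number of domains then amounts to installing a new slot table, which is the kind of flexibility the token-based dispatch mechanism in the paper enables.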
Citations: 5
A Traffic-Robust Routing Algorithm for Network-on-Chip Systems
Siying Xu, M. Meyer, Xin Jiang, Takahiro Watanabe
Network-on-chip (NoC) has been proposed as a better interconnection method than the bus architecture. Recently, a large number of routing algorithms have been proposed to improve network performance, but they usually show their benefits only under particular traffic patterns. However, traffic patterns are generally unknown in advance and vary from application to application because of the behavioral diversity between inter-core and memory-access communication. In this paper, a local traffic-pattern detection mechanism is proposed to identify the current traffic pattern, including uniform, transpose, hotspot, and real workloads, after which the routing algorithm is switched to the most suitable one according to the detection result. Experimental results show that the traffic pattern can be detected accurately. For the hotspot traffic pattern, the success rate of the detector reaches 100 percent when the hotspot percentage is larger than 8 percent. With the help of the proposed traffic-robust routing algorithm, the network can always work with a more suitable routing algorithm and achieve better performance.
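The hotspot check described above can be sketched as a simple destination-frequency classifier. The threshold default mirrors the abstract's 8-percent break-even point, but the sampling window is illustrative and the full detector's uniform/transpose classification is omitted:

```python
from collections import Counter

def detect_hotspot(destinations, threshold=0.08):
    """Classify observed traffic as 'hotspot' when any single destination
    receives more than `threshold` of the sampled packets. The 0.08
    default mirrors the abstract's 8-percent break-even point; a real
    detector would also distinguish uniform and transpose patterns."""
    counts = Counter(destinations)
    top_share = max(counts.values()) / len(destinations)
    return "hotspot" if top_share > threshold else "other"

# 20% of sampled packets target node 0 -> classified as hotspot traffic
pattern = [0] * 20 + list(range(1, 81))
print(detect_hotspot(pattern))            # → hotspot
print(detect_hotspot(list(range(100))))   # → other
```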
Citations: 2
Enhanced ID Authentication Scheme Using FPGA-Based Ring Oscillator PUF
Van-Toan Tran, Quang-Kien Trinh, Van‐Phuc Hoang
The FPGA-based ring oscillator (RO) PUF is very popular for its unique properties and easy implementation. However, such designs are normally expensive, and the RO frequency is highly sensitive to operating conditions and other types of global variation. In addition, the local variations are highly correlated, which normally requires a complex identification (ID) extraction algorithm and/or a large number of ROs. In this work, using statistical analysis, we show experimentally that RO frequencies are very sensitive to global variation factors. Fortunately, the local process variations within a die are relatively consistent regardless of the operating condition, and this can be used for unique ID extraction. Furthermore, we propose an ID authentication scheme using an FPGA-based RO PUF. Our proposed scheme fully extracts the local variation characteristics using an almost technology- and vendor-agnostic PUF circuit. In addition, the ID extraction circuit is kept simple and compact, so the overall design is area- and energy-efficient. The experimental results show a very good level of reliability (99.94%) for a design of 32 ROs in different physical FPGAs.
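One common way to turn global-variation-sensitive RO frequencies into stable ID bits is pairwise comparison, which cancels shifts that move all ROs together. This is a sketch of the classic RO-PUF readout, not necessarily the paper's enhanced scheme, and the frequency values are hypothetical:

```python
def extract_id_bits(freqs):
    """Derive ID bits by comparing disjoint pairs of RO frequencies:
    bit i is 1 if the first RO of pair i is the faster one. Because a
    global voltage/temperature shift moves all ROs together, the
    pairwise comparison keeps only the die-local mismatch that the
    abstract relies on for unique ID extraction. This is the classic
    RO-PUF readout; the paper's enhanced scheme may differ in detail."""
    return [int(freqs[2 * i] > freqs[2 * i + 1]) for i in range(len(freqs) // 2)]

# Eight hypothetical measured frequencies (MHz) yield four ID bits
bits = extract_id_bits([101.3, 100.9, 99.7, 100.2, 102.1, 101.8, 100.0, 100.4])
print(bits)  # → [1, 0, 1, 0]
```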
Citations: 2
Distributed O(N) Linear Solver for Dense Symmetric Hierarchical Semi-Separable Matrices
Chenhan D. Yu, Severin Reiz, G. Biros
We present a distributed-memory algorithm for the approximate hierarchical factorization of symmetric positive definite (SPD) matrices. Our method is based on the distributed-memory GOFMM, an algorithm that appeared in SC18 (doi:10.1109/SC.2018.00018). GOFMM constructs a hierarchical matrix approximation of an arbitrary SPD matrix, compressing it by creating low-rank approximations of the off-diagonal blocks. The GOFMM method has no guarantee of success for arbitrary SPD matrices. (This is similar to the SVD; not every matrix admits a good low-rank approximation.) But for many SPD matrices, GOFMM does enable compression that yields fast matrix-vector multiplication, reaching O(N log N) time as opposed to the O(N^2) required for a dense matrix. GOFMM supports shared- and distributed-memory parallelism. In this paper, we build an approximate "ULV" factorization based on the Hierarchically Semi-Separable (HSS) compression of GOFMM. This factorization requires O(N) work (given the compressed matrix) and O(N/p) + O(log p) time on p MPI processes (assuming a hypercube topology). The previous state of the art required O(N log N) work. We present the factorization algorithm, discuss its complexity, and present weak and strong scaling results for the "factorization" and "solve" phases of our algorithm. We also discuss the performance of the inexact ULV factorization as a preconditioner for a few exemplary large dense linear systems. In our largest run, we were able to factorize a 67M-by-67M matrix in less than one second and solve a system with 64 right-hand sides in less than one-tenth of a second. This run used 6,144 Intel "Skylake" cores on the SKX partition of the Stampede2 system at the Texas Advanced Computing Center.
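The low-rank off-diagonal compression at the heart of this family of hierarchical approximations can be sketched with a truncated SVD. The kernel, cluster geometry, and rank below are illustrative only:

```python
import numpy as np

def compress_offdiag(B, rank):
    """Truncated-SVD low-rank factors of an off-diagonal block, the
    basic compression step behind hierarchical (HSS-style) matrix
    formats: B is approximated by U @ V.T with only `rank` columns."""
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    return U[:, :rank] * s[:rank], Vt[:rank].T

# An off-diagonal block of a smooth kernel between two well-separated
# point clusters is numerically low-rank (kernel and sizes illustrative)
rng = np.random.default_rng(0)
x, y = rng.uniform(0, 1, 64), rng.uniform(10, 11, 64)
B = 1.0 / np.abs(x[:, None] - y[None, :])
U, V = compress_offdiag(B, rank=5)
err = np.linalg.norm(B - U @ V.T) / np.linalg.norm(B)  # tiny relative error
```

Storing U and V instead of B replaces 64 x 64 entries with two 64 x 5 factors, which is what makes the subsequent O(N) factorization work possible.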
Citations: 6
Towards an Efficient Hardware Architecture for Odd-Even Based Merge Sorter
Elsayed A. Elsayed, Kenji Kise
Sorting is widely used in many practical applications such as searching and databases. This paper proposes two improved FPGA-based merge-sorter architectures that use fewer hardware resources than the state-of-the-art. For instance, with 64 sorted records output per cycle, implementation results for our first proposal show an improvement in the required number of flip-flops (FFs) and look-up tables (LUTs) of 84.4% and 77.7%, respectively, over the state-of-the-art. In addition, the throughput of our merge sorter is 1.065x higher than that of the state-of-the-art. As for the second proposal, significant improvements of 66.3% and 84.6% are achieved for the needed FFs and LUTs, respectively. Moreover, while our second proposed merge sorter uses significantly fewer resources, it achieves about 95.9% of the performance of the state-of-the-art merge sorter.
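A software reference model of the odd-even merging the title refers to is Batcher's odd-even merge network, sketched here for equal power-of-two input lengths (the paper's hardware architectures themselves are not reproduced here):

```python
def odd_even_merge(a, b):
    """Batcher's odd-even merge of two sorted lists of equal power-of-two
    length: recursively merge the even- and odd-indexed subsequences,
    then fix up with one layer of compare-exchange operations. The
    data-independent compare-exchange pattern is what makes this family
    of networks attractive for hardware merge sorters."""
    if len(a) == 1:  # base case: a single compare-exchange
        return sorted(a + b)
    evens = odd_even_merge(a[::2], b[::2])
    odds = odd_even_merge(a[1::2], b[1::2])
    merged = [None] * (len(a) + len(b))
    merged[::2], merged[1::2] = evens, odds
    for i in range(1, len(merged) - 1, 2):  # final compare-exchange layer
        if merged[i] > merged[i + 1]:
            merged[i], merged[i + 1] = merged[i + 1], merged[i]
    return merged

print(odd_even_merge([1, 4, 6, 7], [2, 3, 5, 8]))  # → [1, 2, 3, 4, 5, 6, 7, 8]
```

In hardware, every comparison above becomes a compare-exchange element, so the FF/LUT counts the paper reports correspond directly to the size of this network.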
Cited by: 8
Tumour Detection using Convolutional Neural Network on a Lightweight Multi-Core Device
T. Teo, Weihao Tan, Y. Tan
Convolutional neural networks (CNNs) are the main driving force behind image classification and are widely used. Mimicking the human brain for image classification demands large amounts of processing power and high computational complexity, which can result in large, bulky systems; forgoing that power, while possible, limits the use cases and constrains the functionality that can be implemented. One solution is to explore the use of a Multicore System-on-Chip (MCSoC), although CNNs are commonly built on Graphics Processing Unit (GPU) based machines. In this paper, we reduce the overall size of a CNN while retaining a satisfactory level of accuracy, so that it is better suited for deployment in an MCSoC environment. We trained a CNN model that was validated on detecting malignant tumour cells. The results show a significant boost in practicality, in the form of faster inference times and smaller model parameter sizes, allowing neural networks to be deployed in an environment that would otherwise seem impractical. Efficient inference networks on lightweight systems can serve as an inexpensive and physically small alternative to existing Artificial Intelligence (AI) systems, which are generally costly, bulky, and power hungry.
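The size/accuracy trade-off described above can be made concrete by counting parameters. The following Python sketch tallies the weights of a small convolutional stack layer by layer; the layer shapes are hypothetical placeholders (the paper does not publish its exact architecture here), but they show how halving the channel widths cuts the parameter count, and hence the deployable model size, roughly fourfold.

```python
def conv2d_params(c_in, c_out, k):
    """Parameters of a k x k convolution: one k*k*c_in kernel per
    output channel, plus one bias per output channel."""
    return c_out * (c_in * k * k + 1)

def dense_params(n_in, n_out):
    """Parameters of a fully connected layer, plus one bias per output."""
    return n_out * (n_in + 1)

def model_params(channels, fc_in, n_classes, k=3):
    """Total parameters of a plain conv stack plus one classifier layer.

    channels: e.g. [3, 32, 64] means a 3->32 and a 32->64 convolution.
    """
    total = sum(conv2d_params(a, b, k) for a, b in zip(channels, channels[1:]))
    return total + dense_params(fc_in, n_classes)

# Hypothetical baseline vs. a slimmed variant with halved channel widths.
baseline = model_params([3, 64, 128, 256], fc_in=256, n_classes=2)
slim     = model_params([3, 32, 64, 128], fc_in=128, n_classes=2)
print(f"baseline: {baseline:,} params ({baseline * 4 / 1e6:.2f} MB as fp32)")
print(f"slim:     {slim:,} params ({slim * 4 / 1e6:.2f} MB as fp32)")
```

Since the parameter count of a conv layer is quadratic in channel width, width reduction is one of the cheapest levers for fitting a model into the memory budget of an embedded multi-core device.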
DOI: https://doi.org/10.1109/MCSoC.2019.00020
Cited by: 5
Efficient Search-Space Encoding for System-Level Design Space Exploration of Embedded Systems
Valentina Richthammer, M. Glaß
For Design Space Exploration (DSE) of embedded systems as a combinatorial Multi-Objective Optimization Problem (MOP), metaheuristic optimization approaches are typically employed to determine high-quality solutions within limited optimization time. This requires encoding the implementations from the design space in a search space that represents the degrees of freedom available to the optimization approach. Determining an encoding that ensures all design constraints are met by construction is, however, impossible for multi-/many-core DSE problems, so the search space contains infeasible solutions. While state-of-the-art DSE techniques repair infeasible solutions, little to no attention has been paid to the efficiency of the resulting encoding with respect to its suitability for the employed optimization approach. Therefore, we formally define requirements for an efficient search space and analyze the drawbacks of automatically generated inefficient encodings. We furthermore present efficient search-space encodings for a state-of-the-art hybrid optimization approach suitable for a wide range of MOPs. The proposed encodings significantly reduce the required degree of repair, allowing us to introduce a feedback loop from repaired solutions in the design space to the respective encoded solutions in the efficient search space to further improve the optimization. The positive effects of the proposed efficient encoding and design-space feedback are demonstrated for system-level DSE using benchmarks from the domains of embedded many-core and networked automotive systems. Compared to inefficient search spaces from the literature, significant enhancements in both optimization quality and time are observed. Furthermore, we propose metrics to quantify search-space efficiency, which provide novel insights into the interdependence of search space and design space for multi-/many-core DSE.
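The repair-plus-feedback idea can be illustrated in a few lines. The sketch below uses hypothetical names and a toy constraint, not the paper's implementation: a task-to-core mapping is the genotype, any assignment violating a feasibility constraint is repaired during decoding, and the repaired phenotype is then written back into the genotype — the feedback loop that reduces how much repair later iterations need.

```python
def repair(mapping, allowed):
    """Move every task mapped to a core outside its allowed set onto
    that task's first allowed core (a deterministic, minimal repair)."""
    return [core if core in allowed[t] else allowed[t][0]
            for t, core in enumerate(mapping)]

def decode_with_feedback(genotype, allowed):
    """Decode genotype -> feasible phenotype, then feed the repaired
    solution back into the search space (in-place genotype update)."""
    phenotype = repair(genotype, allowed)
    genotype[:] = phenotype  # feedback: search continues from the repaired point
    return phenotype

# Allowed cores per task (a toy feasibility constraint).
allowed = [[0, 1], [2], [1, 3]]
genotype = [3, 0, 3]  # infeasible for tasks 0 and 1
phenotype = decode_with_feedback(genotype, allowed)
print(phenotype)              # a feasible mapping
print(genotype == phenotype)  # genotype now encodes the repaired solution
```

Without the write-back, the optimizer keeps exploring around the infeasible genotype and the same repairs are redone every generation; with it, variation operators start from points that already satisfy the constraints.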
DOI: https://doi.org/10.1109/MCSoC.2019.00046
Cited by: 3
Journal
2019 IEEE 13th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)