2019 IEEE 13th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC): Latest Publications


Smart Ontology-Based Event Identification

2019 IEEE 13th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)

Pub Date : 2019-10-01 DOI: 10.1109/MCSoC.2019.00027

Sarika Jain, Archana Patel

Identifying an event and all its attributes helps in responding promptly to emergencies and in making business decisions. Although accurate event identification has been studied over the last decade, less thought has been given to determining actions with context-dependent effects. This paper is motivated by the desire to develop a synergy between the different answers to the same query posed by users of differing priority. The proposed approach exploits semantic technologies to model personalized behavior. We provide a control protocol that recognizes the pattern in the flow of precision as the priority of the user changes. The control protocol is used to define the priority of the user and is exploited in an efficient algorithm to yield good tradeoffs between various attributes of the decision. Both bottom-up and top-down parsing of the ontological knowledge base are depicted, depending on whether the event object is available in the knowledge base. The algorithm is then tested on a real-world use case of terrorist attack events. The algorithm renders varying answers with varying precision based on a balance between the available resources, the required certainty, the required specificity level, and the acceptable threshold value. The proposed control protocol and algorithm proved to be logically sound and seem to be a direct consequence of representing knowledge in a manner that is complete.


Citations: 9

Modular Memory System for RISC-V Based MPSoCs on Xilinx FPGAs

2019 IEEE 13th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)

Pub Date : 2019-10-01 DOI: 10.1109/MCSoC.2019.00017

Ahmed Kamaleldin, Muhammad Ali, P. Rad, Marcus Gottschalk, D. Göhringer

Current application domains, such as mobile robotics and the Internet of Things, require high computational power combined with low energy consumption. Therefore, MPSoCs are widely used as a suitable platform for high-performance embedded computation. Recently, the emergence of the RISC-V instruction set architecture has driven SoC designers to adopt it in the design of MPSoCs as a cost-free, modular processor that is suitable for implementation on different hardware platforms. These characteristics also make RISC-V an interesting candidate for an FPGA soft-core processor. In this paper, we present a modular hybrid memory system for a lightweight RISC-V based MPSoC architecture. The hybrid memory consists of a global on-chip shared scratchpad memory for both instructions and data, used for communication and synchronization between the processing elements (PEs), and a tightly coupled memory associated with each processing element for low-latency access during private computation. Moreover, the complete MPSoC architecture is scalable and configurable in terms of the number of PEs, the shared and private memory sizes, and the number of memory-mapped peripherals. A benchmarking environment is developed to evaluate the performance of the proposed hybrid memory system in terms of memory access latency and memory bandwidth and their impact on computation time. The complete MPSoC architecture is implemented and tested on a Xilinx Zynq 7000 FPGA device.


Citations: 9

An Automatic MPI Process Mapping Method Considering Locality and Memory Congestion on NUMA Systems

2019 IEEE 13th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)

Pub Date : 2019-10-01 DOI: 10.1109/MCSoC.2019.00010

Mulya Agung, Muhammad Alfian Amrizal, Ryusuke Egawa, H. Takizawa

MPI process mapping is an important step to achieve scalable performance on non-uniform memory access (NUMA) systems. Conventional approaches have focused only on improving the locality of communication. However, related studies have shown that on modern NUMA systems, the memory congestion problem could cause more severe performance degradation than the locality problem because a high number of processor cores in the systems can cause heavy congestion on shared caches and memory controllers. To optimize the process mapping, it is necessary to determine the communication behavior of the MPI processes. Previous methods rely on offline profiling to analyze the communication behavior, which incurs a high overhead and is potentially time-consuming. In this paper, we propose a method that automatically performs MPI process mapping for adapting to communication behaviors while considering both locality and memory congestion. Our method works at runtime during the execution of an MPI application. It does not require modifications to the application, previous knowledge of the communication behavior, or changes to the hardware and operating system. The proposed method has been evaluated with the NAS parallel benchmarks on a NUMA system. Experimental results show that our method can achieve performance close to an oracle-based mapping method with low overhead to the application execution. The performance improvement is up to 27.4% (13.4% on average) compared with the default mapping of the MPI runtime system.
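
As a rough illustration of the trade-off described in this abstract, the sketch below places processes on NUMA nodes with a simple greedy heuristic: heavily communicating pairs are co-located for locality, while a per-node cap spreads the load to limit memory congestion. This is not the authors' algorithm. The function name `greedy_map`, the communication matrix, and the per-node capacity are all hypothetical.

```python
from itertools import combinations


def greedy_map(comm, num_nodes, capacity):
    """Assign each process to a NUMA node (hypothetical heuristic).

    comm[i][j] is the communication volume between processes i and j.
    capacity caps processes per node, standing in for a congestion limit.
    """
    n = len(comm)
    if n > num_nodes * capacity:
        raise ValueError("not enough capacity for all processes")
    placement = {}              # process -> node
    load = [0] * num_nodes      # number of processes placed on each node

    def place(proc, preferred=None):
        if proc in placement:
            return
        # Prefer the partner's node (locality) if it still has room,
        # otherwise fall back to the least-loaded node (less congestion).
        if preferred is not None and load[preferred] < capacity:
            node = preferred
        else:
            node = min(range(num_nodes), key=lambda k: load[k])
        placement[proc] = node
        load[node] += 1

    # Visit pairs from heaviest to lightest communication volume.
    pairs = sorted(combinations(range(n), 2),
                   key=lambda p: comm[p[0]][p[1]], reverse=True)
    for i, j in pairs:
        if i in placement and j not in placement:
            place(j, placement[i])
        elif j in placement and i not in placement:
            place(i, placement[j])
        else:
            place(i)
            place(j, placement[i])
    for p in range(n):          # covers the single-process case (no pairs)
        place(p)
    return placement


# Hypothetical volumes: processes 0-1 and 2-3 communicate heavily.
comm = [[0, 10, 1, 1],
        [10, 0, 1, 1],
        [1, 1, 0, 8],
        [1, 1, 8, 0]]
mapping = greedy_map(comm, num_nodes=2, capacity=2)
```

With these made-up volumes, each heavily communicating pair ends up on its own node. A runtime method such as the one in the abstract would additionally have to observe the communication pattern while the application runs, which this static sketch does not attempt.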


Citations: 5

A Machine Learning Enabled Long-Term Performance Evaluation Framework for NoCs

2019 IEEE 13th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)

Pub Date : 2019-10-01 DOI: 10.1109/MCSoC.2019.00031

Jie Hou, Qi Han, M. Radetzki

Rapidly increasing transistor density enables the evolution of many-core on-chip systems. Networks-on-Chip (NoCs) are the preferred communication infrastructure for such systems. Technology scaling increases the susceptibility of NoC components to failures. However, a NoC with failed components can still operate, at the cost of degraded performance. Therefore, it is not sufficient to analyze the performance and reliability of a NoC separately. In this paper, we propose a machine learning enabled performability evaluation framework that treats both aspects together. It applies Markov reward models. In addition, it leverages machine learning techniques to quickly obtain different performance metrics in the presence of faulty routers and under various simulation parameters, a task that is challenging to carry out analytically. We use a mesh-based NoC to demonstrate our methodology. The long-term performance of an 8x8 mesh under XY and fault-tolerant negative-first routing algorithms is evaluated.
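
A Markov reward model attaches a reward (here, a performance value) to each state of a Markov chain that describes how the system degrades. The minimal sketch below uses a discrete-time chain whose states count faulty routers. The transition probabilities and per-state throughput rewards are made-up illustrative numbers, not values from the paper; the abstract suggests that in the proposed framework such performance values are obtained with machine learning techniques rather than written by hand.

```python
def expected_reward(transition, rewards, initial, steps):
    """Expected reward per step of a discrete-time Markov reward model.

    transition[i][j] is the probability of moving from state i to state j,
    rewards[i] is the performance value earned while in state i, and
    initial is the starting state distribution.
    """
    dist = list(initial)
    history = []
    for _ in range(steps):
        # Expected reward under the current state distribution.
        history.append(sum(p * r for p, r in zip(dist, rewards)))
        # Advance the distribution by one step: dist <- dist * transition.
        dist = [sum(dist[i] * transition[i][j] for i in range(len(dist)))
                for j in range(len(dist))]
    return history


# Hypothetical 3-state model: 0, 1, or 2 faulty routers (state 2 absorbing).
P = [[0.9, 0.1, 0.0],
     [0.0, 0.8, 0.2],
     [0.0, 0.0, 1.0]]
throughput = [1.0, 0.7, 0.4]   # illustrative normalized throughput per state
curve = expected_reward(P, throughput, [1.0, 0.0, 0.0], steps=50)
```

The resulting `curve` starts at the fault-free throughput and decays toward the reward of the absorbing state, which is the kind of long-term performance trend a performability analysis reports.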


Citations: 2

A System Delay Monitor Exploiting Automatic Cell-Based Design Flow and Post-Silicon Calibration

2019 IEEE 13th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)

Pub Date : 2019-10-01 DOI: 10.1109/MCSoC.2019.00012

Hayate Okuhara, Ryosuke Kazami, H. Amano

In this work, we present a low-overhead performance monitor that emulates the maximum operating frequency of a target system by means of a delay chain, enabling efficient adaptive voltage control. The proposed monitor can be built entirely from logic cells provided by general PDKs; thus, an automatic cell-based design flow can be used for its implementation. In addition, interconnect delay behavior can be imitated by exploiting automatically routed wires. To validate our concept, the proposed monitor is fabricated in a 65-nm Fully Depleted Silicon on Insulator (FD-SOI) technology. Real-chip experiments reveal that the automated layout design achieves reasonable delay emulation. Indeed, when the maximum operating frequency of a CNN accelerator is emulated, the proposed system delay monitor (SDM) achieves a performance tracking error of several percent. Its power overhead is also only a few percent.


Citations: 0

A Preliminary Evaluation of Building Block Computing Systems

2019 IEEE 13th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)

Pub Date : 2019-10-01 DOI: 10.1109/MCSoC.2019.00051

Sayaka Terashima, Takuya Kojima, Hayate Okuhara, Kazusa Musha, H. Amano, Ryuichi Sakamoto, Masaaki Kondo, M. Namiki

A building block computing system with an inductive-coupling Through Chip Interface (TCI) consists of a 3-D stack of small dedicated chips. By changing the combination of stacked chips, various types of systems can be built. A MIPS R3000 compatible processor (GeyserTT), a neural network accelerator (SNACC), and a shared memory for building a twin tower of chips (SMTT) have been developed in a Renesas 65 nm low-leakage CMOS process. Each chip provides the TCI IP (Intellectual Property), and an escalator network is formed simply by stacking them. This paper presents the evaluation results of each chip and performance estimates for stacked configurations obtained with an RTL simulator. The performance of the single-tower and twin-tower configurations is estimated by RTL simulation for an implementation of part of AlexNet. The evaluation results show that the single-tower configuration with GeyserTT+SNACC achieves about twice the performance of GeyserTT alone. Experiments on the individual real chips show that all of them operate at 50 MHz or higher with extremely low power consumption. The twin-tower configuration achieves about 2x the performance of the single tower, which is about 6x that of GeyserTT. The power consumption was about 276 mW for the single tower and 496 mW for the twin tower.
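
The figures quoted in this abstract allow a quick back-of-the-envelope comparison of energy efficiency. Taking the stated speedup of about 2x and the two power figures at face value, and assuming (this is not stated in the abstract) that both power figures were obtained for the same workload, the twin-tower configuration would deliver roughly 11% more performance per watt than the single-tower one:

```python
# Values quoted in the abstract; the speedup is approximate ("about 2x").
single_power_mw = 276.0
twin_power_mw = 496.0
twin_speedup_over_single = 2.0

power_ratio = twin_power_mw / single_power_mw                 # about 1.80
perf_per_watt_gain = twin_speedup_over_single / power_ratio   # about 1.11
print(f"power ratio: {power_ratio:.2f}, perf/W gain: {perf_per_watt_gain:.2f}")
```

Because the speedup is only given approximately, the result should be read as an order-of-magnitude check rather than a measured efficiency.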


Citations: 3

A Real-Time Fault-Tolerant and Power-Efficient Multicore System on Chip

2019 IEEE 13th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)

Pub Date : 2019-10-01 DOI: 10.1109/MCSoC.2019.00057

A. M. Gruzlikov, N. Kolesov, D. Kostygov, M. Tolmacheva

An approach to designing fault-tolerant and power-efficient multicore systems on chip for real-time information processing and control is proposed. It is assumed that the multicore system has a reserve of resources on the chip, allowing for additional information processing. The approach is based on rules for introducing redundancy that aim to reduce power consumption and on the principles of system-level fault diagnosis, which make it possible to decentralize system recovery in case of failure.


Citations: 0

A STDM (Static Time Division Multiplexing) Switch on a Multi-FPGA System

2019 IEEE 13th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)

Pub Date : 2019-10-01 DOI: 10.1109/MCSoC.2019.00053

Keita Azegami, Kazusa Musha, Kazuei Hironaka, Akram Ben Ahmed, M. Koibuchi, Yao Hu, H. Amano

FPGAs are promising accelerators for MEC (Multi-access Edge Computing), which provides timing-critical services to a number of terminals from base stations near the network edge. Although a high-end FPGA can support fixed-latency computation with relatively low power consumption, such devices are expensive, and the available acceleration circuits are limited to the size of a single FPGA. FiC (Flow-in-Cloud) has been developed to build a large virtual FPGA from a number of economical mid-range FPGAs connected by high-speed serial links. Although the current target of FiC is cloud computing, it is even better suited to future MEC, because a large amount of hardware resources can be provided at low cost. One problem in using such multi-FPGA systems for timing-critical computation is network uncertainty. With common packet switching, the computation speed is influenced by the network traffic. That is, the fixed-latency computation that a single FPGA can support is hard to achieve on multi-FPGA systems that use common packet-switching networks. To address this problem, we introduced an STDM (Static Time Division Multiplexing) switch into the FiC system. Since STDM always provides a constant communication latency, the transfer time can be estimated beforehand. Implementing the STDM switch on the FPGA board for FiC showed that the switch uses less than 14% of the LUTs. The required number of slots is less than 16 even for a system with 256 nodes. We implemented the Conjugate Gradient method, which includes all-to-all communication, on a 4x2 FiC system. It achieved a 17.9x performance improvement over a 6-core Intel E5-2667 2.90 GHz CPU.
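
The fixed-latency property of static time division multiplexing can be illustrated with a small slot table. In the sketch below, each output port of a switch has a cyclic table of `num_slots` entries naming which input port may use it in each time slot, so the wait before a transfer depends only on the injection time and the table, never on other traffic. The table layout and the names `build_slot_table` and `wait_cycles` are assumptions made for illustration; they are not taken from the FiC implementation.

```python
def build_slot_table(flows, num_slots):
    """Statically assign one slot per (input, output) flow.

    Returns {output_port: [input_port or None] * num_slots}. Flows that
    share an output port get different slots, so they never collide.
    """
    table = {}
    for inp, out in flows:
        slots = table.setdefault(out, [None] * num_slots)
        if None not in slots:
            raise ValueError(f"output {out} needs more than {num_slots} slots")
        slots[slots.index(None)] = inp   # first free slot on this output
    return table


def wait_cycles(table, inp, out, now):
    """Cycles a flit from inp waits, starting at cycle now, for its slot."""
    slots = table[out]
    for delay in range(len(slots)):
        if slots[(now + delay) % len(slots)] == inp:
            return delay
    raise KeyError(f"no slot reserved for {inp} -> {out}")


# Three inputs share output 0 with a 4-slot table.
table = build_slot_table([(1, 0), (2, 0), (3, 0)], num_slots=4)
worst = max(wait_cycles(table, 2, 0, t) for t in range(8))
```

In this toy configuration the wait for any flow is always below `num_slots` cycles, which is why the transfer time can be bounded before the application runs. A packet-switched network offers no such bound, because the wait depends on what other nodes happen to send.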


Citations: 11

Statistical Analysis for Shared Resources Effects with Multi-Core Real-Time Systems

2019 IEEE 13th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)

Pub Date : 2019-10-01 DOI: 10.1109/MCSoC.2019.00058

Julien Durand, Y. Bouchebaba, L. Santinelli

Today's multi-core and many-core COTS platforms make a large amount of computational resources available to real-time applications. While they aim to increase performance for real-time applications, the challenge they pose is guaranteeing timing constraints. Real-time modeling and analysis thus face shared resources, optimization mechanisms, and sophisticated functionalities, which combine into complex system dynamics that are extremely costly to characterize. This paper proposes a measurement-based approach and a statistical analysis for defining average and worst-case models of task executions under different possible execution conditions. The framework is formalized and then used to investigate different families of shared-resource interference effects occurring on multi-core platforms; these effects are quantified with statistical metrics applied to measurements of task execution times. The work focuses on effects due to shared memories on the NXP T4240 multi-core platform with the PikeOS hypervisor. A set of experiments is conducted to validate the proposed framework.
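
A measurement-based analysis of this kind typically starts by summarizing execution-time samples collected under each execution condition. The sketch below computes an average-case figure and an empirical high-quantile bound per scenario, then an interference ratio between a contended and an isolated run. The sample values, the scenario names, and the choice of the 99th percentile are illustrative assumptions; the abstract does not specify which statistical metrics the paper actually uses.

```python
import math
import statistics


def summarize(samples, quantile=0.99):
    """Average, observed maximum, and empirical quantile of execution times."""
    ordered = sorted(samples)
    # Nearest-rank empirical quantile: smallest value covering the fraction.
    rank = max(1, math.ceil(quantile * len(ordered)))
    return {
        "mean": statistics.mean(ordered),
        "max": ordered[-1],
        "quantile": ordered[rank - 1],
    }


def interference_ratio(isolated, contended, key="mean"):
    """Slowdown of a task when co-runners contend for shared resources."""
    return summarize(contended)[key] / summarize(isolated)[key]


# Hypothetical execution times in microseconds for one task.
isolated = [100, 102, 101, 99, 103, 100, 101, 102, 100, 102]
contended = [130, 128, 135, 150, 129, 131, 133, 170, 132, 132]
print(summarize(isolated))
print(interference_ratio(isolated, contended))
```

An observed maximum or empirical quantile only bounds what was measured. A worst-case model in the sense of the abstract would need a statistical argument on top of it, which is outside the scope of this sketch.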


Citations: 5

Designing Application-Specific Heterogeneous Architectures from Performance Models

2019 IEEE 13th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)

Pub Date : 2019-10-01 DOI: 10.1109/MCSoC.2019.00045

Thanh Cong, François Charot

In this paper, we propose an approach for designing application-specific heterogeneous systems based on performance models that combine accelerator and processor core models. An application-specific program is profiled through its dynamic execution trace, which is used to construct a data flow model of the accelerator. Modeling of the processor is partitioned into an instruction set architecture (ISA) execution model and a micro-architecture-specific timing model. These models are implemented on FPGAs to take advantage of their parallelism and to speed up simulation as architecture complexity increases. This approach aims to ease the design of multi-core, multi-accelerator architectures and consequently helps explore the design space by automating the design steps. A case study is conducted to confirm that the presented design flow can model an accelerator starting from an algorithm and validate its integration in a simulation framework, allowing performance to be estimated precisely. We also assess the performance of our RISC-V single-core and RISC-V-based heterogeneous architecture models.


Citations: 2
