
2018 IEEE 12th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC): Latest Papers

Parity-Based ECC and Mechanism for Detecting and Correcting Soft Errors in On-Chip Communication
K. Dang, Xuan-Tu Tran
Soft errors are expected to become more frequent as feature sizes shrink, owing to low operating voltages and high circuit density. However, the soft error rate per bit is expected to decrease with technology scaling. Given tight area and energy budgets, on-chip communication needs a low-complexity, high-code-rate error correction code (ECC) to handle soft errors. In this work, we use a Parity Product Code (PPC) and propose several supporting mechanisms to detect and correct soft errors. First, PPC can work as a parity check to detect a single event upset (SEU) inside each flit. Then, to reduce the number of retransmissions, a Razor flip-flop with parity check (RFF-w-P) is proposed to work with PPC. Since PPC can act as forward error correction (FEC), we also present selective transmission by bit index using a transposable FIFO. The proposed mechanism therefore not only guarantees single-error detection and correction but also provides correction of two or more errors as FEC. The proposed work also reduces the FIFO area cost compared with traditional coding methods and adapts to multiple error rates.
DOI: https://doi.org/10.1109/MCSoC2018.2018.00035 (published 2018-09-01)
Citations: 9
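The product-of-parities idea mentioned in the abstract of "Parity-Based ECC and Mechanism for Detecting and Correcting Soft Errors in On-Chip Communication" can be illustrated with a small software sketch. This is not the authors' hardware design: it assumes a generic two-dimensional parity layout in which each flit carries one parity bit and one extra parity flit holds the column parities, and the function names (`encode`, `correct_single_error`) are hypothetical.

```python
from functools import reduce
from operator import xor


def parity(bits):
    """Even-parity bit of a sequence of 0/1 values."""
    return reduce(xor, bits, 0)


def encode(flits):
    """Append a parity bit to each flit (row) and a final parity flit (column parities)."""
    rows = [list(f) + [parity(f)] for f in flits]
    parity_flit = [parity(col) for col in zip(*rows)]
    return rows + [parity_flit]


def correct_single_error(block):
    """Flip a single erroneous bit in place; return its (row, column) or None if clean."""
    bad_rows = [i for i, row in enumerate(block) if parity(row)]
    bad_cols = [j for j, col in enumerate(zip(*block)) if parity(col)]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        i, j = bad_rows[0], bad_cols[0]
        block[i][j] ^= 1  # the failing row and column intersect at the flipped bit
        return (i, j)
    raise ValueError("more than one error: outside the scope of this sketch")
```

A failing row parity alone already detects an upset inside one flit; the column parities are what allow the receiver to locate and flip the bit without asking for a retransmission.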
Designing Compact Convolutional Neural Network for Embedded Stereo Vision Systems
Mohammad Loni, Amin Majd, A. Loni, M. Daneshtalab, Mikael Sjödin, E. Troubitsyna
Autonomous systems are used in a wide range of domains, from indoor appliances to autonomous robotic surgery and self-driving cars. Stereo vision cameras are probably the most flexible sensors in these systems, since they can extract depth, luminance, color, and shape information. However, stereo-vision-based applications suffer from huge image sizes and high computational complexity, which lead to higher power consumption. To tackle these challenges, we first employ the GIMME2 stereo vision system [1], a high-throughput, cost-efficient FPGA-based embedded stereo vision system. Next, we present a framework for designing an optimized Deep Convolutional Neural Network (DCNN) for time-constrained applications and/or platforms with limited resource budgets. Our framework tries to automatically generate a highly robust DCNN architecture for image data received from stereo vision cameras. It uses a multi-objective evolutionary optimization approach to design a near-optimal network architecture with respect to both accuracy and network size. Unlike recent works that aim only at a highly accurate network, we also consider network size parameters to build a highly compact architecture. After designing a robust network, the framework maps the generated network onto a multi/many-core heterogeneous System-on-Chip (SoC). In addition, we have integrated our framework into the GIMME2 processing pipeline so that it can also estimate the distance of detected objects. The network generated by our framework offers up to a 24x compression rate while losing only 5% accuracy compared to the best result on the CIFAR-10 dataset.
DOI: https://doi.org/10.1109/MCSoC2018.2018.00049 (published 2018-09-01)
Citations: 9
A Fuzzy-Based Approach for Modelling Preferences of Users in Multi-Criteria Recommender Systems
Mohamed Hamada, N. Odu, Mohammed Hassan
Recommender systems (RSs) are web-based tools that use various machine learning and filtering methods to propose useful items to users. Several techniques have been used to develop such systems for generating a list of useful recommendations. Traditionally, RSs use a single rating to represent a user's preference for an item. Multi-criteria recommendation is a newer technique that recommends items to users based on multiple attributes of the items. This technique has been used to solve many recommendation problems, and its predictive performance has been tested and shown to be better than that of the traditional approach. This paper presents a model based on the architecture and main features of fuzzy sets and systems. Fuzzy logic (FL) is widely known for its application in different fields of study; its main advantages are that it does not need a lot of training data and that it can incorporate human heuristics into the computer-assisted decision-making process. FL is highly applicable in the domain of RSs. The proposed study tests and reports the predictive performance of the fuzzy-based multi-criteria technique and compares it with a single-rating RS. Experimental results on real-world datasets from Yahoo! Movies show that the proposed technique remarkably improves the accuracy of the system.
DOI: https://doi.org/10.1109/MCSoC2018.2018.00026 (published 2018-09-01)
Citations: 9
On the Complexity of Mapping Feasibility in Many-Core Architectures
T. Schwarzer, Sascha Roloff, Valentina Richthammer, R. Khaldi, S. Wildermann, M. Glaß, J. Teich
Many-core architectures enable the concurrent execution of multiple application programs. In this context, the well-known problem of feasibly mapping applications, i.e., their tasks and communication, to such architectures has gained importance due to the large number of cores and limited inter-processor communication capacities. This challenge is tackled by so-called Hybrid Application Mapping (HAM) approaches. These combine a design-time analysis, which extracts sets of mapping constraints that characterize feasible or optimal mappings, with the runtime determination of a concrete mapping that depends on these mapping constraints and on the set of currently available resources. A major strength of HAM approaches is their ability to give real-time and other guarantees for statically characterized application programs even in highly dynamic workload scenarios, while avoiding the pessimism of static resource partitioning. However, finding a feasible mapping is an NP-complete problem. This work discusses the resulting implications for HAM approaches in general and investigates two exact techniques for solving the mapping constraints at runtime in particular: (I) a problem-specific backtracking approach, and (II) an approach that adopts a general-purpose SAT solver. Experimental results show that the overhead of the general-purpose solver, in particular for processing and solving the required SAT formulation, becomes significant, whereas the problem-specific backtracking technique achieves significantly lower execution times.
DOI: https://doi.org/10.1109/MCSoC2018.2018.00038 (published 2018-09-01)
Citations: 6
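To make the backtracking option from the abstract of "On the Complexity of Mapping Feasibility in Many-Core Architectures" more concrete, the following is a minimal sketch of a constraint-checking backtracking search. The constraint model is an assumption chosen for illustration (a required core type per task and a maximum hop distance per communication link on a 2D mesh); it is not the constraint format used by the authors, and `find_mapping` is a hypothetical name.

```python
def manhattan(a, b):
    """Hop distance between two mesh coordinates."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def find_mapping(tasks, links, cores):
    """tasks: {task: required core type}
    links: {(task_a, task_b): maximum hop distance}
    cores: {(x, y): core type} for the currently free cores
    Returns {task: (x, y)} or None if no feasible mapping exists."""
    order = list(tasks)
    mapping = {}

    def consistent(task, core):
        # Core must have the right type and must not be taken already.
        if cores[core] != tasks[task] or core in mapping.values():
            return False
        # Every link to an already placed neighbour must respect its hop bound.
        for (a, b), max_hops in links.items():
            other = b if a == task else a if b == task else None
            if other in mapping and manhattan(core, mapping[other]) > max_hops:
                return False
        return True

    def backtrack(i):
        if i == len(order):
            return True
        task = order[i]
        for core in cores:
            if consistent(task, core):
                mapping[task] = core
                if backtrack(i + 1):
                    return True
                del mapping[task]  # undo and try the next core
        return False

    return dict(mapping) if backtrack(0) else None
```

The search is exact: it returns a mapping whenever one exists, at the price of exponential worst-case time, which is consistent with the NP-completeness noted in the abstract.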
VLSI Design of Floating-Point Twiddle Factor Using Adaptive CORDIC on Various Iteration Limitations
Trong-Thuc Hoang, Duc-Hung Le, C. Pham
This paper proposes a design for a 32-bit floating-point Fast Fourier Transform (FFT) Twiddle Factor (TF). The architecture is based on the adaptive COordinate Rotation DIgital Computer (CORDIC) algorithm. The CORDIC method is a well-known approach for approximating the complex-number multiplication in FFT implementations, also known as TF. Adaptive CORDIC computes its result iteratively, so limiting the number of iterations trades accuracy for a higher throughput rate. Accordingly, three different FFT TF implementations are presented in this paper: TF-4, TF-8, and TF-16, which limit the TF design to four, eight, and 16 iterations, respectively. The results of the three implementations are reported at both the Field Programmable Gate Array (FPGA) and Application-Specific Integrated Circuit (ASIC) levels. The FPGA results were obtained on the Altera Stratix IV development kit, and the ASIC results were reported by the Synopsys tools with the Silicon On Thin Buried-oxide (SOTB) 65nm process library.
DOI: https://doi.org/10.1109/MCSoC2018.2018.00044 (published 2018-09-01)
Citations: 2
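The accuracy-versus-iterations trade-off described in the abstract of "VLSI Design of Floating-Point Twiddle Factor Using Adaptive CORDIC on Various Iteration Limitations" can be seen in a short software model. The sketch below assumes the conventional rotation-mode CORDIC, not the adaptive variant used in the paper, and works in ordinary Python floats rather than a 32-bit hardware datapath; `cordic_twiddle` is a hypothetical name.

```python
import math


def cordic_twiddle(theta, iterations):
    """Approximate exp(-j*theta) as (cos(theta), -sin(theta)) for |theta| <= pi/2,
    using a fixed number of conventional CORDIC micro-rotations."""
    x, y, z = 1.0, 0.0, theta
    gain = 1.0
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0          # rotate towards zero residual angle
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * math.atan(2.0 ** -i)
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))  # each micro-rotation stretches the vector
    return x / gain, -y / gain
```

With `iterations` set to 4, 8, or 16, the residual angle is bounded by atan(2^-3), atan(2^-7), and atan(2^-15) respectively, so fewer iterations mean a shorter computation but a coarser twiddle factor.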
An Efficient Hardware Implementation of Activation Functions Using Stochastic Computing for Deep Neural Networks
Van-Tinh Nguyen, Tieu-Khanh Luong, Han Le Duc, Van‐Phuc Hoang
In this paper, we present a new approximation method for non-linear activation functions, including the tanh and sigmoid functions, using stochastic computing (SC) logic based on piecewise-linear (PWL) approximation over the full range of [-1, 1]. The SC implementations with PWL approximation for non-linear functions are based on a 90nm CMOS process. The implementation results show that the proposed SC circuits provide better performance than previous methods such as the well-known Maclaurin-expansion-based, Bernstein-polynomial-based, and finite-state-machine (FSM) based implementations. The implementation results are also presented and discussed.
DOI: https://doi.org/10.1109/MCSoC2018.2018.00045 (published 2018-09-01)
Citations: 12
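The piecewise-linear step named in the abstract of "An Efficient Hardware Implementation of Activation Functions Using Stochastic Computing for Deep Neural Networks" can be sketched in plain software. The example below only illustrates PWL approximation of tanh on [-1, 1] with uniformly spaced breakpoints, which is an assumption made here; it does not model the stochastic bitstream circuits or the segment choice of the paper, and `make_pwl` is a hypothetical helper.

```python
import math


def make_pwl(f, lo, hi, segments):
    """Return a piecewise-linear interpolant of f with uniform breakpoints on [lo, hi]."""
    step = (hi - lo) / segments
    xs = [lo + k * step for k in range(segments + 1)]
    ys = [f(x) for x in xs]

    def approx(x):
        x = min(max(x, lo), hi)                      # clamp to the supported range
        k = min(int((x - lo) / step), segments - 1)  # index of the segment containing x
        t = (x - xs[k]) / step                       # position inside the segment, 0..1
        return ys[k] + t * (ys[k + 1] - ys[k])

    return approx


pwl_tanh = make_pwl(math.tanh, -1.0, 1.0, 8)
```

More segments reduce the approximation error at the cost of more stored breakpoints, which is the kind of accuracy and area trade-off a hardware implementation has to weigh.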
MARTE and IP-XACT Based Approach for Run-Time Scalable NoC
H. Kidane, E. Bourennane
Network-on-chip (NoC) based communication is increasingly used as a solution for multi-IP systems-on-chip. Considerable work has been done to improve the adaptation of the NoC to FPGA-based dynamically reconfigurable IPs. A run-time scalable NoC based on Dynamic Partial Reconfiguration (DPR) is one way to reduce the power consumed by idle components of the NoC. However, the lack of custom HDL NoC generation tools that separate the NoC rows and columns into independent components remains an open problem. In this paper, we introduce a UML/MARTE and IP-XACT based approach to model and generate run-time scalable NoC components targeting Xilinx FPGAs. The NoC is modeled by splitting it into a static sub-NoC and a series of run-time scalable rows and columns, each treated as a component. First, both the static and run-time scalable sub-NoCs are defined at a high level using UML/MARTE. Then, they are transformed into an intermediate-level XML description that follows the IP-XACT standard. Next, the XML descriptions of the top-level NoC and of the reconfigurable rows and columns are transformed into VHDL. Finally, the HDL files of the NoC are imported into Xilinx EDK to implement the dynamically scalable NoC together with the FPGA-based reconfigurable IPs. The proposed approach is validated by modeling a 3x3 NoC split into three components: a 2x2 static sub-NoC, a 2x1 reconfigurable column, and a 1x3 reconfigurable row. Small user-defined IPs are then connected to the NoC routers to implement the full system.
DOI: https://doi.org/10.1109/MCSoC2018.2018.00036 (published 2018-09-01)
Citations: 4
FPGA Acceleration to Solve Maximum Clique Problems Encoded into Partial MaxSAT
K. Kanazawa, Shaowei Cai
In this paper, we propose an FPGA solver for maximum clique problems encoded into partial maximum satisfiability (partial MaxSAT). Given a Boolean formula with hard constraints that must be satisfied and soft constraints that should be satisfied where possible, the goal of partial MaxSAT is to find a truth assignment that satisfies all hard constraints and as many soft constraints as possible. The maximum clique problem involves finding a clique with the maximum possible number of vertices in a given graph, and it can be formulated as partial MaxSAT in a natural way. The Dist algorithm is one of the best-performing local search algorithms for solving partial MaxSAT. We restructure the Dist algorithm to exploit its inherent parallelism while maintaining its accuracy on maximum clique problems, and then describe the implementation of the algorithm on an FPGA. Our FPGA solver can solve partial MaxSAT-encoded maximum clique problems up to 22 times faster than the Dist algorithm on a CPU.
DOI: https://doi.org/10.1109/MCSoC2018.2018.00043 (published 2018-09-01)
Citations: 7
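The abstract says maximum clique can be formulated "in a natural way" as partial MaxSAT. A minimal sketch of one common encoding follows, under stated assumptions: one Boolean variable per vertex, a hard clause for every pair of non-adjacent vertices (they cannot both be in the clique), and a unit soft clause per vertex. The function names, the DIMACS-style signed-integer literals, and the 5-vertex graph are hypothetical and chosen for illustration. The exhaustive solver is only a reference check for tiny instances; it is not the Dist local search algorithm and not the authors' FPGA design.

```python
from itertools import combinations, product


def encode_max_clique(num_vertices, edges):
    """Encode maximum clique as partial MaxSAT.

    Variable i+1 (1-based, DIMACS-style) is true when vertex i is in the clique.
    Returns (hard, soft); each is a list of clauses, a clause is a list of
    signed integers (negative means negated).
    """
    edge_set = {frozenset(e) for e in edges}
    # Hard clauses: two non-adjacent vertices can never both be selected.
    hard = [
        [-(u + 1), -(v + 1)]
        for u, v in combinations(range(num_vertices), 2)
        if frozenset((u, v)) not in edge_set
    ]
    # Soft clauses: prefer selecting each vertex, so satisfying more of them
    # corresponds to a larger clique.
    soft = [[v + 1] for v in range(num_vertices)]
    return hard, soft


def is_satisfied(clause, assignment):
    """A clause holds if at least one of its literals matches the assignment."""
    return any(assignment[abs(lit) - 1] == (lit > 0) for lit in clause)


def solve_exhaustively(num_vertices, hard, soft):
    """Reference solver for tiny instances only (tries all 2^n assignments)."""
    best_clique, best_score = None, -1
    for assignment in product([False, True], repeat=num_vertices):
        if not all(is_satisfied(c, assignment) for c in hard):
            continue  # violates a hard constraint, so not a clique
        score = sum(is_satisfied(c, assignment) for c in soft)
        if score > best_score:
            best_clique = [v for v in range(num_vertices) if assignment[v]]
            best_score = score
    return best_clique, best_score


# Hypothetical 5-vertex graph: a triangle 0-1-2 with a tail 2-3-4.
EDGES = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)]
hard, soft = encode_max_clique(5, EDGES)
clique, size = solve_exhaustively(5, hard, soft)
print(clique, size)  # prints [0, 1, 2] 3
```

With this encoding, every assignment that satisfies all hard clauses is a clique, and the number of satisfied soft clauses equals the clique size, which is why a partial MaxSAT solver can be used for the graph problem.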
Title Page i
DOI: https://doi.org/10.1109/mcsoc2018.2018.00001 (published 2018-09-01)
Citations: 0
Communication-Avoiding Tile QR Decomposition on CPU/GPU Heterogeneous Cluster System
M. Takayanagi, Tomohiro Suzuki
The tile algorithm for matrix decompositions is attracting attention as a method for the latest multicore/many-core architectures because it generates many fine-grained tasks that can be executed in parallel. Exploiting many parallel computing resources effectively with a fork-join paradigm is difficult. CPU/GPU heterogeneous cluster systems have become mainstream for supercomputers in recent years. Using a CPU/GPU cluster efficiently is more difficult than using a multicore cluster efficiently. We implemented the tile CAQR decomposition algorithm on a CPU/GPU cluster system with OpenMP 4.0, MPI, and cuBLAS, and propose new approaches to utilize GPUs efficiently. In this paper, we show the performance results of our implementation on the Reedbush-H heterogeneous supercomputer.
DOI: https://doi.org/10.1109/MCSoC2018.2018.00031 (published 2018-09-01)
Citations: 0
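The abstract does not spell out what "communication-avoiding" means in practice. As a rough illustration, the sketch below shows TSQR (tall-skinny QR), the reduction that communication-avoiding QR is generally built on: each row block is factored independently, and only the small triangular R factors are combined pairwise in a tree. This is a pure-Python sketch under stated assumptions, not the authors' tile CAQR implementation. It uses modified Gram-Schmidt as a stand-in for the Householder kernels normally used, the function names and the 8x2 matrix are hypothetical, and it assumes every block has at least as many rows as columns and full column rank.

```python
import math


def mgs_r(a):
    """Return the upper-triangular R of a thin QR factorization of `a`.

    `a` is a list of rows (m x n, m >= n, full column rank). Uses modified
    Gram-Schmidt, which yields an R with a positive diagonal.
    """
    m, n = len(a), len(a[0])
    cols = [[a[i][j] for i in range(m)] for j in range(n)]
    r = [[0.0] * n for _ in range(n)]
    for j in range(n):
        r[j][j] = math.sqrt(sum(x * x for x in cols[j]))
        q = [x / r[j][j] for x in cols[j]]
        # Remove the component along q from all remaining columns.
        for k in range(j + 1, n):
            r[j][k] = sum(qi * x for qi, x in zip(q, cols[k]))
            cols[k] = [x - r[j][k] * qi for x, qi in zip(cols[k], q)]
    return r


def tsqr_r(a, block_rows):
    """R factor of `a` via a binary-tree reduction over row blocks."""
    blocks = [a[i:i + block_rows] for i in range(0, len(a), block_rows)]
    # Leaf step: independent local factorizations (parallel in a real system).
    rs = [mgs_r(b) for b in blocks]
    # Tree step: stack two small R factors and re-factor; only n x n
    # triangles would need to travel between nodes.
    while len(rs) > 1:
        merged = [mgs_r(rs[i] + rs[i + 1]) for i in range(0, len(rs) - 1, 2)]
        if len(rs) % 2:
            merged.append(rs[-1])  # odd block out moves up unchanged
        rs = merged
    return rs[0]


# Hypothetical 8 x 2 matrix, split into four 2 x 2 row blocks.
A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0],
     [2.0, 1.0], [0.0, 3.0], [4.0, 1.0], [1.0, 1.0]]
r_tree = tsqr_r(A, block_rows=2)
r_direct = mgs_r(A)
max_diff = max(
    abs(x - y)
    for row_t, row_d in zip(r_tree, r_direct)
    for x, y in zip(row_t, row_d)
)
print(max_diff < 1e-9)
```

Because both routines produce an R with a positive diagonal, the tree result should agree with the direct factorization up to rounding. The point of the tree is that the data exchanged at each level is a small triangle rather than a full panel of the matrix.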
Journal
2018 IEEE 12th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)