2021 IEEE 14th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC): Latest Literature


Portability of Vectorization-aware Performance Tuning Expertise across System Generations

2021 IEEE 14th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)

Pub Date : 2021-12-01 DOI: 10.1109/MCSoC51149.2021.00043

Shunpei Sugawara, Yoichi Shimomura, Ryusuke Egawa, H. Takizawa

Even expert HPC programmers need to invest considerable time and effort in empirically establishing effective performance tuning strategies for their target systems. When the target system is changed and/or updated, it is thus preferable for expert programmers if their performance tuning expertise can be ported to the new system as much as possible. In this paper, we focus on multiple generations of NEC SX series vector systems. We have documented the performance tuning expertise for the previous generations and built a machine-usable database of performance tuning cases. This paper therefore investigates how much the recorded expertise in the database can contribute to performance tuning for the latest generation, NEC SX-Aurora TSUBASA (SX-AT). Since the system architecture as well as the software stack, such as the compilers, is completely renewed for SX-AT, this paper discusses the differences in performance tuning across system generations. In addition, this paper discusses how to express performance tuning techniques in a machine-usable way. The case study in this paper indicates that Xevolver's approach of using user-defined code transformations can express most of the vectorization-aware performance tuning techniques, and is thus promising for recording performance tuning expertise in a future-proof fashion.


Citations: 0

Text Compression Based on an Alternative Approach of Run-Length Coding Using Burrows-Wheeler Transform and Arithmetic Coding

2021 IEEE 14th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)

Pub Date : 2021-12-01 DOI: 10.1109/MCSoC51149.2021.00049

Md.Atiqur Rahman, Mohamed Hamada, Md. Asfaqur Rahman

In modern life, communication via text is becoming one of the most popular means of communication. As a result, storing text in a small format or transferring it quickly over the internet has become a challenging issue, and text compression has become an important research field. Many algorithms for text compression have already been developed, and new algorithms are being devised to fulfil the demands of current technology. This research article proposes a text compression technique based on: (i) the Burrows-Wheeler transform; (ii) an alternative method of run-length coding; (iii) finding repeated patterns more frequently; and (iv) arithmetic coding. The proposed approach is compared with other state-of-the-art methods, and gives better performance in terms of compression ratios.
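The first two stages named in the abstract can be illustrated with a minimal Python sketch. It uses the textbook Burrows-Wheeler transform and a conventional run-length coder, not the alternative run-length scheme the paper proposes (which the abstract does not specify), and it omits the arithmetic-coding stage. The function names and the `"\0"` end-of-string sentinel are assumptions made for this example.

```python
def bwt(text: str, eos: str = "\0") -> str:
    """Burrows-Wheeler transform via sorted rotations (quadratic; illustration only)."""
    if eos in text:
        raise ValueError("sentinel character must not occur in the input")
    s = text + eos
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The transform is the last column of the sorted rotation table.
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last: str, eos: str = "\0") -> str:
    """Invert the transform by repeatedly prepending the last column and sorting."""
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(last[i] + table[i] for i in range(len(last)))
    # The row that ends with the sentinel is the original text plus sentinel.
    row = next(r for r in table if r.endswith(eos))
    return row[:-1]


def run_length_encode(s: str) -> list[tuple[str, int]]:
    """Conventional run-length coding: one (symbol, run length) pair per run."""
    runs: list[tuple[str, int]] = []
    for ch in s:
        if runs and runs[-1][0] == ch:
            runs[-1] = (ch, runs[-1][1] + 1)
        else:
            runs.append((ch, 1))
    return runs


transformed = bwt("banana")
runs = run_length_encode(transformed)
restored = inverse_bwt(transformed)
```

The transform tends to group identical symbols together, which is what makes a run-length stage worthwhile before entropy coding.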


Citations: 0

Accelerated on-Chip Algorithm Based on Semantic Region-Based Partial Difference Detection for LiDAR-Vision Depth Data Transmission Reduction in Lightweight Controller Systems of Autonomous Vehicle

2021 IEEE 14th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)

Pub Date : 2021-12-01 DOI: 10.1109/MCSoC51149.2021.00011

Dong-gill Jung, Dae-Geun Park

LiDAR sensors are one type of sensor used in autonomous driving vehicles; they obtain distance data from the time of flight of light. A LiDAR sensor can measure data at high speed, and the precision of the data is higher than with other sensors. A large amount of data is transmitted from the sensors per sensing time. Autonomous driving vehicles use many electronic devices, so the data channels they use and the domain control unit resources that control the system are limited. In this environment, if LiDAR sensor data can be reduced without compromising the original data, it can have a substantial positive impact on autonomous vehicle systems. In this paper, we propose a differential partial update for data reduction of LiDAR sensors and a semantic detection to eliminate the resulting noise and increase the reliability of the data. The sensor processor extracts only the changed parts of the continuous distance data, excluding the unchanged parts, and transmits them to the host. The high-difference noise is eliminated by filtering through a window-sliding operation. Semantic detection marks only parts that change and detects movement in the field of view. Simple differential partial updates reduce the amount of data by 59.31% based on a simple case. A semantic detection partial update can reduce the amount of data by 83.41%. This process can also reduce computing time by 61.36% with graphics processing unit acceleration.
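The idea of sending only the changed samples can be sketched in a few lines of Python. This is an illustrative assumption, not the paper's on-chip algorithm: frames are modelled as flat lists of distances, and the function names, the `threshold` parameter, and the isolated-change rule used for the sliding-window filter are all hypothetical.

```python
def partial_update(prev, curr, threshold=0.0):
    """Return (index, value) pairs for samples that differ from the previous frame."""
    return [
        (i, c)
        for i, (p, c) in enumerate(zip(prev, curr))
        if abs(c - p) > threshold
    ]


def apply_update(prev, update):
    """Host side: rebuild the current frame from the previous frame plus the update."""
    frame = list(prev)
    for i, value in update:
        frame[i] = value
    return frame


def filter_isolated(update, window=3, min_hits=2):
    """Sliding-window filter (assumed rule): drop a change that has too few
    changed neighbours inside a window centred on it, treating it as noise."""
    changed = {i for i, _ in update}
    half = window // 2
    kept = []
    for i, value in update:
        hits = sum(1 for j in range(i - half, i + half + 1) if j in changed)
        if hits >= min_hits:
            kept.append((i, value))
    return kept


prev_frame = [10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0]
curr_frame = [10.0, 12.0, 12.5, 10.0, 10.0, 10.0, 30.0, 10.0]

update = partial_update(prev_frame, curr_frame)   # only 3 of 8 samples are sent
rebuilt = apply_update(prev_frame, update)        # exact reconstruction
denoised = filter_isolated(update)                # the lone spike at index 6 is dropped
```

Without the filter the reconstruction is exact; with it, the update becomes lossy by design, since isolated changes are discarded.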


Citations: 0

UI Method to Support Knowledge Creation in Hybrid Museum Experience

2021 IEEE 14th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)

Pub Date : 2021-12-01 DOI: 10.1109/MCSoC51149.2021.00050

Toru Tamahashi, R. Yoshioka, Takayuki Hoshino

A user interaction method to support knowledge creation in a hybrid museum experience is proposed and evaluated. The method incorporates a knowledge creation process of visitor experiences into the interaction scheme of the user interface, based on two intentions. The first intention is to invoke user actions required for an effective knowledge experience, including individual learning. The second intention is to document the knowledge with sufficient information for sharing and reuse. The method is designed as part of an application for a hybrid museum experience such that the digital device does not distract the visitor from the museum exhibit. This paper presents the proposed UI method, its implementation in the application, and an evaluation study of its effects. The evaluation was conducted with a group of curators to obtain professional feedback on the method's effect on observation behavior and knowledge creation. As a result, we found that a user interface for expressing one's own impressions and seeing the impressions of others helped to deepen the understanding of the exhibits.


Citations: 1

Performance Estimation of High-Level Dataflow Program on Heterogeneous Platforms

2021 IEEE 14th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)

Pub Date : 2021-12-01 DOI: 10.1109/MCSoC51149.2021.00018

Aurelien Bloch, S. Brunet, M. Mattavelli

The performance of programs written in languages following the dataflow model of computation (MoC) largely depends on the configuration (partitioning, mapping, scheduling, buffer dimensioning) chosen during the synthesis stages. Furthermore, this programming paradigm is particularly well suited for heterogeneous parallel systems because it is inherently free of memory contention and exposes opportunities for parallelism. Both of these statements show the necessity for a way to easily and automatically evaluate and find good design configurations. The paper describes the methodology required for clock-accurate profiling of high-level dataflow programs written in RVL-CAL when synthesized on heterogeneous CPU/GPU co-processing platforms. It also extends to the heterogeneous paradigm an existing methodology for qualitatively estimating the performance of such programs as a function of the provided configuration, without the need to synthesize and profile every single configuration on the actual hardware platform. This approach is validated using two application programs and several configurations.


Citations: 1

Data Fusion Driven Lane-level Precision Data Transmission for V2X Road Applications

2021 IEEE 14th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)

Pub Date : 2021-12-01 DOI: 10.1109/MCSoC51149.2021.00031

Albert Budi Christian, Chih-Yu Lin, Lan-Da Van, Y. Tseng

Inter-vehicle communication is being developed continuously in order to deliver a better driving experience. Through the exchange of information between vehicles and the Road Side Unit (RSU), the number of accidents can be reduced by notifying drivers of the facts obtained. In general, broadcast information for vehicles is sent in an ad hoc manner. However, unfiltered information may be useless and wasted for most vehicles. This raises the question of whether precise information can be delivered only to the target vehicles without interfering with other non-target vehicles. To attain this objective, a computer vision (CV) and sensor fusion-based transmission system, in which messages are exchanged between the RSU and the vehicle On-board Unit (OBU), is developed. In order to correctly transmit the specific information to the target vehicles, we propose a data fusion driven lane-level precision data transmission system that utilizes three kinds of sensory inputs: Road Side Camera (RSC), GPS, and magnetometer. By combining common features from these sensory inputs, our system is able to select the receiver of specific information on the road. Our system focuses on the scenario where a message is to be transmitted to the target vehicles located in a certain lane. The experimental evaluation shows a recognition rate of 87.34%, and the generated messages have a total delay of less than 72 ms.


Citations: 0

A Memory-Access-Minimized BCNN Accelerator Using Nonvolatile FPGA with Only-Once-Write Shifting

2021 IEEE 14th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)

Pub Date : 2021-12-01 DOI: 10.1109/MCSoC51149.2021.00021

D. Suzuki, Takahiro Oka, T. Hanyu

A binary convolutional neural network (BCNN) accelerator using a nonvolatile field-programmable gate array (NV-FPGA) with only-once-write shifting is presented. During the basic operation of the BCNN, the feature maps and weights are read from the block RAM (BRAM) and serially transferred to processing elements. The use of only-once-write shifting makes it possible to greatly reduce the write power consumption of such serial data transfer in the NV-FPGA. Meanwhile, since the BCNN computation is composed of nested loops, the memory access potentially has temporal locality. This means that once the data is read from the BRAM, it can be reused among several layers. By focusing on this feature and performing loop interchange, the number of memory accesses can be minimized and the idle time maximized. If the BRAM is nonvolatile, wasted standby energy consumption during the idle state is completely eliminated by the use of a power gating technique. As a result, the proposed BCNN accelerator consumes 66.5% less energy than a conventional volatile-FPGA-based BCNN accelerator in a typical digit recognition task with the MNIST dataset.
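For readers unfamiliar with binary networks, the arithmetic a BCNN processing element performs is commonly implemented as an XNOR followed by a population count. The Python sketch below shows that standard formulation only; it is not taken from the paper and does not model the NV-FPGA, the only-once-write shifting, or the loop interchange. The helper names and the bit encoding (bit 1 for +1, bit 0 for -1) are assumptions of this example.

```python
def pack(values):
    """Pack a list of +1/-1 values into an integer, element i in bit i (+1 -> 1, -1 -> 0)."""
    bits = 0
    for i, v in enumerate(values):
        if v > 0:
            bits |= 1 << i
    return bits


def binary_dot(a_bits: int, w_bits: int, n: int) -> int:
    """Dot product of two n-element +1/-1 vectors using XNOR and popcount."""
    mask = (1 << n) - 1
    # XNOR yields 1 wherever the two signs agree.
    matches = bin(~(a_bits ^ w_bits) & mask).count("1")
    # Each agreement contributes +1 and each disagreement -1.
    return 2 * matches - n


activations = [+1, -1, +1, +1]
weights = [+1, +1, -1, +1]

fast = binary_dot(pack(activations), pack(weights), len(weights))
reference = sum(a * w for a, w in zip(activations, weights))
```

Because each operand is a single bit, feature maps and weights can be moved serially, which is the kind of transfer the abstract describes.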


Citations: 0

Variable Bit-Precision Vector Extension for RISC-V Based Processors

2021 IEEE 14th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)

Pub Date : 2021-12-01 DOI: 10.1109/MCSoC51149.2021.00024

RK Risikesh, Sharad Sinha, N. Rao

Neural network model execution is becoming an increasingly compute-intensive task. With advances in optimisation techniques such as lower bit-width precision, quantization, and model compression, we need to find efficient ways of implementing these techniques. Most Instruction Set Architectures (ISAs) do not support low bit-width vector instructions. In this work, we present an extension to the vector specification of the RISC-V ISA, which is targeted towards supporting lower bit-width or variable-precision (1 to 16 bits) Multiply and Accumulate (MAC) operations. We demonstrate our proposed ISA extension by integrating it with a RISC-V processor named PicoRV32, which is considered the baseline processor in the proposed work. We introduce the feature of bit-serial multiplication along with variable bit-precision support to demonstrate the advantage over a 16-bit baseline processor model. We also build an assembler for the proposed instructions for easier integration into the testbench of the RTL model. We implement the processor on a Xilinx Zynq based FPGA. We observe that, compared to the baseline RISC-V vector processor, which only supports 8-, 16-, and 32-bit vector instructions, our processor with variable precision support (1 to 16 bits) performs 1.14x faster on average on a matrix multiplication test program. The proposed processor architecture reduces the memory footprint by up to 1.88x compared with a baseline 16-bit vector processor.
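Bit-serial multiplication, which the abstract names as the mechanism behind variable precision, can be sketched in software as a shift-and-add loop whose iteration count equals the operand width. The Python below is a behavioural illustration under stated assumptions (unsigned operands, hypothetical function names); it is not the paper's RTL and says nothing about the proposed instruction encoding.

```python
def bit_serial_multiply(a: int, b: int, bits: int) -> int:
    """Multiply two unsigned `bits`-wide operands, consuming one bit of b per step."""
    limit = 1 << bits
    if not (0 <= a < limit and 0 <= b < limit):
        raise ValueError("operands must fit in the requested bit width")
    acc = 0
    for k in range(bits):          # the number of steps scales with the precision
        if (b >> k) & 1:
            acc += a << k          # add the shifted multiplicand for each set bit
    return acc


def bit_serial_mac(xs, ws, bits: int) -> int:
    """Multiply-accumulate over two equal-length vectors at the given precision."""
    acc = 0
    for x, w in zip(xs, ws):
        acc += bit_serial_multiply(x, w, bits)
    return acc


product = bit_serial_multiply(5, 3, bits=3)             # 3 steps for 3-bit operands
dot_2bit = bit_serial_mac([1, 2, 3], [3, 2, 1], bits=2)  # 2 steps per product
```

The loop makes the trade-off visible: a narrower operand width means fewer steps per product, which is one reason low-precision MACs can be cheaper.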


Citations: 0

Parasitic-Aware Modelling for Neural Networks Implemented with Memristor Crossbar Array

2021 IEEE 14th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)

Pub Date : 2021-12-01 DOI: 10.1109/MCSoC51149.2021.00025

T. Cao, Chen Liu, Yuan Gao, W. Goh

This paper presents a parasitic-aware modelling approach, called the αβ-matrix model, for the simulation of neural networks (NNs) implemented with a memristor crossbar array. The line resistance, which is the key parasitic element in a memristor crossbar array, is analyzed and incorporated into the model. The proposed method estimates the line resistance IR drop with a computational complexity of O(mn), in contrast to the O(m²n²) required by the classical matrix-based Kirchhoff's Current Law (KCL) equation solver. The impact of the crossbar array parasitics on the vector-matrix multiplication (VMM) computation and on multi-layer NN classification accuracy is also analyzed. The advantages of the proposed parasitic-aware model are demonstrated through an example of a 2-layer perceptron implemented with a resistive random access memory (RRAM) crossbar array for MNIST handwritten digit classification. A classification accuracy of 97.3% is achieved on 64×64 6-bit RRAM crossbar arrays. Compared to the KCL solver, the classification accuracy degradation is less than 0.4% with line resistance up to 4.5Ω.
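As background for the vector-matrix multiplication mentioned above, the ideal (parasitic-free) crossbar computes each column current as the sum of row voltages weighted by the cell conductances, I_j = Σ_i V_i·G_ij. The Python sketch below shows only this ideal case; it deliberately ignores line resistance and is not the αβ-matrix model, whose details the abstract does not give. The function name and the example values are assumptions.

```python
def crossbar_vmm(voltages, conductances):
    """Ideal crossbar VMM: column current j is the sum over rows i of V_i * G_ij.

    `conductances` is a list of m rows, each with n column entries, so the
    ideal computation touches each of the m*n cells exactly once.
    """
    rows = len(conductances)
    cols = len(conductances[0])
    if len(voltages) != rows:
        raise ValueError("one input voltage is required per crossbar row")
    return [
        sum(voltages[i] * conductances[i][j] for i in range(rows))
        for j in range(cols)
    ]


row_voltages = [1.0, 0.5]          # volts applied to the two rows
cell_conductances = [              # siemens; illustrative values only
    [2.0, 1.0],
    [4.0, 0.0],
]

column_currents = crossbar_vmm(row_voltages, cell_conductances)
```

With non-zero line resistance the voltage seen by each cell drops along the wires, so the real column currents deviate from these ideal values; estimating that deviation is the problem the paper's model addresses.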


Citations: 2

[Title page]

2021 IEEE 14th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)

Pub Date : 2021-12-01 DOI: 10.1109/mcsoc51149.2021.00002

Citations: 0
