2018 1st Workshop on Energy Efficient Machine Learning and Cognitive Computing for Embedded Applications (EMC2): Latest Articles


Deep Learning Inference on Embedded Devices: Fixed-Point vs Posit


Pub Date : 2018-05-22 DOI: 10.1109/EMC2.2018.00012

Seyed Hamed Fatemi Langroudi, Tej Pandit, D. Kudithipudi

Performing the inference step of deep learning in resource-constrained environments, such as embedded devices, is challenging. Success requires optimization at both the software and hardware levels. Low-precision arithmetic, and specifically low-precision fixed-point number systems, have become the standard for performing deep learning inference. However, representing non-uniformly distributed data and parameters (e.g. weights) with uniformly distributed fixed-point values remains a major drawback of this number system. Recently, the posit number system was proposed, which represents numbers in a non-uniform manner. Therefore, in this paper we explore using the posit number system to represent the weights of Deep Convolutional Neural Networks. We do not apply any quantization techniques, and hence the network weights do not require re-training. The results of this exploration show that the posit number system outperforms the fixed-point number system in terms of accuracy and memory utilization.
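As a rough illustration of the contrast drawn in this abstract, the following Python sketch decodes 8-bit posits and compares their spacing with that of an 8-bit fixed-point format. The function names (`decode_posit`, `decode_fixed`), the choice of one exponent bit (`es = 1`), and the 4 fractional bits are assumptions made for illustration only; the abstract does not state which configurations the authors evaluated.

```python
def decode_posit(bits: int, n: int = 8, es: int = 1) -> float:
    """Decode an n-bit posit with es exponent bits into a Python float."""
    mask = (1 << n) - 1
    bits &= mask
    if bits == 0:
        return 0.0
    if bits == 1 << (n - 1):
        return float("nan")  # NaR ("not a real")
    sign = bits >> (n - 1)
    if sign:
        bits = (-bits) & mask  # negative posits are stored in two's complement
    # The n-1 bits after the sign, most significant first.
    body = [(bits >> i) & 1 for i in range(n - 2, -1, -1)]
    # Regime: a run of identical bits, ended by the opposite bit (or the word end).
    first = body[0]
    run = 1
    while run < len(body) and body[run] == first:
        run += 1
    k = run - 1 if first == 1 else -run
    rest = body[run + 1:]  # skip the bit that terminates the regime
    # Exponent: up to es bits; bits cut off by the word end count as zero.
    exp_bits = rest[:es]
    exp = 0
    for b in exp_bits:
        exp = (exp << 1) | b
    exp <<= es - len(exp_bits)
    # Fraction: whatever is left, with an implicit leading 1.
    frac = sum(b / (1 << (i + 1)) for i, b in enumerate(rest[es:]))
    value = 2.0 ** (k * (1 << es) + exp) * (1.0 + frac)
    return -value if sign else value


def decode_fixed(bits: int, n: int = 8, frac_bits: int = 4) -> float:
    """Decode an n-bit two's-complement fixed-point word (assumed format)."""
    bits &= (1 << n) - 1
    if bits >> (n - 1):
        bits -= 1 << n
    return bits / (1 << frac_bits)


if __name__ == "__main__":
    # Gap between neighbouring codes near 1.0 and near the largest value.
    print("posit gap near 1.0 :", decode_posit(0b01000001) - decode_posit(0b01000000))
    print("posit gap near max :", decode_posit(0b01111111) - decode_posit(0b01111110))
    print("fixed gap near 1.0 :", decode_fixed(0b00010001) - decode_fixed(0b00010000))
    print("fixed gap near max :", decode_fixed(0b01111111) - decode_fixed(0b01111110))
```

In this sketch the fixed-point step is the same everywhere, while the posit step is small near 1.0 and grows toward the extremes, which is the non-uniform behaviour the abstract refers to.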


Citations: 36

Event Prediction in Processors Using Deep Temporal Models


Pub Date : 2018-03-25 DOI: 10.1109/EMC2.2018.00014

Tharindu Mathew, Aswin Raghavan, S. Chai

To achieve high processing efficiencies, next-generation computer architecture designs need an effective Artificial Intelligence (AI) framework to learn large-scale processor interactions. In this short paper, we present Deep Temporal Models (DTMs) that offer effective and scalable time-series representations to address key challenges for learning processor data: high data rate, cyclic patterns, and high dimensionality. We present our approach using DTMs to learn and predict processor events. We show comparisons using these learning models with promising initial simulation results.


Citations: 0

A High Efficiency Accelerator for Deep Neural Networks


Pub Date : 2018-03-01 DOI: 10.1109/EMC2.2018.00010

Aliasger Zaidy, Andre Xian Ming Chang, Vinayak Gokhale, E. Culurciello

Deep Neural Networks (DNNs) are the current state of the art for various tasks such as object detection, natural language processing and semantic segmentation. These networks are massively parallel, hierarchical models, with each level of the hierarchy performing millions of operations on a single input. The enormous amount of parallel computation makes these DNNs suitable for custom acceleration. Custom accelerators can provide real-time inference of DNNs at low power, thus enabling widespread embedded deployment. In this paper, we present Snowflake, a high-efficiency, low-power accelerator for DNNs. Snowflake was designed to achieve optimum occupancy at low bandwidths and is agnostic to the network architecture. Snowflake was implemented on the Xilinx Zynq XC7Z045 APSoC and achieves a peak performance of 128 G-ops/s. Snowflake is able to maintain a throughput of 98 FPS on AlexNet while averaging 1.2 GB/s of memory bandwidth.


Citations: 0

A Quantization-Friendly Separable Convolution for MobileNets


Pub Date : 2018-03-01 DOI: 10.1109/EMC2.2018.00011

Tao Sheng, Chen Feng, Shaojie Zhuo, Xiaopeng Zhang, Liang Shen, M. Aleksic

As deep learning (DL) is rapidly pushed to edge computing, researchers have invented various ways to make inference computation more efficient on mobile/IoT devices, such as network pruning and parameter compression. Quantization, as one of the key approaches, can effectively offload the GPU and make it possible to deploy DL on a fixed-point pipeline. Unfortunately, not all existing network designs are friendly to quantization. For example, while the popular lightweight MobileNetV1 successfully reduces parameter size and computation latency with separable convolution, our experiments show that its quantized models have a large performance gap against its floating-point models. To resolve this, we analyzed the root cause of the quantization loss and proposed a quantization-friendly separable convolution architecture. Evaluated on the image classification task with the ImageNet2012 dataset, our modified MobileNetV1 model achieves an 8-bit inference top-1 accuracy of 68.03%, almost closing the gap to the float pipeline.
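For readers unfamiliar with 8-bit inference, the following Python sketch shows a generic per-tensor affine quantization round trip and how a single large value widens the quantization step for everything else. It is a textbook-style illustration under assumed names (`quantize`, `dequantize`, `max_abs_error`) and made-up data; it is not the authors' architecture and does not claim to reproduce the root cause identified in their paper.

```python
def quantize(values, num_bits=8):
    """Map floats to unsigned integers with one scale and zero point per tensor."""
    qmin, qmax = 0, (1 << num_bits) - 1
    # The representable range always includes 0.0 so that zero maps to an integer.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 for an all-zero tensor
    zero_point = round(qmin - lo / scale)
    q = [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    """Recover approximate floats from the integer codes."""
    return [(x - zero_point) * scale for x in q]


def max_abs_error(original, recovered):
    """Largest absolute round-trip error over paired elements."""
    return max(abs(a - b) for a, b in zip(original, recovered))


if __name__ == "__main__":
    small = [0.01 * i for i in range(-5, 6)]  # values in a narrow range
    with_outlier = small + [10.0]             # same values plus one large one

    q, s, z = quantize(small)
    print("step without outlier:", s,
          "error:", max_abs_error(small, dequantize(q, s, z)))

    q, s, z = quantize(with_outlier)
    # zip stops at the shorter list, so only the small values are compared.
    print("step with outlier   :", s,
          "error:", max_abs_error(small, dequantize(q, s, z)))
```

The round-trip error on the small values grows with the step size, which is why the dynamic range of a tensor matters when moving a network to a fixed-point pipeline.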


Citations: 101

A Case for Dynamic Activation Quantization in CNNs


Pub Date : 2018-03-01 DOI: 10.1109/EMC2.2018.00009

Karl Taht, Surya Narayanan, R. Balasubramonian

It is well established that CNNs are robust enough to tolerate low-precision computations without any significant loss in accuracy. Prior works exploit this fact and try to allocate different precisions to different layers (for both weights and activations), depending on the importance of a layer's precision in dictating the prediction accuracy. In all these works, the layer-wise precision of weights and activations is decided for a network by performing an offline design space exploration as well as retraining of weights. While these approaches show significant energy improvements, they make global decisions for precision requirements. In this project, we try to answer the question "Can we vary the inter- and intra-layer bit-precision based on the region-wise importance of the individual input?" The intuition behind this is that for a particular image, there might be regions that can be considered as background or unimportant for the network to make its final prediction. As these inputs propagate through the network, the regions of less importance in the same feature map can tolerate lower precision. Using metrics such as entropy, color gradient, and points of interest, we argue that a region of an image can be labeled important or unimportant, thus enabling lower precision for unimportant pixels. We show that per-input activation quantization can reduce computational energy by up to 33.5% or 42.0% while maintaining original Top-1 and Top-5 accuracies, respectively.
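One of the metrics named in this abstract, entropy, can be sketched in a few lines of Python. The example below computes the Shannon entropy of a grayscale patch and picks a bit-width from it. The function names, the threshold of 2.0 bits, and the 4-bit/8-bit choices are hypothetical values chosen for illustration; the abstract does not specify how the authors map a metric to a precision.

```python
import math
from collections import Counter


def patch_entropy(pixels):
    """Shannon entropy (in bits) of the intensity histogram of a patch."""
    counts = Counter(pixels)
    total = len(pixels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def choose_bits(pixels, threshold=2.0, low_bits=4, high_bits=8):
    """Assign a higher bit-width to patches whose entropy reaches the threshold.

    The threshold and both bit-widths are illustrative assumptions.
    """
    return high_bits if patch_entropy(pixels) >= threshold else low_bits


if __name__ == "__main__":
    flat_patch = [128] * 16          # uniform background: a single intensity
    textured_patch = list(range(16)) # 16 distinct intensities
    print("flat    :", patch_entropy(flat_patch), choose_bits(flat_patch))
    print("textured:", patch_entropy(textured_patch), choose_bits(textured_patch))
```

A flat patch has zero entropy and would be labeled unimportant under this rule, while a patch with many distinct intensities would keep full precision.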


Citations: 0

Invited Talk Abstract: Introducing ReQuEST: An Open Platform for Reproducible and Quality-Efficient Systems-ML Tournaments


Pub Date : 2018-03-01 DOI: 10.1109/emc2.2018.00008

G. Fursin

Co-designing efficient machine-learning-based systems across the whole application/hardware/software stack to trade off speed, accuracy, energy and costs is becoming extremely complex and time consuming. Researchers often struggle to evaluate and compare different published works across rapidly evolving software frameworks, heterogeneous hardware platforms, compilers, libraries, algorithms, data sets, models, and environments. I will present our community effort to develop an open co-design tournament platform with an online public scoreboard based on the Collective Knowledge (CK) workflow framework. It gradually incorporates best research practices while providing a common way for multidisciplinary researchers to optimize and compare the quality vs. efficiency Pareto optimality of various workloads on diverse and complete hardware/software systems. All the winning solutions will be made available to the community as portable and customizable "plug&play" components with a common API to accelerate research and innovation! I will then discuss how our open competition and collaboration can help to achieve energy efficiency for cognitive workloads, based on energy-efficient submissions from the 1st ReQuEST tournament co-located with ASPLOS'18. Further details: http://cKnowledge.org/request


Citations: 1

Moving CNN Accelerator Computations Closer to Data


Pub Date : 2018-03-01 DOI: 10.1109/EMC2.2018.00015

Sumanth Gudaparthi, Surya Narayanan, R. Balasubramonian

A significant fraction of energy in recent CNN accelerators is dissipated in moving operands between storage and compute units. In this work, we re-purpose the CPU's last level cache to perform in-situ dot-product computations, thus significantly reducing data movement. Since a last level cache has several subarrays, many such dot-products can be performed in parallel, thus boosting throughput as well. The in-situ operation does not require analog circuits; it is performed with a bit-wise AND of two subarray rows, followed by digital aggregation of partial sums. The proposed architecture yields a 2.74× improvement in throughput and a 6.31× improvement in energy, relative to a DaDianNao baseline. This is primarily because the proposed architecture eliminates a large fraction of data transfers over H-Tree interconnects in the cache.
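The arithmetic idea behind "a bit-wise AND of two subarray rows, followed by digital aggregation of partial sums" can be sketched in software. The Python example below is one plausible reading, assuming unsigned integer operands: each bit position of the inputs and weights is packed into a row, pairs of rows are ANDed, the set bits are counted, and the counts are shifted and summed. The function names and the row layout are assumptions for illustration and do not describe the actual cache organisation in the paper.

```python
def bit_plane(values, p):
    """Pack bit p of every element into one integer row (element i -> bit i)."""
    row = 0
    for i, v in enumerate(values):
        row |= ((v >> p) & 1) << i
    return row


def and_dot_product(a, w, num_bits=8):
    """Dot product of unsigned integers using only AND, popcount, shift and add.

    Assumes every element of a and w satisfies 0 <= x < 2**num_bits.
    """
    if len(a) != len(w):
        raise ValueError("operand vectors must have the same length")
    total = 0
    for p in range(num_bits):
        a_row = bit_plane(a, p)
        for q in range(num_bits):
            w_row = bit_plane(w, q)
            # AND of two rows: bit i is set when a[i] has bit p and w[i] has bit q.
            partial = bin(a_row & w_row).count("1")
            # Digital aggregation: weight the partial sum by 2**(p + q).
            total += partial << (p + q)
    return total


if __name__ == "__main__":
    activations = [3, 5, 7]
    weights = [2, 4, 6]
    print(and_dot_product(activations, weights))
    print(sum(x * y for x, y in zip(activations, weights)))
```

Both printed values agree because the product of two unsigned integers is the sum, over all bit pairs, of the ANDed bits weighted by the corresponding power of two.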


Citations: 1

Keynote Abstract: Safety and Security at the Heart of Autonomous Driving


Pub Date : 2018-03-01 DOI: 10.1109/EMC2.2018.00006

K. Khouri

The automotive industry is undergoing a revolution with connected, autonomous and electric vehicles and the benefits they can bring to the public. Drivers enjoying their daily commute, fewer road fatalities and less pollution are all possible thanks to new technologies. Car makers need to offer these features but at the same time make sure vehicles are safe and secure. In the coming years, there will be various levels of automation until we have fully autonomous vehicles. To achieve any level of automation, cars need to connect to other vehicles, connect to the infrastructure, sense the environment through various sensors such as camera and radar, and then make maneuvering decisions based on all these inputs. Artificial intelligence is and will be deployed heavily to accomplish many of the tasks of autonomous driving. Perception and decision-making based on artificial intelligence introduce an entirely new set of challenges for car makers, who must ensure there are no security compromises and prove that the decisions being made are functionally, behaviorally and environmentally safe. The challenge can be described in a simple question: "If a machine learning based car system is accurate 99% of the time, are you willing to ride this car knowing that it will be wrong 1% of the time? What is the consequence of that incorrect decision?" Deep expertise and research in the safety and security aspects of AI are needed to ensure future mass deployment and success in the area of autonomous driving.


Citations: 5

Efficient Compiler Code Generation for Deep Learning Snowflake Co-Processor


Pub Date : 2018-03-01 DOI: 10.1109/EMC2.2018.00013

Andre Xian Ming Chang, Aliasger Zaidy, E. Culurciello

Deep Neural Networks (DNNs) are widely used in various applications including image classification, semantic segmentation and natural language processing. Various DNN models have been developed to achieve high accuracy on different tasks. Efficiently mapping the workflow of those models onto custom accelerators requires programmable hardware and a custom compiler. In this work, we use Snowflake, a programmable DNN-targeted accelerator. We also present a compiler that correctly generates code for Snowflake. Our system was evaluated on various convolution layers present in AlexNet, ResNet and LightCNN. Snowflake with 256 processing units was implemented on a Xilinx FPGA, and it achieved 70 frames/s for AlexNet without linear layers.


Citations: 2

Invited Talk Abstract: Challenges and Solutions for Embedding Vision AI


Pub Date : 2018-03-01 DOI: 10.1109/EMC2.2018.00007

Charles Qi

Recently, computer vision and neural network based AI technology have seen explosive demand in embedded systems such as robots, drones, and autonomous vehicles. Due to cost and power constraints, it remains quite challenging to achieve satisfactory performance while maintaining power efficiency and scalability for embedded vision AI. This presentation first analyzes the technical challenges of embedding vision AI from the perspectives of algorithm complexity, computation and memory BW demands, and constraints of the power consumption profile. The analysis shows that modern neural networks for vision AI contain complex topology and diversified computation steps. These neural networks are often part of a large embedded vision processing pipeline, intermixed with conventional vision algorithms. As a result, the vision AI implementation demands several TOPS of computation performance and tens of GB memory BW. Subsequently, the architecture of the Tensilica Vision AI DSP processor technology is presented with three distinctive advantages. The optimized instruction sets of the Vision P6 and Vision C5 DSPs are explained as examples of achieving instruction-level computation efficiency and performance. This is coupled with unique processor architecture features for achieving SoC-level data processing efficiency and scalability that lead to a high-performance vision AI sub-system. The fully automated AI optimization framework, software libraries and tools provide a practical performance tuning methodology and rapid turn-around time for embedded vision AI system design. In conclusion, the presentation offers considerations for future research and development to bring embedded vision AI to the next performance level.


Citations: 4
