ACM Transactions on Embedded Computing Systems最新文献_第7页

Sound Mixed Fixed-Point Quantization of Neural Networks 神经网络的声音混合不动点量化

3区计算机科学 Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

ACM Transactions on Embedded Computing Systems

Pub Date : 2023-09-09 DOI: 10.1145/3609118

Debasmita Lohar, Clothilde Jeangoudoux, Anastasia Volkova, Eva Darulova

Neural networks are increasingly being used as components in safety-critical applications, for instance, as controllers in embedded systems. Their formal safety verification has made significant progress but typically considers only idealized real-valued networks. For practical applications, such neural networks have to be quantized, i.e., implemented in finite-precision arithmetic, which inevitably introduces roundoff errors. Choosing a suitable precision that is both guaranteed to satisfy a roundoff error bound to ensure safety and that is as small as possible to not waste resources is highly nontrivial to do manually. This task is especially challenging when quantizing a neural network in fixed-point arithmetic, where one can choose among a large number of precisions and has to ensure overflow-freedom explicitly. This paper presents the first sound and fully automated mixed-precision quantization approach that specifically targets deep feed-forward neural networks. Our quantization is based on mixed-integer linear programming (MILP) and leverages the unique structure of neural networks and effective over-approximations to make MILP optimization feasible. Our approach efficiently optimizes the number of bits needed to implement a network while guaranteeing a provided error bound. Our evaluation on existing embedded neural controller benchmarks shows that our optimization translates into precision assignments that mostly use fewer machine cycles when compiled to an FPGA with a commercial HLS compiler than code generated by (sound) state-of-the-art. Furthermore, our approach handles significantly more benchmarks substantially faster, especially for larger networks.

神经网络越来越多地被用作安全关键应用的组件，例如，作为嵌入式系统中的控制器。他们的正式安全验证取得了重大进展，但通常只考虑理想化的实值网络。在实际应用中，这种神经网络必须被量化，即在有限精度算法中实现，这不可避免地引入了舍入误差。选择一个合适的精度，既要保证满足舍入误差界限以确保安全，又要尽可能小以避免浪费资源，这是手工完成的非常重要的工作。当使用定点算法对神经网络进行量化时，这一任务尤其具有挑战性，因为人们可以在大量精度中进行选择，并且必须明确地确保溢出自由。本文提出了第一个健全的全自动混合精度量化方法，专门针对深度前馈神经网络。我们的量化基于混合整数线性规划(MILP)，并利用神经网络的独特结构和有效的过逼近使MILP优化可行。我们的方法有效地优化了实现网络所需的比特数，同时保证了提供的错误界限。我们对现有嵌入式神经控制器基准的评估表明，当使用商业HLS编译器编译到FPGA时，我们的优化转化为精度分配，与使用(声音)最先进的代码生成的代码相比，大多数情况下使用更少的机器周期。此外，我们的方法处理更多基准测试的速度明显更快，特别是对于较大的网络。

{"title":"Sound Mixed Fixed-Point Quantization of Neural Networks","authors":"Debasmita Lohar, Clothilde Jeangoudoux, Anastasia Volkova, Eva Darulova","doi":"10.1145/3609118","DOIUrl":"https://doi.org/10.1145/3609118","url":null,"abstract":"Neural networks are increasingly being used as components in safety-critical applications, for instance, as controllers in embedded systems. Their formal safety verification has made significant progress but typically considers only idealized real-valued networks. For practical applications, such neural networks have to be quantized, i.e., implemented in finite-precision arithmetic, which inevitably introduces roundoff errors. Choosing a suitable precision that is both guaranteed to satisfy a roundoff error bound to ensure safety and that is as small as possible to not waste resources is highly nontrivial to do manually. This task is especially challenging when quantizing a neural network in fixed-point arithmetic, where one can choose among a large number of precisions and has to ensure overflow-freedom explicitly. This paper presents the first sound and fully automated mixed-precision quantization approach that specifically targets deep feed-forward neural networks. Our quantization is based on mixed-integer linear programming (MILP) and leverages the unique structure of neural networks and effective over-approximations to make MILP optimization feasible. Our approach efficiently optimizes the number of bits needed to implement a network while guaranteeing a provided error bound. Our evaluation on existing embedded neural controller benchmarks shows that our optimization translates into precision assignments that mostly use fewer machine cycles when compiled to an FPGA with a commercial HLS compiler than code generated by (sound) state-of-the-art. Furthermore, our approach handles significantly more benchmarks substantially faster, especially for larger networks.","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136192416","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Probabilistic Black-Box Checking via Active MDP Learning 基于主动MDP学习的概率黑盒检验

3区计算机科学 Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

ACM Transactions on Embedded Computing Systems

Pub Date : 2023-09-09 DOI: 10.1145/3609127

Junya Shijubo, Masaki Waga, Kohei Suenaga

We introduce a novel methodology for testing stochastic black-box systems, frequently encountered in embedded systems. Our approach enhances the established black-box checking (BBC) technique to address stochastic behavior. Traditional BBC primarily involves iteratively identifying an input that breaches the system’s specifications by executing the following three phases: the learning phase to construct an automaton approximating the black box’s behavior, the synthesis phase to identify a candidate counterexample from the learned automaton, and the validation phase to validate the obtained candidate counterexample and the learned automaton against the original black-box system. Our method, ProbBBC, refines the conventional BBC approach by (1) employing an active Markov Decision Process (MDP) learning method during the learning phase, (2) incorporating probabilistic model checking in the synthesis phase, and (3) applying statistical hypothesis testing in the validation phase. ProbBBC uniquely integrates these techniques rather than merely substituting each method in the traditional BBC; for instance, the statistical hypothesis testing and the MDP learning procedure exchange information regarding the black-box system’s observation with one another. The experiment results suggest that ProbBBC outperforms an existing method, especially for systems with limited observation.

我们介绍了一种新的方法来测试随机黑盒系统，经常在嵌入式系统中遇到。我们的方法增强了既定的黑盒检查(BBC)技术来解决随机行为。传统的BBC主要涉及通过执行以下三个阶段来迭代地识别违反系统规范的输入:学习阶段，构建一个近似黑箱行为的自动机;综合阶段，从学习的自动机中识别候选反例;验证阶段，根据原始黑箱系统验证获得的候选反例和学习的自动机。我们的方法ProbBBC通过(1)在学习阶段采用主动马尔可夫决策过程(MDP)学习方法来改进传统的BBC方法，(2)在综合阶段结合概率模型检查，(3)在验证阶段应用统计假设检验。ProbBBC独特地整合了这些技术，而不仅仅是取代传统BBC中的每种方法;例如，统计假设检验和MDP学习过程相互交换关于黑箱系统观察的信息。实验结果表明，ProbBBC优于现有的方法，特别是对于有限观测的系统。

{"title":"Probabilistic Black-Box Checking via Active MDP Learning","authors":"Junya Shijubo, Masaki Waga, Kohei Suenaga","doi":"10.1145/3609127","DOIUrl":"https://doi.org/10.1145/3609127","url":null,"abstract":"We introduce a novel methodology for testing stochastic black-box systems, frequently encountered in embedded systems. Our approach enhances the established black-box checking (BBC) technique to address stochastic behavior. Traditional BBC primarily involves iteratively identifying an input that breaches the system’s specifications by executing the following three phases: the learning phase to construct an automaton approximating the black box’s behavior, the synthesis phase to identify a candidate counterexample from the learned automaton, and the validation phase to validate the obtained candidate counterexample and the learned automaton against the original black-box system. Our method, ProbBBC, refines the conventional BBC approach by (1) employing an active Markov Decision Process (MDP) learning method during the learning phase, (2) incorporating probabilistic model checking in the synthesis phase, and (3) applying statistical hypothesis testing in the validation phase. ProbBBC uniquely integrates these techniques rather than merely substituting each method in the traditional BBC; for instance, the statistical hypothesis testing and the MDP learning procedure exchange information regarding the black-box system’s observation with one another. The experiment results suggest that ProbBBC outperforms an existing method, especially for systems with limited observation.","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136192831","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Rectifying Skewed Kernel Page Reclamation in Mobile Devices for Improving User-Perceivable Latency 纠正移动设备中倾斜的内核页面回收，以改善用户可感知的延迟

3区计算机科学 Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

ACM Transactions on Embedded Computing Systems

Pub Date : 2023-09-09 DOI: 10.1145/3607937

Yi-Quan Chou, Lin-Wei Shen, Li-Pin Chang

A crucial design factor for users of smart mobile devices is the latency of graphical interface interaction. Switching a background app to foreground is a frequent operation on mobile devices and the latency of this process is highly perceivable to users. Based on an Android smartphone, through analysis of memory reference generated during the app-switching process, we observe that file (virtual) pages and anonymous pages are both heavily involved. However, to our surprise, the amounts of the two types of pages in the main memory are highly imbalanced, and frequent I/O operations on file pages noticeably slows down the app-switching process. In this study, we advocate to improve the app-switching latency by rectifying the skewed kernel page reclaiming. Our approach involves two parts: proactive identification of unused anonymous pages and adaptive balance between file pages and anonymous pages. As mobile apps are found inflating their anonymous pages, we propose identifying unused anonymous pages in sync with the app-switching events. In addition, Android devices replaces the swap device with RAM-based zram, and swapping on zram is much faster than file accessing on flash storage. Without causing thrashing, we propose swapping out as many anonymous pages to zram as possible for caching more file pages. We conduct experiments on a Google Pixel phone with realistic user workloads, and results confirm that our method is adaptive to different memory requirements and greatly improves the app-switching latency by up to 43% compared with the original kernel.

对于智能移动设备的用户来说，一个重要的设计因素是图形界面交互的延迟。在移动设备上切换后台应用到前台是一个频繁的操作，这个过程的延迟对用户来说是高度可感知的。基于Android智能手机，通过分析应用切换过程中产生的内存引用，我们观察到文件(虚拟)页面和匿名页面都被大量涉及。然而，令我们惊讶的是，主内存中这两种类型页面的数量是高度不平衡的，频繁的文件页面I/O操作明显减慢了应用程序切换过程。在本研究中，我们主张通过纠正扭曲的内核页面回收来改善应用程序切换延迟。我们的方法包括两个部分:主动识别未使用的匿名页面以及文件页面和匿名页面之间的自适应平衡。当发现移动应用程序膨胀其匿名页面时，我们建议与应用程序切换事件同步识别未使用的匿名页面。此外，Android设备用基于ram的zram取代了交换设备，在zram上交换比在闪存上访问文件要快得多。在不引起抖动的情况下，我们建议将尽可能多的匿名页面交换到zram，以缓存更多的文件页面。我们在Google Pixel手机上进行了实际用户工作负载的实验，结果证实了我们的方法可以适应不同的内存需求，并且与原始内核相比，我们的方法大大提高了应用程序切换延迟，最高可提高43%。

{"title":"Rectifying Skewed Kernel Page Reclamation in Mobile Devices for Improving User-Perceivable Latency","authors":"Yi-Quan Chou, Lin-Wei Shen, Li-Pin Chang","doi":"10.1145/3607937","DOIUrl":"https://doi.org/10.1145/3607937","url":null,"abstract":"A crucial design factor for users of smart mobile devices is the latency of graphical interface interaction. Switching a background app to foreground is a frequent operation on mobile devices and the latency of this process is highly perceivable to users. Based on an Android smartphone, through analysis of memory reference generated during the app-switching process, we observe that file (virtual) pages and anonymous pages are both heavily involved. However, to our surprise, the amounts of the two types of pages in the main memory are highly imbalanced, and frequent I/O operations on file pages noticeably slows down the app-switching process. In this study, we advocate to improve the app-switching latency by rectifying the skewed kernel page reclaiming. Our approach involves two parts: proactive identification of unused anonymous pages and adaptive balance between file pages and anonymous pages. As mobile apps are found inflating their anonymous pages, we propose identifying unused anonymous pages in sync with the app-switching events. In addition, Android devices replaces the swap device with RAM-based zram, and swapping on zram is much faster than file accessing on flash storage. Without causing thrashing, we propose swapping out as many anonymous pages to zram as possible for caching more file pages. We conduct experiments on a Google Pixel phone with realistic user workloads, and results confirm that our method is adaptive to different memory requirements and greatly improves the app-switching latency by up to 43% compared with the original kernel.","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":"62 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136107349","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

ZPP: A Dynamic Technique to Eliminate Cache Pollution in NoC based MPSoCs 一种消除基于NoC的mpsoc缓存污染的动态技术

3区计算机科学 Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

ACM Transactions on Embedded Computing Systems

Pub Date : 2023-09-09 DOI: 10.1145/3609113

Dipika Deb, John Jose

Data prefetching efficiently reduces the memory access latency in NUCA architectures as the Last Level Cache (LLC) is shared and distributed across multiple cores. But cache pollution generated by prefetcher reduces its efficiency by causing contention for shared resources such as LLC and the underlying network. The paper proposes Zero Pollution Prefetcher (ZPP) that eliminates cache pollution for NUCA architecture. For this purpose, ZPP uses L1 prefetcher and places the prefetched blocks in the data locations of LLC where modified blocks are stored. Since modified blocks in LLC are stale and request for such blocks are served from the exclusively owned private cache, their space unnecessary consumes power to maintain such stale data in the cache. The benefits of ZPP are (a) Eliminates cache pollution in L1 and LLC by storing prefetched blocks in LLC locations where stale blocks are stored. (b) Insufficient cache space is solved by placing prefetched blocks in LLC as LLCs are larger in size than L1 cache. This helps in prefetching more cache blocks, thereby increasing prefetch aggressiveness. (c) Increasing prefetch aggressiveness increases its coverage. (d) It also maintains an equivalent lookup latency to L1 cache for prefetched blocks. Experimentally it has been found that ZPP increases weighted speedup by 2.19x as compared to a system with no prefetching while prefetch coverage and prefetch accuracy increases by 50%, and 12%, respectively compared to the baseline.1

数据预取有效地减少了NUCA架构中的内存访问延迟，因为最后一级缓存(LLC)是跨多个核心共享和分布的。但是由预取器产生的缓存污染会导致共享资源(如LLC和底层网络)的争用，从而降低其效率。提出了零污染预取器(Zero Pollution Prefetcher, ZPP)来消除NUCA架构中的缓存污染。为此，ZPP使用L1预取器，并将预取的块放置在LLC中存储修改块的数据位置。由于在LLC中修改的块是陈旧的，并且对这些块的请求是从独占的私有缓存中提供的，因此它们不必要的空间消耗了在缓存中维护这些陈旧数据的功率。ZPP的好处是:(a)通过将预取的块存储在存储陈旧块的LLC位置，消除L1和LLC中的缓存污染。(b)由于LLC的大小大于L1缓存，通过在LLC中放置预取块来解决缓存空间不足的问题。这有助于预取更多的缓存块，从而提高预取的主动性。(c)增加预取积极性增加其覆盖范围。(d)对于预取的块，它还保持与L1缓存相同的查找延迟。实验发现，与没有预取的系统相比，ZPP将加权加速提高了2.19倍，而预取覆盖率和预取精度分别比基线提高了50%和12%。1

{"title":"ZPP: A Dynamic Technique to Eliminate Cache Pollution in NoC based MPSoCs","authors":"Dipika Deb, John Jose","doi":"10.1145/3609113","DOIUrl":"https://doi.org/10.1145/3609113","url":null,"abstract":"Data prefetching efficiently reduces the memory access latency in NUCA architectures as the Last Level Cache (LLC) is shared and distributed across multiple cores. But cache pollution generated by prefetcher reduces its efficiency by causing contention for shared resources such as LLC and the underlying network. The paper proposes Zero Pollution Prefetcher (ZPP) that eliminates cache pollution for NUCA architecture. For this purpose, ZPP uses L1 prefetcher and places the prefetched blocks in the data locations of LLC where modified blocks are stored. Since modified blocks in LLC are stale and request for such blocks are served from the exclusively owned private cache, their space unnecessary consumes power to maintain such stale data in the cache. The benefits of ZPP are (a) Eliminates cache pollution in L1 and LLC by storing prefetched blocks in LLC locations where stale blocks are stored. (b) Insufficient cache space is solved by placing prefetched blocks in LLC as LLCs are larger in size than L1 cache. This helps in prefetching more cache blocks, thereby increasing prefetch aggressiveness. (c) Increasing prefetch aggressiveness increases its coverage. (d) It also maintains an equivalent lookup latency to L1 cache for prefetched blocks. Experimentally it has been found that ZPP increases weighted speedup by 2.19x as compared to a system with no prefetching while prefetch coverage and prefetch accuracy increases by 50%, and 12%, respectively compared to the baseline.1","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136108463","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Verified Compilation of Synchronous Dataflow with State Machines 状态机同步数据流的验证编译

3区计算机科学 Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

ACM Transactions on Embedded Computing Systems

Pub Date : 2023-09-09 DOI: 10.1145/3608102

Timothy Bourke, Basile Pesin, Marc Pouzet

Safety-critical embedded software is routinely programmed in block-diagram languages. Recent work in the Vélus project specifies such a language and its compiler in the Coq proof assistant. It builds on the CompCert verified C compiler to give an end-to-end proof linking the dataflow semantics of source programs to traces of the generated assembly code. We extend this work with switched blocks, shared variables, reset blocks, and state machines; define a relational semantics to integrate these block- and mode-based constructions into the existing stream-based model; adapt the standard source-to-source rewriting scheme to compile the new constructions; and reestablish the correctness theorem.

安全关键型嵌入式软件通常是用框图语言编写的。vsamus项目最近的工作在Coq证明助手中指定了这样一种语言及其编译器。它建立在经过CompCert验证的C编译器上，提供端到端的证明，将源程序的数据流语义链接到生成的汇编代码的跟踪。我们用交换块、共享变量、重置块和状态机扩展了这项工作;定义一个关系语义，将这些基于块和模型的结构集成到现有的基于流的模型中;采用标准的源到源重写方案来编译新结构;重新建立正确性定理。

引用次数: 1

Florets for Chiplets: Data Flow-aware High-Performance and Energy-efficient Network-on-Interposer for CNN Inference Tasks 面向小芯片的小花:用于CNN推理任务的数据流感知的高性能节能网络-中介器

3区计算机科学 Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

ACM Transactions on Embedded Computing Systems

Pub Date : 2023-09-09 DOI: 10.1145/3608098

Harsh Sharma, Lukas Pfromm, Rasit Onur Topaloglu, Janardhan Rao Doppa, Umit Y. Ogras, Ananth Kalyanraman, Partha Pratim Pande

Recent advances in 2.5D chiplet platforms provide a new avenue for compact scale-out implementations of emerging compute- and data-intensive applications including machine learning. Network-on-Interposer (NoI) enables integration of multiple chiplets on a 2.5D system. While these manycore platforms can deliver high computational throughput and energy efficiency by running multiple specialized tasks concurrently, conventional NoI architectures have a limited computational throughput due to their inherent multi-hop topologies. In this paper, we propose Floret, a novel NoI architecture based on space-filling curves (SFCs). The Floret architecture leverages suitable task mapping, exploits the data flow pattern, and optimizes the inter-chiplet data exchange to extract high performance for multiple types of convolutional neural network (CNN) inference tasks running concurrently. We demonstrate that the Floret architecture reduces the latency and energy up to 58% and 64%, respectively, compared to state-of-the-art NoI architectures while executing datacenter-scale workloads involving multiple CNN tasks simultaneously. Floret achieves high performance and significant energy savings with much lower fabrication cost by exploiting the data-flow awareness of the CNN inference tasks.

2.5D芯片平台的最新进展为新兴的计算和数据密集型应用(包括机器学习)的紧凑扩展实现提供了新的途径。中间层网络(NoI)可以在2.5D系统上集成多个小芯片。虽然这些多核平台可以通过并发运行多个专门任务来提供高计算吞吐量和能源效率，但传统的NoI架构由于其固有的多跳拓扑结构而具有有限的计算吞吐量。在本文中，我们提出了一种基于空间填充曲线(sfc)的新型NoI架构小花。小花架构利用适当的任务映射，利用数据流模式，优化芯片间数据交换，为并发运行的多种类型卷积神经网络(CNN)推理任务提取高性能。我们证明，在同时执行涉及多个CNN任务的数据中心规模的工作负载时，与最先进的NoI架构相比，小花架构将延迟和能量分别降低了58%和64%。Floret通过利用CNN推理任务的数据流感知，实现了高性能和显著的节能，并且降低了制造成本。

{"title":"Florets for Chiplets: Data Flow-aware High-Performance and Energy-efficient Network-on-Interposer for CNN Inference Tasks","authors":"Harsh Sharma, Lukas Pfromm, Rasit Onur Topaloglu, Janardhan Rao Doppa, Umit Y. Ogras, Ananth Kalyanraman, Partha Pratim Pande","doi":"10.1145/3608098","DOIUrl":"https://doi.org/10.1145/3608098","url":null,"abstract":"Recent advances in 2.5D chiplet platforms provide a new avenue for compact scale-out implementations of emerging compute- and data-intensive applications including machine learning. Network-on-Interposer (NoI) enables integration of multiple chiplets on a 2.5D system. While these manycore platforms can deliver high computational throughput and energy efficiency by running multiple specialized tasks concurrently, conventional NoI architectures have a limited computational throughput due to their inherent multi-hop topologies. In this paper, we propose Floret, a novel NoI architecture based on space-filling curves (SFCs). The Floret architecture leverages suitable task mapping, exploits the data flow pattern, and optimizes the inter-chiplet data exchange to extract high performance for multiple types of convolutional neural network (CNN) inference tasks running concurrently. We demonstrate that the Floret architecture reduces the latency and energy up to 58% and 64%, respectively, compared to state-of-the-art NoI architectures while executing datacenter-scale workloads involving multiple CNN tasks simultaneously. Floret achieves high performance and significant energy savings with much lower fabrication cost by exploiting the data-flow awareness of the CNN inference tasks.","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136108722","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Overflow-free Compute Memories for Edge AI Acceleration 边缘AI加速的无溢出计算内存

3区计算机科学 Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

ACM Transactions on Embedded Computing Systems

Pub Date : 2023-09-09 DOI: 10.1145/3609387

Flavio Ponzina, Marco Rios, Alexandre Levisse, Giovanni Ansaloni, David Atienza

Compute memories are memory arrays augmented with dedicated logic to support arithmetic. They support the efficient execution of data-centric computing patterns, such as those characterizing Artificial Intelligence (AI) algorithms. These architectures can provide computing capabilities as part of the memory array structures (In-Memory Computing, IMC) or at their immediate periphery (Near-Memory Computing, NMC). By bringing the processing elements inside (or very close to) storage, compute memories minimize the cost of data access. Moreover, highly parallel (and, hence, high-performance) computations are enabled by exploiting the regular structure of memory arrays. However, the regular layout of memory elements also constrains the data range of inputs and outputs, since the bitwidths of operands and results stored at each address cannot be freely varied. Addressing this challenge, we herein propose a HW/SW co-design methodology combining careful per-layer quantization and inter-layer scaling with lightweight hardware support for overflow-free computation of dot-vector operations. We demonstrate their use to implement the convolutional and fully connected layers of AI models. We embody our strategy in two implementations, based on IMC and NMC, respectively. Experimental results highlight that an area overhead of only 10.5% (for IMC) and 12.9% (for NMC) is required when interfacing with a 2KB subarray. Furthermore, inferences on benchmark CNNs show negligible accuracy degradation due to quantization for equivalent floating-point implementations.

计算存储器是扩充了专用逻辑以支持算术的存储器阵列。它们支持以数据为中心的计算模式的有效执行，例如那些表征人工智能(AI)算法的模式。这些架构可以提供计算能力作为内存阵列结构的一部分(In-Memory computing, IMC)或在其直接外围(Near-Memory computing, NMC)。通过将处理元素置于(或非常接近)存储器中，计算存储器可以将数据访问的成本降至最低。此外，通过利用内存阵列的规则结构，可以实现高度并行(因此是高性能)的计算。然而，内存元素的常规布局也限制了输入和输出的数据范围，因为存储在每个地址的操作数和结果的位宽不能自由改变。为了应对这一挑战，我们在此提出了一种硬件/软件协同设计方法，将谨慎的逐层量化和层间缩放与轻量级硬件支持相结合，以实现点向量操作的无溢出计算。我们演示了它们在实现人工智能模型的卷积和完全连接层中的应用。我们以两种实现方式体现我们的战略，分别基于IMC和NMC。实验结果表明，当与2KB子阵列接口时，仅需要10.5%(对于IMC)和12.9%(对于NMC)的面积开销。此外，对基准cnn的推断显示，由于等效浮点实现的量化，精度下降可以忽略不计。

{"title":"Overflow-free Compute Memories for Edge AI Acceleration","authors":"Flavio Ponzina, Marco Rios, Alexandre Levisse, Giovanni Ansaloni, David Atienza","doi":"10.1145/3609387","DOIUrl":"https://doi.org/10.1145/3609387","url":null,"abstract":"Compute memories are memory arrays augmented with dedicated logic to support arithmetic. They support the efficient execution of data-centric computing patterns, such as those characterizing Artificial Intelligence (AI) algorithms. These architectures can provide computing capabilities as part of the memory array structures (In-Memory Computing, IMC) or at their immediate periphery (Near-Memory Computing, NMC). By bringing the processing elements inside (or very close to) storage, compute memories minimize the cost of data access. Moreover, highly parallel (and, hence, high-performance) computations are enabled by exploiting the regular structure of memory arrays. However, the regular layout of memory elements also constrains the data range of inputs and outputs, since the bitwidths of operands and results stored at each address cannot be freely varied. Addressing this challenge, we herein propose a HW/SW co-design methodology combining careful per-layer quantization and inter-layer scaling with lightweight hardware support for overflow-free computation of dot-vector operations. We demonstrate their use to implement the convolutional and fully connected layers of AI models. We embody our strategy in two implementations, based on IMC and NMC, respectively. Experimental results highlight that an area overhead of only 10.5% (for IMC) and 12.9% (for NMC) is required when interfacing with a 2KB subarray. Furthermore, inferences on benchmark CNNs show negligible accuracy degradation due to quantization for equivalent floating-point implementations.","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136108728","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Consistency vs. Availability in Distributed Cyber-Physical Systems 分布式信息物理系统中的一致性与可用性

3区计算机科学 Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

ACM Transactions on Embedded Computing Systems

Pub Date : 2023-09-09 DOI: 10.1145/3609119

Edward A. Lee, Ravi Akella, Soroush Bateni, Shaokai Lin, Marten Lohstroh, Christian Menard

In distributed applications, Brewer’s CAP theorem tells us that when networks become partitioned (P), one must give up either consistency (C) or availability (A). Consistency is agreement on the values of shared variables; availability is the ability to respond to reads and writes accessing those shared variables. Availability is a real-time property whereas consistency is a logical property. We extend consistency and availability to refer to cyber-physical properties such as the state of the physical system and delays in actuation. We have further extended the CAP theorem to relate quantitative measures of these two properties to quantitative measures of communication and computation latency (L), obtaining a relation called the CAL theorem that is linear in a max-plus algebra. This paper shows how to use the CAL theorem in various ways to help design cyber-physical systems. We develop a methodology for systematically trading off availability and consistency in application-specific ways and to guide the system designer when putting functionality in end devices, in edge computers, or in the cloud. We build on the Lingua Franca coordination language to provide system designers with concrete analysis and design tools to make the required tradeoffs in deployable embedded software.

在分布式应用程序中，Brewer的CAP定理告诉我们，当网络被分割(P)时，必须放弃一致性(C)或可用性(A)。一致性是对共享变量值的一致;可用性是对访问这些共享变量的读和写作出响应的能力。可用性是实时属性，而一致性是逻辑属性。我们将一致性和可用性扩展到网络物理属性，如物理系统的状态和驱动延迟。我们进一步扩展了CAP定理，将这两个性质的定量度量与通信和计算延迟(L)的定量度量联系起来，得到了一个称为CAL定理的关系，它在max-plus代数中是线性的。本文展示了如何以各种方式使用CAL定理来帮助设计网络物理系统。我们开发了一种方法，以特定于应用程序的方式系统地权衡可用性和一致性，并指导系统设计人员在将功能放入终端设备、边缘计算机或云中。我们以Lingua Franca协调语言为基础，为系统设计人员提供具体的分析和设计工具，以便在可部署的嵌入式软件中进行所需的权衡。

{"title":"Consistency vs. Availability in Distributed Cyber-Physical Systems","authors":"Edward A. Lee, Ravi Akella, Soroush Bateni, Shaokai Lin, Marten Lohstroh, Christian Menard","doi":"10.1145/3609119","DOIUrl":"https://doi.org/10.1145/3609119","url":null,"abstract":"In distributed applications, Brewer’s CAP theorem tells us that when networks become partitioned (P), one must give up either consistency (C) or availability (A). Consistency is agreement on the values of shared variables; availability is the ability to respond to reads and writes accessing those shared variables. Availability is a real-time property whereas consistency is a logical property. We extend consistency and availability to refer to cyber-physical properties such as the state of the physical system and delays in actuation. We have further extended the CAP theorem to relate quantitative measures of these two properties to quantitative measures of communication and computation latency (L), obtaining a relation called the CAL theorem that is linear in a max-plus algebra. This paper shows how to use the CAL theorem in various ways to help design cyber-physical systems. We develop a methodology for systematically trading off availability and consistency in application-specific ways and to guide the system designer when putting functionality in end devices, in edge computers, or in the cloud. We build on the Lingua Franca coordination language to provide system designers with concrete analysis and design tools to make the required tradeoffs in deployable embedded software.","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136191886","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 1

Kryptonite: Worst-Case Program Interference Estimation on Multi-Core Embedded Systems 多核嵌入式系统的最坏情况程序干扰估计

3区计算机科学 Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

ACM Transactions on Embedded Computing Systems

Pub Date : 2023-09-09 DOI: 10.1145/3609128

Nikhilesh Singh, Karthikeyan Renganathan, Chester Rebeiro, Jithin Jose, Ralph Mader

Due to the low costs and energy needed, cyber-physical systems are adopting multi-core processors for their embedded computing requirements. In order to guarantee safety when the application has real-time constraints, a critical requirement is to estimate the worst-case interference from other executing programs. However, the complexity of multi-core hardware inhibits precisely determining the Worst-Case Program Interference. Existing solutions are either prone to overestimate the interference or are not scalable to different hardware sizes and designs. In this paper we present Kryptonite , an automated framework to synthesize Worst-Case Program Interference (WCPI) environments for multi-core systems. Fundamental to Kryptonite is a set of tiny hardware-specific code gadgets that are crafted to maximize interference locally. The gadgets are arranged using a greedy approach and then molded using a Reinforcement Learning algorithm to create the WCPI environment. We demonstrate Kryptonite on the automotive grade Infineon AURIX TC399 processor with a wide range of programs that includes a commercial real-time automotive application. We show that, while being easily scalable and tunable, Kryptonite creates WCPI environments increasing the runtime by up to 58% for benchmark applications and 26% for the automotive application.

由于低成本和能源需求，网络物理系统正在采用多核处理器来满足其嵌入式计算需求。当应用程序有实时限制时，为了保证安全性，一个关键的要求是估计来自其他执行程序的最坏情况干扰。然而，多核硬件的复杂性阻碍了最坏情况下程序干扰的精确确定。现有的解决方案要么容易高估干扰，要么不能扩展到不同的硬件尺寸和设计。在本文中，我们提出了Kryptonite，一个自动合成多核系统的最坏情况程序干扰(WCPI)环境的框架。Kryptonite的基础是一组特定于硬件的小型代码装置，它们被精心设计以最大限度地干扰本地。这些小工具使用贪婪方法进行排列，然后使用强化学习算法进行建模，以创建WCPI环境。我们在汽车级英飞凌AURIX TC399处理器上展示了Kryptonite，该处理器具有广泛的程序，包括商业实时汽车应用程序。我们表明，Kryptonite创建的WCPI环境虽然易于扩展和调优，但可将基准应用程序的运行时提高58%，将汽车应用程序的运行时提高26%。

{"title":"Kryptonite: Worst-Case Program Interference Estimation on Multi-Core Embedded Systems","authors":"Nikhilesh Singh, Karthikeyan Renganathan, Chester Rebeiro, Jithin Jose, Ralph Mader","doi":"10.1145/3609128","DOIUrl":"https://doi.org/10.1145/3609128","url":null,"abstract":"Due to the low costs and energy needed, cyber-physical systems are adopting multi-core processors for their embedded computing requirements. In order to guarantee safety when the application has real-time constraints, a critical requirement is to estimate the worst-case interference from other executing programs. However, the complexity of multi-core hardware inhibits precisely determining the Worst-Case Program Interference. Existing solutions are either prone to overestimate the interference or are not scalable to different hardware sizes and designs. In this paper we present Kryptonite , an automated framework to synthesize Worst-Case Program Interference (WCPI) environments for multi-core systems. Fundamental to Kryptonite is a set of tiny hardware-specific code gadgets that are crafted to maximize interference locally. The gadgets are arranged using a greedy approach and then molded using a Reinforcement Learning algorithm to create the WCPI environment. We demonstrate Kryptonite on the automotive grade Infineon AURIX TC399 processor with a wide range of programs that includes a commercial real-time automotive application. We show that, while being easily scalable and tunable, Kryptonite creates WCPI environments increasing the runtime by up to 58% for benchmark applications and 26% for the automotive application.","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":"62 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136192425","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

FSIMR: File-system-aware Data Management for Interlaced Magnetic Recording 交错磁记录的文件系统感知数据管理

3区计算机科学 Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

ACM Transactions on Embedded Computing Systems

Pub Date : 2023-09-09 DOI: 10.1145/3607922

Yi-Han Lien, Yen-Ting Chen, Yuan-Hao Chang, Yu-Pei Liang, Wei-Kuan Shih

Interlaced Magnetic Recording (IMR) is an emerging recording technology for hard-disk drives (HDDs) that provides larger storage capacity at a lower cost. By partially overlapping (interlacing) each bottom track with two adjacent top tracks, IMR-based HDDs successfully increase the data density while incurring some hardware write constraints. To update each bottom track, the data on two adjacent top tracks must be read and rewritten to avoid losing their valid data, resulting in additional overhead for performing read-modify-write (RMW) operations. Therefore, researchers have proposed various data management schemes to mitigate such overhead in recent years, aiming at improving the write performance. However, these designs have not taken into account the data characteristics of the file system, which is a crucial layer of operating systems for storing/retrieving data into/from HDDs. Consequently, the write performance improvement is limited due to the unawareness of spatial locality and hotness of data. This paper proposes a file-system-aware data management scheme called FSIMR to improve system write performance. Noticing that data of the same directory may have higher spatial locality and are mostly updated at the same time, FSIMR logically partitions the IMR-based HDD into fixed-sized zones; data belonging to the same directory will be arranged to one zone to reduce the time of seeking to-be-updated data (seek time). Furthermore, cold data within a zone are arranged to bottom tracks and updated in an out-of-place manner to eliminate RMW operations. Our experimental results show that the proposed FSIMR could reduce the seek time by up to 14% without introducing additional RMW operations, compared to existing designs.

交错磁记录(IMR)是一种新兴的用于硬盘驱动器(hdd)的记录技术，它以更低的成本提供更大的存储容量。通过将每个底部磁道与两个相邻的顶部磁道部分重叠(交错)，基于imr的hdd成功地增加了数据密度，同时产生了一些硬件写入约束。为了更新每个底部磁道，必须读取和重写两个相邻顶部磁道上的数据，以避免丢失它们的有效数据，从而导致执行读-修改-写(RMW)操作的额外开销。因此，近年来研究人员提出了各种数据管理方案来减轻这种开销，旨在提高写性能。然而，这些设计并没有考虑到文件系统的数据特性，而文件系统是操作系统中用于向hdd中存储/检索数据的关键层。因此，由于不知道数据的空间局域性和热度，写性能的提高受到限制。为了提高系统的写性能，本文提出了一种文件系统感知的数据管理方案FSIMR。注意到同一目录的数据可能具有更高的空间局部性，并且大多同时更新，FSIMR在逻辑上将基于imr的HDD划分为固定大小的区域;属于同一目录的数据将被安排到一个区域，以减少查找待更新数据的时间(查找时间)。此外，区域内的冷数据被安排到底部轨道，并以异地方式更新，以消除RMW操作。我们的实验结果表明，与现有设计相比，所提出的FSIMR可以在不引入额外RMW操作的情况下减少高达14%的寻道时间。

{"title":"FSIMR: File-system-aware Data Management for Interlaced Magnetic Recording","authors":"Yi-Han Lien, Yen-Ting Chen, Yuan-Hao Chang, Yu-Pei Liang, Wei-Kuan Shih","doi":"10.1145/3607922","DOIUrl":"https://doi.org/10.1145/3607922","url":null,"abstract":"Interlaced Magnetic Recording (IMR) is an emerging recording technology for hard-disk drives (HDDs) that provides larger storage capacity at a lower cost. By partially overlapping (interlacing) each bottom track with two adjacent top tracks, IMR-based HDDs successfully increase the data density while incurring some hardware write constraints. To update each bottom track, the data on two adjacent top tracks must be read and rewritten to avoid losing their valid data, resulting in additional overhead for performing read-modify-write (RMW) operations. Therefore, researchers have proposed various data management schemes to mitigate such overhead in recent years, aiming at improving the write performance. However, these designs have not taken into account the data characteristics of the file system, which is a crucial layer of operating systems for storing/retrieving data into/from HDDs. Consequently, the write performance improvement is limited due to the unawareness of spatial locality and hotness of data. This paper proposes a file-system-aware data management scheme called FSIMR to improve system write performance. Noticing that data of the same directory may have higher spatial locality and are mostly updated at the same time, FSIMR logically partitions the IMR-based HDD into fixed-sized zones; data belonging to the same directory will be arranged to one zone to reduce the time of seeking to-be-updated data (seek time). Furthermore, cold data within a zone are arranged to bottom tracks and updated in an out-of-place manner to eliminate RMW operations. Our experimental results show that the proposed FSIMR could reduce the seek time by up to 14% without introducing additional RMW operations, compared to existing designs.","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136192592","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0