
Latest Publications in ACM Transactions on Embedded Computing Systems

ViT4Mal: Lightweight Vision Transformer for Malware Detection on Edge Devices
CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2023-09-09 | DOI: 10.1145/3609112
Akshara Ravi, Vivek Chaturvedi, Muhammad Shafique
There has been tremendous growth in edge devices connected to the network in recent years. Although these devices make our lives simpler and smarter, they must perform computations under severe resource and energy constraints while remaining vulnerable to malware attacks. Once compromised, these devices are further exploited as attack vectors targeting critical infrastructure. Most existing malware detection solutions are resource- and compute-intensive and hence perform poorly at protecting edge devices. In this paper, we propose a novel approach, ViT4Mal, that utilizes a lightweight vision transformer (ViT) for malware detection on an edge device. ViT4Mal first converts executable byte-code into images to learn malware features and then uses a customized lightweight ViT to detect malware with high accuracy. We have performed extensive experiments to compare our model with state-of-the-art CNNs in the malware detection domain. Experimental results corroborate that ViTs do not demand deeper networks to achieve accuracy of around 97%, comparable to heavily structured CNN models. We have also deployed our proposed lightweight ViT4Mal model on the Xilinx PYNQ Z1 FPGA board by applying specialized hardware optimizations such as quantization, loop pipelining, and array partitioning. ViT4Mal achieved an accuracy of ~94% and a 41x speedup compared to the original ViT model.
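The byte-code-to-image step can be sketched as follows: raw executable bytes are packed row by row into a 2D grayscale array, the common preprocessing in image-based malware detection. The row width and zero-padding below are illustrative assumptions, not ViT4Mal's exact pipeline.

```python
import math

def bytes_to_image(byte_code: bytes, width: int = 16):
    """Pack raw executable bytes into a 2D grayscale 'image' (list of rows).

    Each byte (0-255) becomes one pixel; the tail is zero-padded so the
    stream fills a whole number of rows. The row width is an illustrative
    choice, not the paper's exact preprocessing.
    """
    height = math.ceil(len(byte_code) / width)
    padded = byte_code + b"\x00" * (height * width - len(byte_code))
    return [list(padded[r * width:(r + 1) * width]) for r in range(height)]

# 48 bytes of a toy 'executable' (the MZ header bytes repeated)
image = bytes_to_image(b"\x4d\x5a\x90\x00" * 12, width=16)
```

The resulting matrix can then be fed to any image classifier, here a ViT.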
Citations: 1
CABARRE: Request Response Arbitration for Shared Cache Management
CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2023-09-09 | DOI: 10.1145/3608096
Garima Modi, Aritra Bagchi, Neetu Jindal, Ayan Mandal, Preeti Ranjan Panda
Modern multi-processor systems-on-chip (MPSoCs) are characterized by caches shared by multiple cores. These shared caches receive requests issued by the processor cores. Requests that are subject to cache misses may result in the generation of responses. These responses are received from the lower level of the memory hierarchy and written to the cache. The outstanding requests and responses contend for the shared cache bandwidth. To mitigate the impact of the cache bandwidth contention on the overall system performance, an efficient request and response arbitration policy is needed. Research on shared cache management has neglected the additional cache contention caused by responses, which are written to the cache. We propose CABARRE, a novel request and response arbitration policy at shared caches, to improve the overall system performance. CABARRE shows a performance improvement of 23% on average across a set of SPEC workloads compared to straightforward adaptations of state-of-the-art solutions.
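To make the request/response contention concrete, here is a minimal arbiter that grants the shared cache port each cycle to either a pending core request or a pending fill response. The fair-alternation policy is an assumption for illustration only; CABARRE's actual arbitration policy is more sophisticated.

```python
from collections import deque

class CacheArbiter:
    """Toy arbiter for shared-cache bandwidth: each cycle, grant either a
    pending core request or a pending memory response (a cache fill).
    This sketch simply alternates when both queues have work, so neither
    starves; it is not CABARRE's policy."""

    def __init__(self):
        self.requests = deque()
        self.responses = deque()
        self._last = "response"  # so the first grant favours requests

    def grant(self):
        # Alternate when both queues are non-empty; otherwise serve whichever has work.
        if self.requests and (not self.responses or self._last == "response"):
            self._last = "request"
            return ("request", self.requests.popleft())
        if self.responses:
            self._last = "response"
            return ("response", self.responses.popleft())
        return None  # shared cache port idle this cycle

arb = CacheArbiter()
arb.requests.extend(["R1", "R2"])
arb.responses.append("P1")
order = [arb.grant() for _ in range(3)]
```

A real policy would weigh criticality, bank conflicts, and miss-status registers rather than simple alternation.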
Citations: 0
Hephaestus: Codesigning and Automating 3D Image Registration on Reconfigurable Architectures
CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2023-09-09 | DOI: 10.1145/3607928
Giuseppe Sorrentino, Marco Venere, Davide Conficconi, Eleonora D’Arnese, Marco Domenico Santambrogio
Healthcare is a pivotal research field, and medical imaging is crucial in many applications. Therefore, finding new architectural and algorithmic solutions would benefit highly repetitive image processing procedures. One of the most complex tasks in this sense is image registration, which finds the optimal geometric alignment among 3D image stacks and is widely employed in healthcare and robotics. Given the high computational demand of such a procedure, hardware accelerators are promising real-time and energy-efficient solutions, but they are complex to design and integrate within software pipelines. Therefore, this work presents an automation framework called Hephaestus that generates efficient 3D image registration pipelines combined with reconfigurable accelerators. Moreover, to alleviate the burden on the software, we codesign software-programmable accelerators that can adapt at run-time to the image volume dimensions. Hephaestus features a cross-platform abstraction layer that enables transparent deployment on high-performance and embedded systems. However, given the computational complexity of 3D image registration, memory-constrained embedded devices are a particularly demanding setting; they require further attention and tailoring of the accelerators and the registration application to reach satisfactory results. Therefore, with Hephaestus, we also propose an approximation mechanism that enables such devices to perform 3D image registration and even achieve, in some cases, the accuracy of the high-performance ones. Overall, Hephaestus demonstrates a maximum speedup of 1.85× and an efficiency improvement of 2.35× with respect to the state of the art, and a maximum speedup of 2.51× and an efficiency improvement of 2.76× against our software baseline, while attaining state-of-the-art accuracy on 3D registrations.
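As a rough illustration of the registration task itself (not of Hephaestus' accelerators), the sketch below exhaustively searches integer 3D translations of a sparse "moving" volume to minimize the sum of squared differences against a "fixed" volume. Real pipelines use far richer transforms, similarity metrics, and optimizers; the volumes and search range here are toy assumptions.

```python
def ssd(vol_a, vol_b):
    """Sum of squared differences between two sparse volumes
    (dicts mapping (z, y, x) voxel coordinates to intensities)."""
    keys = set(vol_a) | set(vol_b)
    return sum((vol_a.get(k, 0.0) - vol_b.get(k, 0.0)) ** 2 for k in keys)

def register_translation(fixed, moving, max_shift=2):
    """Exhaustive search over integer 3D translations of `moving`,
    returning the shift that minimizes SSD against `fixed`."""
    best, best_shift = float("inf"), (0, 0, 0)
    for dz in range(-max_shift, max_shift + 1):
        for dy in range(-max_shift, max_shift + 1):
            for dx in range(-max_shift, max_shift + 1):
                shifted = {(z + dz, y + dy, x + dx): v
                           for (z, y, x), v in moving.items()}
                cost = ssd(fixed, shifted)
                if cost < best:
                    best, best_shift = cost, (dz, dy, dx)
    return best_shift

fixed = {(3, 3, 3): 1.0, (4, 3, 3): 0.5}    # toy volume: two bright voxels
moving = {(4, 3, 5): 1.0, (5, 3, 5): 0.5}   # same volume shifted by (1, 0, 2)
shift = register_translation(fixed, moving)
```

The cubic search space is exactly the kind of highly repetitive computation that motivates hardware acceleration.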
Citations: 0
BitSET: Bit-Serial Early Termination for Computation Reduction in Convolutional Neural Networks
CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2023-09-09 | DOI: 10.1145/3609093
Yunjie Pan, Jiecao Yu, Andrew Lukefahr, Reetuparna Das, Scott Mahlke
Convolutional Neural Networks (CNNs) have demonstrated remarkable performance across a wide range of machine learning tasks. However, the high accuracy usually comes at the cost of substantial computation and energy consumption, making them difficult to deploy on mobile and embedded devices. In CNNs, the compute-intensive convolutional layers are usually followed by a ReLU activation layer, which clamps negative outputs to zero, resulting in large activation sparsity. By exploiting such sparsity in CNN models, we propose a software-hardware co-design, BitSET, that aggressively saves energy during CNN inference. The bit-serial BitSET accelerator adopts a prediction-based bit-level early termination technique that terminates the ineffectual computation of negative outputs early. To assist the algorithm, we propose a novel weight encoding that allows more accurate predictions with fewer bits. BitSET leverages the bit-level computation reduction both in the predictive early termination algorithm and in the non-predictive, energy-efficient bit-serial architecture. Compared to UNPU, an energy-efficient bit-serial CNN accelerator, BitSET yields an average 1.5× speedup and 1.4× energy efficiency improvement with no accuracy loss, due to a 48% reduction in bit-level computations. Relaxing the allowed accuracy loss to 1% increases the gains to an average 1.6× speedup and 1.4× energy efficiency improvement.
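The early-termination idea can be sketched in software: process activation bit planes MSB-first and stop as soon as the running sum, plus the largest possible contribution of the remaining lower bits, is still negative, because ReLU would clamp the output to zero anyway. This is a simplified, exact-bound check; BitSET's predictor and its weight encoding are not reproduced here.

```python
def relu_dot_bitserial(weights, activations, bits=8):
    """Bit-serial dot product followed by ReLU, with early termination.

    Activations are unsigned `bits`-bit integers processed one bit plane
    at a time, MSB first. If the partial sum plus the maximum value the
    remaining lower planes could add is still negative, the ReLU output
    is provably zero and the computation stops early.
    Returns (relu_output, bit_planes_processed)."""
    pos_weight_sum = sum(w for w in weights if w > 0)
    partial, planes_used = 0, 0
    for b in range(bits - 1, -1, -1):
        plane = sum(w for w, a in zip(weights, activations) if (a >> b) & 1)
        partial += plane << b
        planes_used += 1
        # Largest value the remaining (lower) bit planes can still add:
        remaining_max = ((1 << b) - 1) * pos_weight_sum
        if partial + remaining_max < 0:
            return 0, planes_used  # guaranteed negative -> ReLU outputs 0
    return max(partial, 0), planes_used

# Strongly negative weights: terminates after one of the 8 bit planes.
out, used = relu_dot_bitserial([-5, -3, 1], [200, 180, 10])
```

In hardware, skipping the remaining planes saves the corresponding bit-serial multiply-accumulate energy.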
Citations: 0
A Constructive State-based Semantics and Interpreter for a Synchronous Data-flow Language with State Machines
CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2023-09-09 | DOI: 10.1145/3609131
Jean-Louis Colaço, Michael Mendler, Baptiste Pauget, Marc Pouzet
Scade is a domain-specific synchronous functional language that has been used to implement safety-critical real-time software for more than twenty years. Two main approaches have been considered for its semantics: (i) an indirect collapsing semantics based on a source-to-source translation of high-level constructs into a data-flow core language whose semantics is precisely specified and is the entry point for code generation; (ii) a relational synchronous semantics, akin to Esterel, that applies directly to the source. The latter defines what a valid synchronous reaction is but hides, on purpose, whether a semantics exists, is unique, and can be computed; hence, it is not executable. This paper presents, for the first time, an executable, state-based semantics for a language that has the key constructs of Scade all together, in particular the arbitrary combination of data-flow equations and hierarchical state machines. It can apply directly to the source language before static checks and compilation steps. It is constructive in the sense that the language in which the semantics is defined is a statically typed functional language with call-by-value and strong normalization; e.g., it is expressible in a proof assistant where all functions terminate. This leads to a reference, purely functional interpreter. The semantics is modular and can account for possible errors, allowing one to establish what property is ensured by each static verification performed by the compiler. It also clarifies how causality is treated in Scade compared with Esterel. This semantics can serve as an oracle for compiler testing and validation; to prototype novel language constructs before they are implemented; to execute possibly unfinished models, or models that are correct but rejected by the compiler; and to prove the correctness of compilation steps. The semantics given in the paper is implemented as an interpreter in a purely functional style, in OCaml.
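A flavor of the state-based style: each stream operator denotes a transition function from (state, input) to (output, next state), shown here for an initialized delay (Lustre/Scade's `fby`). This Python sketch only hints at the style of the paper's OCaml interpreter; the encoding of nodes as (step function, initial state) pairs is an illustrative assumption.

```python
def fby(init):
    """State-based semantics of the initialized delay `init fby x`:
    emit the previous value of the stream, starting with `init`.
    A node is represented as a (transition function, initial state) pair,
    where the transition maps (state, input) to (output, next state)."""
    def step(state, x):
        return state, x  # output the stored value, remember the current one
    return step, init

def run(node, inputs):
    """Drive a (step, initial_state) node over a finite input stream."""
    step, state = node
    outputs = []
    for x in inputs:
        out, state = step(state, x)
        outputs.append(out)
    return outputs

# Delay the stream [1, 2, 3] by one step, initialized with 0.
outs = run(fby(0), [1, 2, 3])
```

Composing such transition functions, including ones whose state encodes the active mode of a hierarchical state machine, yields an executable semantics for whole programs.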
Citations: 2
ObNoCs: Protecting Network-on-Chip Fabrics Against Reverse-Engineering Attacks
CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2023-09-09 | DOI: 10.1145/3609107
Dipal Halder, Maneesh Merugu, Sandip Ray
Modern System-on-Chip designs typically use Network-on-Chip (NoC) fabrics to implement coordination among integrated hardware blocks. An important class of security vulnerabilities involves a rogue foundry reverse-engineering the NoC topology and routing logic. In this paper, we develop an infrastructure, ObNoCs, for protecting NoC fabrics against such attacks. ObNoCs systematically replaces router connections with switches that can be programmed after fabrication to induce the desired topology. Our approach provides provable redaction of NoC functionality: switch configurations induce a large number of legal topologies, only one of which corresponds to the intended topology. We implement the ObNoCs methodology on the Intel Quartus™ platform, and experimental results on realistic SoC designs show that the architecture incurs minimal overhead in power, resource utilization, and system latency.
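The redaction argument can be illustrated by enumerating the legal topologies a switch configuration space induces: a reverse engineer who sees only the candidate links of each programmable switch must consider all of them. The switch model below is a deliberately simplified assumption, not ObNoCs' actual switch design.

```python
from itertools import product

def induced_topologies(switches):
    """Each programmable switch can connect its port to any of several
    candidate links; the fabricated netlist reveals only the candidates.
    Enumerate every legal topology (one choice per switch) that a
    reverse engineer would have to consider."""
    return [dict(zip(switches, choice))
            for choice in product(*switches.values())]

# Two router ports, each switch choosing among 3 candidate next hops:
switches = {"R0.east": ["R1", "R2", "R3"],
            "R1.west": ["R0", "R2", "R3"]}
candidates = induced_topologies(switches)
```

Only one of the enumerated configurations is the intended topology, and the configuration bits that select it are programmed after fabrication, outside the foundry's view.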
Citations: 0
Proactive Stripe Reconstruction to Improve Cache Use Efficiency of SSD-Based RAID Systems
CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2023-09-09 | DOI: 10.1145/3609099
Zhibing Sha, Jiaojiao Wu, Jun Li, Balazs Gerofi, Zhigang Cai, Jianwei Liao
Solid-State Drives (SSDs) exhibit different failure characteristics compared to conventional hard disk drives. In particular, the Bit Error Rate (BER) of an SSD increases as it sustains more writes. Parity-based Redundant Array of Inexpensive Disks (RAID) arrays composed of SSDs are therefore introduced to address correlated failures. In the RAID-5 implementation, specifically, the process of parity generation (or update) associated with a data stripe consists of read and write operations to the SSDs. Whenever a new update request reaches the RAID system, the related parity must also be updated and flushed onto the RAID component of the SSD. Such frequent parity updates result in poor RAID performance and shorten the lifetime of the SSDs. Consequently, a DRAM cache, called the parity cache, commonly accompanies the RAID controller and is used to buffer the parity chunks of the most frequently updated data, boosting I/O performance. To improve the use efficiency of the parity cache, this paper proposes a stripe reconstruction approach that minimizes the number of parity updates on SSDs, thus boosting the I/O performance of the SSD RAID system. When the currently updated stripe has both cold and hot updated data chunks, it proactively carries out stripe reconstruction if another matched stripe that also includes cold and hot updated data chunks can be found on the complementary RAID components. In the reconstruction process, we first group the cold data chunks of the two matched stripes to build a new stripe and flush its parity chunk to the RAID component. After that, the hot data chunks are organized as a new stripe as well, and its parity chunk is buffered in the parity cache. This results in better cache use efficiency, as it reduces the number of parity updates on the RAID components of SSDs and proactively frees up cache space for quickly absorbing subsequent write requests. In addition, the proposed method adjusts the target SSD of write requests based on stripe reconstructions by considering the I/O workload balance of all SSDs. Experimental results show that our proposal reduces the number of parity chunk updates in SSDs by 2.3% and overall I/O latency by 12.2% on average, compared to state-of-the-art parity cache management techniques.
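A minimal sketch of the two ingredients, assuming one-byte chunks and index sets as the hot/cold classification (both simplifying assumptions): RAID-5 parity as bytewise XOR, and regrouping two matched stripes into an all-cold stripe (parity flushed to the RAID component) and an all-hot stripe (parity kept in the parity cache).

```python
def parity(chunks):
    """RAID-5 parity: bytewise XOR of a stripe's data chunks."""
    out = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, b in enumerate(chunk):
            out[i] ^= b
    return bytes(out)

def reconstruct(stripe_a, stripe_b, hot_a, hot_b):
    """Regroup two matched stripes into a cold stripe (parity flushed)
    and a hot stripe (parity buffered in the parity cache).
    `hot_a`/`hot_b` are index sets of frequently updated chunks, a
    simplified stand-in for the paper's hot/cold classification."""
    cold = [c for i, c in enumerate(stripe_a) if i not in hot_a] + \
           [c for i, c in enumerate(stripe_b) if i not in hot_b]
    hot = [c for i, c in enumerate(stripe_a) if i in hot_a] + \
          [c for i, c in enumerate(stripe_b) if i in hot_b]
    return (cold, parity(cold)), (hot, parity(hot))

a = [b"\x01", b"\x02"]
b_ = [b"\x04", b"\x08"]
(cold, cold_p), (hot, hot_p) = reconstruct(a, b_, hot_a={1}, hot_b={0})
```

Because subsequent updates hit only the hot stripe, its cached parity absorbs them without repeated flushes to the SSDs.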
Citations: 0
Methods to Realize Preemption in Phased Execution Models
Tier 3 Computer Science Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date : 2023-09-09 DOI: 10.1145/3609132
Thilanka Thilakasiri, Matthias Becker
Phased execution models are a good solution to tame the increased complexity and contention of commercial off-the-shelf (COTS) multi-core platforms; examples include the Acquisition-Execution-Restitution (AER) model and the PRedictable Execution Model (PREM). Such models separate execution from access to shared resources on the platform to minimize contention: all data and instructions needed during an execution phase are copied into the local memory of the core before execution starts. Phased execution models are generally used with non-preemptive scheduling to increase predictability. However, the blocking time in non-preemptive systems can reduce schedulability, so an investigation of preemption methods for phased execution models is warranted. Preemption for phased execution models must be carefully designed to retain their execution semantics; in particular, the handling of local memory during preemption becomes non-trivial. This paper investigates different methods to realize preemption in phased execution models while preserving their semantics. To the best of our knowledge, this is the first paper to explore different approaches to implementing preemption in phased execution models from the perspective of data management. We introduce two strategies to realize preemption of execution phases, based on different methods of handling the local data of the preempted task. Heuristics are used to create time-triggered schedules for task sets that follow the proposed preemption methods. Additionally, a schedulability-aware preemption heuristic is proposed to reduce the number of preemptions by allowing preemption only when it is beneficial in terms of schedulability. Evaluations on a large number of synthetic task sets compare the proposed preemption models against each other and against a non-preemptive version. Furthermore, our schedulability-aware preemption heuristic achieves higher schedulability than the non-preemptive and fully-preemptive versions, with a clear margin in all our experiments.
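The trade-off behind handling the preempted task's local (scratchpad) data, and the schedulability-aware rule of preempting only when it pays off, can be sketched as follows. The two strategy names, the cost model, and all constants are illustrative assumptions of ours, not the paper's actual design.

```python
def preemption_cost(strategy, preempted_data_kb, copy_cost_us_per_kb=2.0):
    """Overhead of preempting an execution phase under two data strategies:
    "swap":     copy the preempted task's local data back to main memory,
                then restore it on resume (paid twice);
    "resident": leave the data in place in local memory (no copy cost)."""
    if strategy == "swap":
        return 2 * preempted_data_kb * copy_cost_us_per_kb
    if strategy == "resident":
        return 0.0
    raise ValueError(strategy)

def allow_preemption(blocking_saved_us, strategy, preempted_data_kb,
                     free_local_kb, preemptor_data_kb):
    """Schedulability-aware rule: preempt only when the blocking time saved
    outweighs the preemption overhead, and the strategy is feasible.
    Keeping data resident is only possible if the preempting task's data
    fits in the remaining local memory."""
    if strategy == "resident" and preemptor_data_kb > free_local_kb:
        return False
    return blocking_saved_us > preemption_cost(strategy, preempted_data_kb)
```

The point of the sketch is the feasibility/overhead tension: the cheap strategy is constrained by local memory capacity, while the unconstrained strategy pays copy overhead that can cancel the schedulability gain of preempting.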
{"title":"Methods to Realize Preemption in Phased Execution Models","authors":"Thilanka Thilakasiri, Matthias Becker","doi":"10.1145/3609132","DOIUrl":"https://doi.org/10.1145/3609132","url":null,"abstract":"Phased execution models are a good solution to tame the increased complexity and contention of commercial off-the-shelf (COTS) multi-core platforms, e.g., Acquisition-Execution-Restitution (AER) model, PRedictable Execution Model (PREM). Such models separate execution from access to shared resources on the platform to minimize contention. All data and instructions needed during an execution phase are copied into the local memory of the core before starting to execute. Phased execution models are generally used with non-preemptive scheduling to increase predictability. However, the blocking time in non-preemptive systems can reduce schedulability. Therefore, an investigation of preemption methods for phased execution models is warranted. Although, preemption for phased execution models must be carefully designed to retain its execution semantics, i.e., the handling of local memory during preemption becomes non-trivial. This paper investigates different methods to realize preemption in phased execution models while preserving their semantics. To the best of our knowledge, this is the first paper to explore different approaches to implement preemption in phased execution models from the perspective of data management. We introduce two strategies to realize preemption of execution phases based on different methods of handling local data of the preempted task. Heuristics are used to create time-triggered schedules for task sets that follow the proposed preemption methods. Additionally, a schedulability-aware preemption heuristic is proposed to reduce the number of preemptions by allowing preemption only when it is beneficial in terms of schedulability. 
Evaluations on a large number of synthetic task sets are performed to compare the proposed preemption models against each other and against a non-preemptive version. Furthermore, our schedulability-aware preemption heuristic has higher schedulability with a clear margin in all our experiments compared to the non-preemptive and fully-preemptive versions.","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136192778","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
FedHIL: Heterogeneity Resilient Federated Learning for Robust Indoor Localization with Mobile Devices
Tier 3 Computer Science Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date : 2023-09-09 DOI: 10.1145/3607919
Danish Gufran, Sudeep Pasricha
Indoor localization plays a vital role in applications such as emergency response, warehouse management, and augmented reality experiences. By deploying machine learning (ML) based indoor localization frameworks on their mobile devices, users can localize themselves in a variety of indoor and subterranean environments. However, achieving accurate indoor localization can be challenging due to heterogeneity in the hardware and software stacks of mobile devices, which can result in inconsistent and inaccurate location estimates. Traditional ML models also rely heavily on their initial training data, making them vulnerable to performance degradation as indoor environments change dynamically. To address the challenges of device heterogeneity and lack of adaptivity, we propose a novel embedded ML framework called FedHIL. Our framework combines indoor localization and federated learning (FL) to improve indoor localization accuracy in device-heterogeneous environments while also preserving user data privacy. FedHIL integrates a domain-specific selective weight adjustment approach to preserve the ML model's performance for indoor localization during FL, even in the presence of extremely noisy data. Experimental evaluations in diverse real-world indoor environments and with heterogeneous mobile devices show that FedHIL outperforms state-of-the-art FL and non-FL indoor localization frameworks, achieving on average 1.62× better localization accuracy than the best-performing FL-based indoor localization framework from prior work.
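Loosely, a "selective weight adjustment" step during federated averaging might look like the sketch below: only the globally most consistent weight coordinates are updated, while coordinates where heterogeneous (possibly noisy) clients disagree keep their previous global value. This is our own interpretation for illustration; `selective_fedavg`, the variance criterion, and `keep_fraction` are assumptions, not FedHIL's actual domain-specific mechanism.

```python
import numpy as np

def selective_fedavg(global_w, client_ws, keep_fraction=0.5):
    """Federated averaging that adjusts only low-variance coordinates.

    global_w:   current global weight vector, shape (n_weights,)
    client_ws:  list of client weight vectors after local training
    Coordinates with high cross-client variance (clients disagree, e.g.
    due to noisy heterogeneous devices) are left at the global value.
    """
    client_ws = np.stack(client_ws)          # (n_clients, n_weights)
    avg = client_ws.mean(axis=0)
    var = client_ws.var(axis=0)
    # Update only the keep_fraction of coordinates with the lowest variance.
    k = max(1, int(keep_fraction * global_w.size))
    stable = np.argsort(var)[:k]
    new_w = global_w.copy()
    new_w[stable] = avg[stable]
    return new_w
```

Compared to plain FedAvg, a rule like this trades some convergence speed for robustness: a single client with a corrupted fingerprint database cannot drag the disputed coordinates of the global model.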
{"title":"FedHIL: Heterogeneity Resilient Federated Learning for Robust Indoor Localization with Mobile Devices","authors":"Danish Gufran, Sudeep Pasricha","doi":"10.1145/3607919","DOIUrl":"https://doi.org/10.1145/3607919","url":null,"abstract":"Indoor localization plays a vital role in applications such as emergency response, warehouse management, and augmented reality experiences. By deploying machine learning (ML) based indoor localization frameworks on their mobile devices, users can localize themselves in a variety of indoor and subterranean environments. However, achieving accurate indoor localization can be challenging due to heterogeneity in the hardware and software stacks of mobile devices, which can result in inconsistent and inaccurate location estimates. Traditional ML models also heavily rely on initial training data, making them vulnerable to degradation in performance with dynamic changes across indoor environments. To address the challenges due to device heterogeneity and lack of adaptivity, we propose a novel embedded ML framework called FedHIL . Our framework combines indoor localization and federated learning (FL) to improve indoor localization accuracy in device-heterogeneous environments while also preserving user data privacy. FedHIL integrates a domain-specific selective weight adjustment approach to preserve the ML model's performance for indoor localization during FL, even in the presence of extremely noisy data. Experimental evaluations in diverse real-world indoor environments and with heterogeneous mobile devices show that FedHIL outperforms state-of-the-art FL and non-FL indoor localization frameworks. 
FedHIL is able to achieve 1.62 × better localization accuracy on average than the best performing FL-based indoor localization framework from prior work.","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136107342","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
B-AWARE: Blockage Aware RSU Scheduling for 5G Enabled Autonomous Vehicles
Tier 3 Computer Science Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date : 2023-09-09 DOI: 10.1145/3609133
Matthew Szeto, Edward Andert, Aviral Shrivastava, Martin Reisslein, Chung-Wei Lin, Christ Richmond
5G Millimeter Wave (mmWave) technology holds great promise for Connected Autonomous Vehicles (CAVs) due to its ability to achieve data rates in the Gbps range. However, mmWave suffers from a high beamforming overhead and requires line of sight (LOS) to maintain a strong connection. For Vehicle-to-Infrastructure (V2I) scenarios, where CAVs connect to roadside units (RSUs), these drawbacks become apparent. Because vehicles are dynamic, there is a large potential for link blockages, and these blockages are detrimental to the connected applications running on the vehicle, such as cooperative perception and remote driver takeover. Existing RSU selection schemes base their decisions on signal strength and vehicle trajectory alone, which is not enough to prevent link blockage. Many modern CAV motion-planning algorithms routinely use other vehicles' near-future path plans, obtained either by explicit communication among vehicles or by prediction. In this paper, we make use of the knowledge of other vehicles' near-future path plans to further improve the RSU association mechanism for CAVs. We solve the RSU association problem by converting it to a shortest-path problem with the objective of maximizing the total communication bandwidth. We evaluate our approach, titled B-AWARE, in simulation using Simulation of Urban Mobility (SUMO) and Digital twin for self-dRiving Intelligent VEhicles (DRIVE) on 12 highway and city street scenarios with varying traffic density and RSU placements. Simulations show that B-AWARE improves the potential datarate by 1.05× in the average case and 1.28× in the best case vs. the state-of-the-art. More impressively, B-AWARE reduces the time spent with no connection by 42% in the average case and 60% in the best case compared to state-of-the-art methods, a result of B-AWARE eliminating nearly 100% of blockage occurrences.
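The shortest-path conversion can be illustrated on a time-expanded graph: one node per (timestep, RSU), edge weights derived from predicted bandwidth (zero when another vehicle's planned path blocks LOS), and maximizing total bandwidth is shortest path on negated weights. Since the graph is a DAG, plain dynamic programming suffices. The sketch below is our own, with an assumed constant handover penalty; it is not B-AWARE's actual formulation.

```python
def best_rsu_schedule(bandwidth, switch_penalty=1.0):
    """Pick an RSU per timestep maximizing total bandwidth minus handover cost.

    bandwidth[t][r]: predicted datarate to RSU r at timestep t
                     (0 when a predicted vehicle position blocks LOS).
    Returns (association sequence, total objective value).
    """
    T, R = len(bandwidth), len(bandwidth[0])
    best = list(bandwidth[0])              # best total ending at RSU r at time t
    choice = [list(range(R))]              # back-pointers (row 0 unused)
    for t in range(1, T):
        new_best, prev = [], []
        for r in range(R):
            # Staying on the same RSU avoids the handover/beamforming penalty.
            val, p = max((best[p] - (switch_penalty if p != r else 0.0), p)
                         for p in range(R))
            new_best.append(val + bandwidth[t][r])
            prev.append(p)
        best = new_best
        choice.append(prev)
    # Reconstruct the association sequence from the back-pointers.
    r = max(range(R), key=lambda i: best[i])
    path = [r]
    for t in range(T - 1, 0, -1):
        r = choice[t][r]
        path.append(r)
    return list(reversed(path)), max(best)
```

With a blockage-aware bandwidth prediction, the DP naturally schedules a handover just before an RSU becomes blocked, instead of reacting after the link drops.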
{"title":"B-AWARE: Blockage Aware RSU Scheduling for 5G Enabled Autonomous Vehicles","authors":"Matthew Szeto, Edward Andert, Aviral Shrivastava, Martin Reisslein, Chung-Wei Lin, Christ Richmond","doi":"10.1145/3609133","DOIUrl":"https://doi.org/10.1145/3609133","url":null,"abstract":"5G Millimeter Wave (mmWave) technology holds great promise for Connected Autonomous Vehicles (CAVs) due to its ability to achieve data rates in the Gbps range. However, mmWave suffers from a high beamforming overhead and requirement of line of sight (LOS) to maintain a strong connection. For Vehicle-to-Infrastructure (V2I) scenarios, where CAVs connect to roadside units (RSUs), these drawbacks become apparent. Because vehicles are dynamic, there is a large potential for link blockages. These blockages are detrimental to the connected applications running on the vehicle, such as cooperative perception and remote driver takeover. Existing RSU selection schemes base their decisions on signal strength and vehicle trajectory alone, which is not enough to prevent the blockage of links. Many modern CAVs motion planning algorithms routinely use other vehicle’s near-future path plans, either by explicit communication among vehicles, or by prediction. In this paper, we make use of the knowledge of other vehicle’s near future path plans to further improve the RSU association mechanism for CAVs. We solve the RSU association algorithm by converting it to a shortest path problem with the objective to maximize the total communication bandwidth. We evaluate our approach, titled B-AWARE, in simulation using Simulation of Urban Mobility (SUMO) and Digital twin for self-dRiving Intelligent VEhicles (DRIVE) on 12 highway and city street scenarios with varying traffic density and RSU placements. Simulations show B-AWARE results in a 1.05× improvement of the potential datarate in the average case and 1.28× in the best case vs. the state-of-the-art. 
But more impressively, B-AWARE reduces the time spent with no connection by 42% in the average case and 60% in the best case as compared to the state-of-the-art methods. This is a result of B-AWARE reducing nearly 100% of blockage occurrences.","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136107355","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Journal
ACM Transactions on Embedded Computing Systems