Software-Implemented Hardware Fault Tolerance (SIHFT) is a modern approach for tackling random hardware faults of dependable systems employing solely software solutions. This work extends an automatic compiler-based SIHFT hardening tool called ASPIS, enhancing it with novel protection mechanisms and overhead-reduction techniques, also providing an extensive analysis of its compliance with the non-trivial workload of the open-source Real-Time Operating System FreeRTOS. A thorough experimental fault-injection campaign on an STM32 board shows how the system achieves remarkably high tolerance to single-event upsets and a comparison between the SIHFT mechanisms implemented summarises the trade-off between the overhead introduced and the detection capabilities of the various solutions.
Graph processing is widely used for many modern applications, such as social networks, recommendation systems, and knowledge graphs. However, processing large-scale graphs on traditional Von Neumann architectures is challenging due to the irregular graph data and memory-bound graph algorithms. Processing-in-memory (PIM) architecture has emerged as a promising approach for accelerating graph processing by enabling computation to be performed directly on memory. Despite having many processing units and high local memory bandwidth, PIM often suffers from insufficient global communication bandwidth and high synchronization overhead due to load imbalance.
This paper proposes GraphB, a novel PIM-based graph processing system, to address all these issues. From the algorithm perspective, we propose a degree-aware graph partitioning algorithm that can generate balanced partitioning at a low cost. From the architecture perspective, we introduce tile buffers incorporated with an on-chip 2D-Mesh, which provides high bandwidth for inter-node data transfer. Dataflow in GraphB is designed to enable computation-communication overlap and dynamic load balancing. In a PyMTL3-based cycle-accurate simulator with five real-world graphs and three common algorithms, GraphB achieves an average 2.2 × and maximum 2.8 × speedup compared to the SOTA PIM-based graph processing system GraphQ.
Transistor aging leads to the deterioration of analog circuit performance over time. The worst aging degradation is used to evaluate the circuit reliability. It is extremely expensive to obtain it since several circuit stimuli need to be simulated. The worst degradation collection cost reduction brings an inaccurate training dataset when a machine learning (ML) model is used to fast perform the estimation. Motivated by the fact that there are many similar subcircuits in large-scale analog circuits, in this paper, we propose Wages to train an ML model on an inaccurate dataset for the worst aging degradation estimation via domain generalization technique. A sampling-based method on the feature space of the transistor and its neighborhood subcircuit is developed to replace inaccurate labels. A consistent estimation for the worst degradation is enforced to update model parameters. Label updating and model updating are performed alternately to train an ML model on the inaccurate dataset. Experimental results on the very advanced 5nm technology node show our Wages can significantly reduce the label collection cost with a negligible estimation error for the worst aging degradations compared to the traditional methods.
Continuous-flow microfluidic biochips are gaining increasing attention with promising applications for automatically executing various laboratory procedures in biology and biochemistry. Biochips with distributed channel-storage architectures enable each channel to switch between the roles of transportation and storage. Consequently, fluid transportation, caching, and fetch can occur concurrently through different flow paths. When two dissimilar types of fluidic flows occur through the same channels in a time-interleaved manner, it may cause contamination to the latter as some residues of the former flow may be stuck at the channel wall during transportation. To remove the residues, wash operations are introduced as an essential step to avoid incorrect assay outcomes. However, existing work has been considered that the washing capacity of a buffer fluid is unlimited. In the actual scenario, a fixed-volume buffer fluid irrefutably possesses a limited washing capacity, which can be successively consumed while washing away residues from the channels. Hence, capacity-aware wash scheme is a basic requirement to fulfil the dynamic fluid scheduling and channel storage. In this paper, we formulate a practical wash optimization problem for microfluidic biochips, which considers the requirements of dynamic fluid scheduling, channel storage, as well as washing capacity constraints of buffer fluids simultaneously, and present an efficient design flow to solve this problem systematically. Given the high-level synthesis result of a biochemical application and the corresponding component placement solution, our goal is to complete a contamination-aware flow-path planning with short flow-channel length. Meanwhile, the biochemical application can be executed efficiently and correctly with an optimized capacity-aware wash scheme. Experimental results show that compared to a state-of-the-art washing method, the proposed method achieves an average reduction of 26.1%, 43.1%, and 34.1% across all the benchmarks with respect to the total channel length, total wash time, and execution time of bioassays, respectively.
Large volumes of on-chip and off-chip memory are required by contemporary applications. Emerging non-volatile memory technologies including STT-RAM, PCM, and ReRAM are becoming popular for on-chip and off-chip memories as a result of their desirable properties. Compared to traditional memory technologies like SRAM and DRAM, they have minimal leakage current and high packing density. Non Volatile Memories (NVM), however, have a low write endurance, a high write latency, and high write energy. Non-volatile Single Level Cell (SLC) memories can store a single bit of data in each memory cell, whereas Multi Level Cells (MLC) can store two or more bits in each memory cell. Although MLC NVMs have substantially higher packing density than SLCs, their lifetime and access speed are key concerns. For a given cache size, MLC caches consume 1.84x less space and 2.62x less leakage power than SLC caches. We propose Trace buffer Assisted Non-volatile Memory Cache (TANC), an approach that increases the lifespan and performance of MLC-based last-level caches using the underutilised Embedded Trace Buffers (ETB). TANC improves the lifetime of MLC LLCs up to 4.36x, and decreases average memory access time by 4% compared to SLC NVM LLCs and by 6.41x and 11%, respectively, compared to baseline MLC LLCs.
Considering 3D NAND flash has a new property of process variation (PV), which causes different raw bit error rates (RBER) among different layers of the flash block. This paper builds a mathematical model for estimating the retention errors of flash cells, by considering the factor of layer-to-layer PV in 3D NAND flash memory, as well as the factors of program/erase (P/E) cycle and retention time of data. Then, it proposes classifying the layers of flash block in 3D NAND flash memory into profitable and unprofitable categories, according to the error correction overhead. After understanding the retention error variation of different layers in 3D NAND flash, we design a mechanism of data placement, which maps the write data onto a suitable layer of flash block, according to the data hotness and the error correction overhead of layers, to boost read performance of 3D NAND flash. The experimental results demonstrate that our proposed retention error estimation model can yield a R2 value of
To speed up the design closure and improve the QoR of FPGA, supervised single-task machine learning techniques have been used to predict individual design metric based on placement results. However, the design objective is to achieve optimal performance while considering multiple conflicting metrics. The single-task approaches predict each metric in isolation and neglect the potential correlations or dependencies among them. To address the limitations, this paper proposes a multi-task learning approach to jointly predict wirelength, congestion and power. By sharing the common feature representations and adopting the joint optimization strategy, the novel WCPNet models (including WCPNet-HS and WCPNet-SS) can not only predict the three metrics of different scales simultaneously, but also outperform the majority of single-task models in terms of both prediction performance and time cost, which are demonstrated by the results of the cross design experiment. By adopting the cross-stitch structure in the encoder, WCPNet-SS outperforms WCPNet-HS in prediction performance, but WCPNet-HS is faster because of the simpler parameters sharing structure. The significance of the feature imagepinUtilization on predicting power and wirelength are demonstrated by the ablation experiment.
HMPSoCs combine different processors on a single chip. They enable powerful embedded devices, which increasingly perform ML inference tasks at the edge. State-of-the-art HMPSoCs can perform on-chip embedded inference using different processors, such as CPUs, GPUs, and NPUs. HMPSoCs can potentially overcome the limitation of low single-processor CNN inference performance and efficiency by cooperative use of multiple processors. However, standard inference frameworks for edge devices typically utilize only a single processor.
We present the ARM-CO-UP framework built on the ARM-CL library. The ARM-CO-UP framework supports two modes of operation – Pipeline and Switch. It optimizes inference throughput using pipelined execution of network partitions for consecutive input frames in the Pipeline mode. It improves inference latency through layer-switched inference for a single input frame in the Switch mode. Furthermore, it supports layer-wise CPU/GPU DVFS in both modes for improving power efficiency and energy consumption. ARM-CO-UP is a comprehensive framework for multi-processor CNN inference that automates CNN partitioning and mapping, pipeline synchronization, processor type switching, layer-wise DVFS, and closed-source NPU integration.