Self-testing and self-healing neural network accelerator design
Date
2023
Authors
Publisher
University of Delaware
Abstract
With the dramatic growth of computation capability over the past decade, Deep Neural Networks (DNNs) have regained their popularity and become the go-to tool for solving many real-world recognition and classification problems, from autonomous vehicles to cloud servers and from advanced manufacturing to disease diagnostics. Yet intelligent data interpretation via deep learning is also extremely power-hungry. State-of-the-art DNNs have millions of weights and require billions of multiply-accumulate (MAC) operations which, without a dedicated hardware accelerator, consume too much energy to be performed on a battery-constrained edge device.

One promising path toward energy-efficient neural network accelerators is processing-in-memory (PIM) designs built with emerging non-CMOS devices such as Resistive RAM (ReRAM), Phase-Change Memory (PCM), and Spin-Transfer Torque Magnetic RAM (STT-MRAM). These devices offer high density and extremely low power consumption, but they also suffer from high error rates due to immature fabrication processes, imprecise programming, process variations, and aging. When these errors accumulate in the on-chip and off-chip memories of DNN accelerators, where millions of DNN weights are stored, they degrade inference accuracy and training efficiency.

This dissertation tackles the aforementioned reliability challenges of DNN accelerators at the algorithm level. It proposes a general framework integrating fault detection, fault location, and fault recovery.

Fault detection is achieved with an online self-test framework that monitors the health of a DNN model using a small set of test images selected from the test dataset. Multiple test-image selection strategies are examined and compared, including statistical methods that rely on fault injection and methods based on different numerical scores.

The dissertation also presents a self-healing framework that is invoked once an accelerator is detected as faulty. It employs a frugal checksum-based scheme to locate faults in DNN kernels. Within each identified faulty kernel, the proposed recovery process mitigates the impact of faults by correcting extreme values and proportionally distributing checksum differences among the weights.

In addition to enhancing the inference reliability of DNN accelerators via self-testing and self-healing, this dissertation also investigates the viability of model retraining on edge devices. Considering the relatively high write energy and limited endurance of emerging memory devices, the weight-update process is redesigned to reduce bit flips in memory cells. To further prolong memory lifetime, the dissertation introduces two additional techniques, namely filter exchange and bitwise rotation, which balance writes across different weights and across the bits of each weight. Together, these techniques bring significant power savings and endurance improvements to DNN accelerators built with emerging memory devices.
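To make the score-based test-image selection concrete, below is a minimal sketch of one plausible numerical score, the top-2 softmax margin. The abstract does not name the specific scores the dissertation uses, so the score choice, the function name select_test_images, and its parameters are illustrative assumptions.

```python
import numpy as np

def select_test_images(probs, k=10):
    """Select the k test images with the smallest top-2 softmax margin.

    probs : (N, C) array of class probabilities from the fault-free model.
    Images near the decision boundary are the first to change label when
    weights are corrupted, so they make a compact fault detector.
    """
    top2 = np.sort(probs, axis=1)[:, -2:]   # two largest probabilities per image
    margin = top2[:, 1] - top2[:, 0]        # top-1 minus top-2 confidence
    return np.argsort(margin)[:k]           # indices of the k smallest margins
```

At runtime, the selected images would be run through the accelerator periodically; any deviation from their recorded golden labels flags the model as faulty.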
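The checksum-based self-healing step can likewise be sketched. This is a minimal illustration assuming one additive checksum per kernel and a redistribution weighted by magnitude; the dissertation's exact checksum layout and weighting are not specified in the abstract, and all names here are hypothetical.

```python
import numpy as np

def heal_kernels(kernels, golden_sums, w_min=-1.0, w_max=1.0, tol=1e-6):
    """Checksum-based fault location followed by in-place repair.

    kernels     : list of float weight arrays, one per DNN kernel.
    golden_sums : per-kernel weight checksums recorded at deployment.
    w_min/w_max : assumed legal weight range; values outside it are
                  treated as extreme faults and clipped first.
    """
    for w, gold in zip(kernels, golden_sums):
        if abs(w.sum() - gold) < tol:        # checksum matches: kernel healthy
            continue
        np.clip(w, w_min, w_max, out=w)      # step 1: correct extreme values
        diff = gold - w.sum()                # residual checksum difference
        scale = np.abs(w) / max(np.abs(w).sum(), tol)
        w += diff * scale                    # step 2: distribute proportionally
    return kernels
```

Clipping first removes extreme faults that would otherwise dominate the residual, so the remaining checksum difference reflects only small deviations spread across the kernel.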
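Bitwise rotation for endurance can also be sketched. The snippet below assumes 8-bit fixed-point weights and a rotation offset that advances by one position per write epoch; both are illustrative choices, not the dissertation's stated policy.

```python
def rotate_left(word, shift, bits=8):
    """Rotate an unsigned fixed-point word left by `shift` bit positions."""
    mask = (1 << bits) - 1
    shift %= bits
    return ((word << shift) | (word >> (bits - shift))) & mask

def bit_flips(old, new, bits=8):
    """Count memory-cell bit flips incurred by overwriting old with new."""
    return bin((old ^ new) & ((1 << bits) - 1)).count("1")

# Rotating the stored encoding on each write epoch moves the frequently
# toggled low-order bits to a different physical cell, spreading wear
# across all bit positions of the word.
stored = rotate_left(0b00000011, shift=0)   # epoch 0: no rotation
update = rotate_left(0b00000101, shift=1)   # epoch 1: rotate by one position
print(bit_flips(stored, update))            # flips actually seen by the cells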
Keywords
Fault tolerant, Neural network, Neural network accelerator, Non-volatile memory, Reliable system design