Hardware-Accelerated DL with VTA
Optimizing and deploying Deep Learning models onto FPGA-based hardware accelerators using the Versatile Tensor Accelerator (VTA) framework for real-time applications.
- FPGA
- Deep Learning
- VTA
- Hardware
Selected work
A mix of academic, professional and side projects across embedded intelligence, TinyML, semiconductor reliability and the web.
TinyML study deploying CNN, LSTM, Hybrid CNN-LSTM and DeepConvLSTM models on the Bangle.js 2 smartwatch (256 KB RAM, 1 MB Flash) for real-time activity classification across the HANG Time-HAR, PAMAP2 and WEAR datasets.
Smartwatches are everywhere, but running deep learning HAR models on them is hard: 256 KB of RAM and 1 MB of Flash on the Bangle.js 2 rule out standard DL stacks. TinyML closes this gap by keeping inference local, private and low-latency.
On HANG Time-HAR, classified 19 basketball activities from wrist-worn 3-axis accelerometer data (50 Hz, ±8 g) recorded by 24 players. Trained CNN (4,124 params), LSTM, Hybrid CNN-LSTM and DeepConvLSTM (41,324 params) models with a 60/20/20 split on MinMax-scaled (20×3) windows. DeepConvLSTM was the strongest at 79.33% test accuracy; the LSTM collapsed to ~9.6%, suggesting that convolutional feature extraction is essential on this dataset.
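The preprocessing described above can be sketched roughly as follows. This is a minimal illustration, not the study's actual pipeline: the non-overlapping windows, the last-sample labelling rule, the random window-level split and the synthetic data are all assumptions made for the example.

```python
import numpy as np

WINDOW = 20      # samples per window (0.4 s at 50 Hz)
CHANNELS = 3     # x, y, z accelerometer axes

def make_windows(signal, labels, window=WINDOW):
    """Cut a (T, 3) recording into non-overlapping (window, 3) segments."""
    n = len(signal) // window
    x = signal[: n * window].reshape(n, window, signal.shape[1])
    # Assumption: each window takes the label of its last sample.
    y = labels[window - 1 : n * window : window]
    return x, y

def minmax_fit(x):
    """Per-axis minimum and range, computed on the training windows only."""
    lo = x.min(axis=(0, 1))
    hi = x.max(axis=(0, 1))
    return lo, np.where(hi > lo, hi - lo, 1.0)

def minmax_apply(x, lo, span):
    """Scale windows with statistics fitted on the training split."""
    return (x - lo) / span

def split_60_20_20(x, y, seed=0):
    """Random 60/20/20 split over windows (assumed; could also be per player)."""
    idx = np.random.default_rng(seed).permutation(len(x))
    a, b = int(0.6 * len(x)), int(0.8 * len(x))
    tr, va, te = idx[:a], idx[a:b], idx[b:]
    return (x[tr], y[tr]), (x[va], y[va]), (x[te], y[te])

# Synthetic stand-in for one recording: 1000 samples within ±8 g, 19 classes.
rng = np.random.default_rng(42)
signal = rng.uniform(-8.0, 8.0, size=(1000, CHANNELS))
labels = rng.integers(0, 19, size=1000)

x, y = make_windows(signal, labels)
(train_x, train_y), (val_x, val_y), (test_x, test_y) = split_60_20_20(x, y)
lo, span = minmax_fit(train_x)
train_x = minmax_apply(train_x, lo, span)
val_x = minmax_apply(val_x, lo, span)
test_x = minmax_apply(test_x, lo, span)
```

Fitting the scaler on the training split only keeps validation and test statistics out of training, so held-out values can fall slightly outside [0, 1].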
Re-evaluated the same family of architectures on the PAMAP2 daily-activity benchmark to test generalization across sensor placements and activity types.
Adapted a Bangle.js gesture-recognition pipeline to the WEAR fitness dataset and compared a Baseline CNN, SeparableConv1D CNN, Multi-branch CNN and Multi-scale CNN in both FP32 and INT8. SeparableConv1D delivered the best accuracy-per-byte trade-off and was selected as the deployment candidate.
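The accuracy-per-byte advantage of separable convolutions comes largely from their parameter count. The arithmetic below is a small sketch of that effect; the layer sizes (kernel 3, 32 input channels, 64 filters) are hypothetical and not taken from the models in this study.

```python
def conv1d_params(kernel, c_in, c_out, bias=True):
    """Standard 1D convolution: one kernel x c_in filter per output channel."""
    return kernel * c_in * c_out + (c_out if bias else 0)

def separable_conv1d_params(kernel, c_in, c_out, depth_multiplier=1, bias=True):
    """Depthwise pass per input channel, then a 1x1 pointwise mix."""
    depthwise = kernel * c_in * depth_multiplier
    pointwise = c_in * depth_multiplier * c_out
    return depthwise + pointwise + (c_out if bias else 0)

# Hypothetical layer sizes, chosen only to illustrate the gap.
standard = conv1d_params(kernel=3, c_in=32, c_out=64)
separable = separable_conv1d_params(kernel=3, c_in=32, c_out=64)
print(standard, separable, round(standard / separable, 2))
```

With these made-up sizes the standard layer has 6,208 parameters and the separable one 2,208, roughly 2.8 times fewer.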
Models were trained in Keras and converted to TensorFlow Lite with post-training INT8 quantization, targeting the TensorFlow Lite for Microcontrollers runtime. Because the physical device was unavailable, they were validated in the Bangle.js emulator (TFLite → JSON); inference latency stayed in the 7–27 ms range.
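To make the quantization step concrete, here is a minimal sketch of the affine 8-bit scheme that post-training quantization is based on, where a real value is approximated as scale × (q − zero_point). It is a hand-rolled illustration in NumPy, not the converter's implementation, and the function names are made up for the example.

```python
import numpy as np

def quant_params(x_min, x_max, qmin=-128, qmax=127):
    """Derive scale and zero point for an observed value range."""
    # The range must include 0 so that real zero is exactly representable.
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    scale = (x_max - x_min) / (qmax - qmin)
    if scale == 0.0:
        scale = 1.0  # degenerate all-zero tensor
    zero_point = int(round(qmin - x_min / scale))
    return scale, zero_point

def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Map float values to 8-bit integers, saturating at the range ends."""
    q = np.round(x / scale) + zero_point
    return np.clip(q, qmin, qmax).astype(np.int8)

def dequantize(q, scale, zero_point):
    """Recover the approximate float values."""
    return scale * (q.astype(np.float32) - zero_point)

# MinMax-scaled inputs live in [0, 1].
scale, zero_point = quant_params(0.0, 1.0)
x = np.linspace(0.0, 1.0, 11)
q = quantize(x, scale, zero_point)
max_error = float(np.max(np.abs(dequantize(q, scale, zero_point) - x)))
```

Storing one byte per weight instead of four is what makes the models fit: for weights alone, the 41,324-parameter DeepConvLSTM needs about 161 KB in FP32 but about 40 KB in 8-bit form, before counting activations and runtime overhead.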
Semiconductor Electronics Design literature review on adaptive Error Correction Codes (ECC) for sub-28 nm memories, covering soft errors, multi-cell upsets and automated management of ECC features across chip revisions.
As critical charge (Q_crit) shrinks at 16 nm, 7 nm and below, even low-energy neutrons and alpha particles cause bit-flips. Manufacturing variation, NBTI and HCI aging further erode reliability, making ECC a first-class architectural concern rather than an afterthought.
Stefani et al. (SBCCI 2023) propose a Dynamic Fault-Tolerant Memory Controller that switches between Parity, Hamming and LPC per memory block, driven by an evaluator and a threshold engine. In 28 nm CMOS at 1 V it reaches 99.956% correction efficacy while using ~41.35% less energy than a static LPC controller in low-error scenarios. The cost is a larger controller footprint: 8,205 µm², up from 1,446 µm² with no ECC.
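The per-block switching idea can be illustrated with a toy software model. This is only a sketch of the general evaluator-plus-threshold pattern, not the controller from the paper: the window length, the two thresholds and every class and method name here are hypothetical.

```python
from enum import IntEnum

class Ecc(IntEnum):
    PARITY = 0   # cheapest option
    HAMMING = 1  # middle option
    LPC = 2      # treated here as the strongest, costliest option

class AdaptiveEccController:
    """Toy model: re-pick each block's code after every `window` accesses."""

    def __init__(self, n_blocks, window=1000, hamming_at=1, lpc_at=4):
        self.window = window
        self.hamming_at = hamming_at  # hypothetical threshold
        self.lpc_at = lpc_at          # hypothetical threshold
        self.errors = [0] * n_blocks
        self.accesses = [0] * n_blocks
        self.scheme = [Ecc.PARITY] * n_blocks

    def record_access(self, block, error_detected):
        """Evaluator: count errors; at the window boundary, re-select."""
        self.accesses[block] += 1
        self.errors[block] += int(error_detected)
        if self.accesses[block] >= self.window:
            self.scheme[block] = self._select(self.errors[block])
            self.accesses[block] = 0
            self.errors[block] = 0
        return self.scheme[block]

    def _select(self, errors):
        """Threshold engine: more observed errors, stronger code."""
        if errors >= self.lpc_at:
            return Ecc.LPC
        if errors >= self.hamming_at:
            return Ecc.HAMMING
        return Ecc.PARITY

ctrl = AdaptiveEccController(n_blocks=2, window=10)
for i in range(10):
    ctrl.record_access(0, error_detected=(i < 5))   # noisy block
    ctrl.record_access(1, error_detected=(i == 0))  # mostly clean block
noisy, quiet = ctrl.scheme[0], ctrl.scheme[1]
```

Because each block can also step back down to a cheaper code after a clean window, energy is spent on strong correction only where errors are actually observed.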
Michelon et al. (SANER 2023) treat ECC variants as features in a Software Product Line gated by #ifdefs. Their tool mines feature commits, computes deltas between source and destination releases and propagates ECC implementations automatically, reaching ~99% precision/recall at ~63 s per propagation and avoiding error-prone manual integration of reliability logic.
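As a rough illustration of the mine-then-diff idea, the sketch below pulls out the lines guarded by a given #ifdef and reports which of them are new in a later release. It is a toy, not the tool from the paper: the feature name ECC_HAMMING, the C snippets and the helper names are invented, and only plain `#ifdef NAME` guards are recognised (`#if defined(...)` and `#ifndef` are ignored).

```python
import difflib
import re

_OPEN = re.compile(r"^\s*#\s*(ifdef|ifndef|if)\b(.*)$")
_ELSE = re.compile(r"^\s*#\s*(else|elif)\b")
_CLOSE = re.compile(r"^\s*#\s*endif\b")

def feature_lines(source, feature):
    """Collect code lines nested inside `#ifdef <feature>` regions."""
    stack, out = [], []
    for line in source.splitlines():
        m = _OPEN.match(line)
        if m:
            tokens = m.group(2).split()
            stack.append(m.group(1) == "ifdef" and tokens[:1] == [feature])
        elif _ELSE.match(line):
            if stack:
                stack[-1] = False  # the alternative branch is not the feature
        elif _CLOSE.match(line):
            if stack:
                stack.pop()
        elif any(stack):
            out.append(line)
    return out

def added_lines(old, new):
    """Lines present in `new` but not matched in `old` (the delta to propagate)."""
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    out = []
    for tag, _, _, j1, j2 in matcher.get_opcodes():
        if tag in ("insert", "replace"):
            out.extend(new[j1:j2])
    return out

# Invented source and destination releases of the same function.
OLD = """int read(int a) {
#ifdef ECC_HAMMING
  check_hamming(a);
#else
  check_parity(a);
#endif
  return mem[a];
}"""
NEW = OLD.replace("  check_hamming(a);", "  check_hamming(a);\n  log_correction(a);")

delta = added_lines(feature_lines(OLD, "ECC_HAMMING"),
                    feature_lines(NEW, "ECC_HAMMING"))
```

A real propagation tool also has to place the delta correctly in the destination release and resolve conflicts, which this sketch does not attempt.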
The review argues for hardware-software synergy at sub-28 nm: ML-driven error prediction inside the memory controller, version-control-integrated feature propagation, and energy-aware codes like LPC replacing heavyweight Chipkill in high-density 3D memories.