This paper introduces AtomicVAD, an ultra-lightweight, end-to-end voice activity detection (VAD) model designed for inference on resource-constrained microcontrollers at the extreme edge. Existing VAD models often rely on large architectures with thousands of trainable parameters, making them impractical for deployment on the low-power microcontrollers common in Internet of Things (IoT) systems. Even with compression methods such as quantization or pruning, these models typically fail to achieve low-latency performance under strict power and memory limits. AtomicVAD overcomes these limitations by introducing the General Growing Cosine Unit, a trainable oscillatory activation function that embeds feature learning within periodic modulations. This design yields remarkable efficiency: approximately 0.3k trainable parameters, a 99.7% reduction compared to commonly used baselines such as MarbleNet, at competitive accuracy. Evaluated on the challenging AVA-Speech benchmark, AtomicVAD achieves an AUROC of 0.903 and an F2-score of 0.891, outperforming larger state-of-the-art systems and demonstrating robustness to background noise and music. With INT8 quantization, AtomicVAD delivers ultra-low-latency inference (as low as 26 ms on a 240 MHz Cortex-M7 and 1.22 s on a 64 MHz Cortex-M4F), with a memory footprint below 75 kB of Flash and 65 kB of SRAM. A real-world LoRaWAN field trial further validated its practicality: on-device speech gating eliminates unnecessary, bandwidth-intensive audio uploads, reducing over-the-air delays from minutes to milliseconds. Key use cases include remote monitoring, smart-home control, disaster-response sensor networks, and other long-range, low-power systems requiring efficient, always-on audio processing.
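For illustration, the sketch below shows one plausible form such a trainable oscillatory activation could take, assuming the General Growing Cosine Unit generalizes the known Growing Cosine Unit, GCU(x) = x·cos(x), with learnable per-channel frequency and phase. The class name GeneralGCU and the parameters alpha and beta are hypothetical; the paper's exact parameterization may differ.

```python
import torch
import torch.nn as nn


class GeneralGCU(nn.Module):
    """Hypothetical trainable oscillatory activation: x * cos(alpha*x + beta).

    Generalizes GCU(x) = x * cos(x) with learnable per-channel frequency
    (alpha) and phase (beta). This is a sketch, not the paper's verified
    parameterization.
    """

    def __init__(self, num_channels: int):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(num_channels))   # frequency, init at GCU
        self.beta = nn.Parameter(torch.zeros(num_channels))   # phase, init at GCU

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames); broadcast per-channel parameters
        a = self.alpha.view(1, -1, 1)
        b = self.beta.view(1, -1, 1)
        # Embed feature learning within a periodic modulation of the input
        return x * torch.cos(a * x + b)


# Usage sketch: apply the activation to an 8-channel feature map
act = GeneralGCU(num_channels=8)
y = act(torch.randn(4, 8, 100))
```

Under this assumed form, each activation layer adds only 2×C parameters (16 for the 8-channel example above), which is consistent with a total budget on the order of 0.3k trainable parameters.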
