RISC-V customized instructions for AI-related operations
Abstract
Deep-learning workloads, particularly convolutional neural networks (CNNs), rely on computationally expensive operations such as matrix multiplication and convolution. General-purpose processors struggle to execute these workloads efficiently because of high instruction counts, memory-access overhead, and suboptimal vector utilization. Dedicated AI accelerators (e.g., GPUs, TPUs, NPUs) offer superior performance, but they lack flexibility and consume significant power, which makes them impractical for embedded AI inference and real-time applications.

This thesis explores the design and evaluation of custom RISC-V instructions for AI acceleration, focusing on fused compute operations that optimize key AI tasks. By introducing custom fused instructions, such as load plus multiply-accumulate (MAC), fused load-multiply, and output-stationary direct convolution operations, this work reduces instruction count and execution cycles in AI workloads while retaining the flexibility of general-purpose computing. To evaluate the effectiveness of these instructions, the research employs cycle-accurate simulation in gem5 and benchmarks scalar, vectorized (RVV), and custom-instruction implementations across different CNN layers.

The results show that the custom fused instructions reduce instruction count by 10–30%, improving execution efficiency. The direct-convolution custom instructions outperform naïve vectorized implementations, achieving 2×–5× speedups. Winograd convolution, which was already 1.2×–1.5× faster than im2col, gains a further 10–20% improvement from the fused operations.

These findings demonstrate that custom instruction-level optimization can narrow the performance gap between general-purpose RISC-V processors and dedicated AI accelerators, offering a scalable and power-efficient alternative for AI workloads. The extensibility of RISC-V allows for continued AI-specific optimization, making it an attractive platform for future low-power AI inference hardware. This research reinforces the potential of customizable, open-source hardware in advancing efficient, domain-specific AI acceleration.
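As a purely illustrative sketch of the kind of fusion the thesis targets, the C fragment below implements a scalar output-stationary direct convolution inner loop. The function name, argument layout, and the comment mapping the inner statement onto a single fused load+MAC instruction are assumptions made for illustration; they do not reproduce the thesis's actual custom encodings or benchmark code.

    #include <stddef.h>

    /* Output-stationary direct convolution (valid padding, single channel).
     * The accumulator for each output element stays in a register while the
     * kernel window is traversed. In a custom-ISA variant, the marked
     * load/load/multiply/add sequence could be collapsed into a single fused
     * load+MAC instruction (hypothetical), reducing the dynamic instruction
     * count of the inner loop. */
    void conv2d_valid(const float *in, const float *krn, float *out,
                      size_t H, size_t W, size_t KH, size_t KW)
    {
        size_t OH = H - KH + 1, OW = W - KW + 1;
        for (size_t oy = 0; oy < OH; ++oy) {
            for (size_t ox = 0; ox < OW; ++ox) {
                float acc = 0.0f;          /* output-stationary accumulator */
                for (size_t ky = 0; ky < KH; ++ky) {
                    for (size_t kx = 0; kx < KW; ++kx) {
                        /* Baseline: two loads, one multiply, one add per tap.
                         * Fused (hypothetical): one instruction that loads the
                         * input element and accumulates its product with the
                         * corresponding weight into acc. */
                        acc += in[(oy + ky) * W + (ox + kx)] * krn[ky * KW + kx];
                    }
                }
                out[oy * OW + ox] = acc;
            }
        }
    }

In this sketch the scalar baseline issues roughly four instructions per kernel tap, so replacing the sequence with a fused load+MAC would cut the inner-loop instruction count by about half, which is consistent in spirit with the 10–30% whole-layer reductions reported above once loop overhead and memory traffic are included.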