DESIGN OVERVIEW

Background

Convolutional Neural Networks (CNNs) are a specialized type of artificial neural network, evolved from the concept of perceptrons. Perceptrons, introduced in the 1950s, are the simplest form of neural networks capable of binary classification. Over the decades, researchers expanded on this foundation, leading to the development of CNNs, which excel in handling spatial data like images. By introducing convolutional layers, pooling, and hierarchical feature extraction, CNNs have become a cornerstone of modern deep learning, particularly in image recognition and classification tasks.

Despite their effectiveness, CNNs often demand significant computational resources, which can limit their applicability in real-time systems or low-power environments. To address these limitations, CNN accelerators are designed to enhance the efficiency of inference, leveraging hardware-specific optimizations to deliver faster computation and reduced power consumption. These accelerators are especially valuable in scenarios requiring quick and reliable image classification.

Figure 25. MNIST Dataset

For this project, the MNIST dataset was chosen as the foundation for developing and testing the CNN accelerator. MNIST, consisting of 60,000 training images and 10,000 test images of handwritten digits (0-9), is a benchmark dataset widely used in the machine learning community. Its simplicity and structured format make it an ideal starting point for both algorithm development and hardware validation, providing a clear path to evaluate performance and accuracy.

By focusing on efficient inference of pre-trained CNN models using the MNIST dataset, this project showcases how domain-specific hardware can significantly improve the performance of AI applications while bridging the gap between algorithmic innovation and practical implementation.

Design Process

We started from scratch by first training and testing a Convolutional Neural Network (CNN) using PyTorch on the well-known MNIST dataset. This dataset, consisting of handwritten digit images, provided an ideal starting point for developing and testing our model. The training phase was the only stage in the process where the model parameters were updated. All subsequent steps utilized these pre-trained parameters directly, with no further adjustments or re-training.

Following the training, we rewrote the CNN model in Matlab, faithfully replicating its structure and integrating the parameters obtained from the PyTorch training phase. At this stage, we introduced 8-bit quantization to optimize the model for hardware deployment. This process significantly reduced the computational and memory requirements while maintaining acceptable accuracy. The quantized model was rigorously tested to ensure that its performance met the established benchmarks.

Once the quantized model's accuracy was validated, we proceeded to design the architecture of the hardware accelerator chip. The architecture incorporated pipeline stages to maximize throughput and optimize resource utilization.

After finalizing the design, the CNN structure was implemented in Verilog. We loaded the pre-trained parameters into the design and developed a comprehensive testbench to validate the functionality of the hardware implementation through detailed simulation.

The subsequent stages involved synthesis and place-and-route (PNR), during which we focused on achieving timing closure by addressing constraints such as setup and hold violations.

The final phase consisted of layout-level optimization and testing to ensure that the design was robust, efficient, and met all necessary requirements for fabrication.

The first semester concluded with the tape-out of the chip.

After the tape-out, we designed a PCB to test the fabricated chip. The PCB enabled comprehensive testing and performance validation of the chip in real-world scenarios.

Additionally, we conducted a performance comparison between the chip and other platforms, such as the Raspberry Pi and FPGA. These comparisons highlighted the efficiency and application-specific advantages of the custom-designed hardware accelerator.

This workflow not only demonstrated a comprehensive end-to-end approach to CNN hardware acceleration but also showcased the practical integration of software machine learning models with high-performance hardware design. The project serves as a valuable learning experience and a tangible step forward in the field of domain-specific hardware for AI applications.

EE6350 VLSI Design Lab
Spring 2024

NPU: Neural Processing Unit

DESIGN OVERVIEW

Background

Design Process

EE6350 VLSI Design Lab Spring 2024

NPU: Neural Processing Unit

DESIGN OVERVIEW

Background

Design Process

EE6350 VLSI Design Lab
Spring 2024