METHODOLOGY

Software Model

The software model is a foundational component of our NPU project, bridging the gap between high-level algorithm design and hardware implementation. It provides a functional simulation environment for validating the NPU's architecture and logic before physical deployment. It also acts as a reference for hardware verification, allowing simulated outputs to be compared directly against actual hardware behavior.

Built using PyTorch, the model implements a simplified version of a classic three-layer convolutional neural network (CNN). The architecture consists of two convolutional layers followed by a fully connected layer.

To optimize performance without compromising prediction accuracy, we focused on reducing the number of model parameters and computational cycles. Iterative testing allowed us to refine the design parameters and verify the correctness of the core algorithms. The resulting streamlined design minimizes resource usage, making it better suited to hardware implementation.
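To make the link between layer sizes, parameter count, and computational cycles concrete, the sketch below counts weights, biases, and multiply-accumulate (MAC) operations for a network with two convolutional layers and one fully connected layer. The layer sizes (a 28x28 single-channel input, 3x3 kernels, 4 and 8 channels, 2x2 pooling, 10 classes) and all function names are illustrative assumptions, not the dimensions of our model.

```python
# Hypothetical layer sizes, chosen only for illustration.

def conv_stats(in_ch, out_ch, k, in_size):
    """Return (params, macs, out_size) for a stride-1, unpadded k x k convolution."""
    out_size = in_size - k + 1
    params = out_ch * (in_ch * k * k + 1)  # weights plus one bias per output channel
    macs = out_ch * in_ch * k * k * out_size * out_size
    return params, macs, out_size


def fc_stats(in_features, out_features):
    """Return (params, macs) for a fully connected layer with biases."""
    params = out_features * (in_features + 1)
    macs = out_features * in_features
    return params, macs


def model_stats(in_size=28, c1=4, c2=8, k=3, pool=2, classes=10):
    """Total parameters and MACs for conv -> pool -> conv -> pool -> fc."""
    p1, m1, s1 = conv_stats(1, c1, k, in_size)
    s1 //= pool  # pooling shrinks each feature-map side
    p2, m2, s2 = conv_stats(c1, c2, k, s1)
    s2 //= pool
    p3, m3 = fc_stats(c2 * s2 * s2, classes)
    return {"params": p1 + p2 + p3, "macs": m1 + m2 + m3}


if __name__ == "__main__":
    print(model_stats())
    print(model_stats(c1=2, c2=4))  # halving the channel counts shrinks both totals
```

With these assumed sizes, most of the parameters sit in the fully connected layer while most of the MACs sit in the convolutional layers, which is why channel counts and feature-map sizes are the natural knobs to tune.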

The software model operates with 32-bit precision during development, whereas the final hardware implementation uses 8-bit precision, so quantization effects had to be considered carefully throughout the design process.
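As an illustration of the kind of quantization effect involved, the following minimal sketch maps floating-point values to signed 8-bit integers and back. It assumes a symmetric scheme with one shared scale per tensor. This is a common choice, but it is an assumption here and not necessarily the scheme used on the chip. The function names are hypothetical.

```python
def quantize_int8(values):
    """Map floats to signed 8-bit integers using one shared scale (assumed symmetric scheme)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the int8 range.
    q = [max(-128, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats from the quantized integers."""
    return [x * scale for x in q]


if __name__ == "__main__":
    weights = [0.8, -1.0, 0.3, 0.0]
    q, scale = quantize_int8(weights)
    restored = dequantize(q, scale)
    for w, r in zip(weights, restored):
        print(f"{w:+.4f} -> {r:+.4f} (error {abs(w - r):.5f})")
```

The rounding error per value is bounded by half a quantization step, and it is the accumulation of these errors across layers that has to be checked against the accuracy target.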

MATLAB Model

We used MATLAB's fixed-point number support to model the complete architecture of our chip, providing accuracy insights for the subsequent RTL design. Each layer was implemented manually with nested loops, using the weights and biases of the CNN. The model emulates the chip's architecture by executing the layers in sequence and applying the ReLU activation function. However, the MATLAB model does not account for timing, pipelining, or bandwidth limitations, all of which are critical in hardware. Special care is needed when changing bit-widths, as certain changes may misrepresent hardware behavior. The model achieved a prediction accuracy of 97% with 16-bit input and output signals, which dropped to 94% when all input and output signals were quantized to 8 bits.
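The nested-loop style of layer implementation can be sketched as follows. This is a Python analogue using plain integers in place of MATLAB's fixed-point types, not the MATLAB code itself. The right-shift requantization, the saturation step, and the function names are assumptions made for illustration.

```python
def saturate(x, bits=8):
    """Clamp x to the range of a signed integer with the given bit-width."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, x))


def conv2d_relu_fixed(image, kernel, bias, shift=0, bits=8):
    """Unpadded, stride-1 2-D convolution on integers, then ReLU and requantization."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    out = [[0] * out_w for _ in range(out_h)]
    for r in range(out_h):              # output rows
        for c in range(out_w):          # output columns
            acc = bias                  # wide accumulator, as in a MAC unit
            for i in range(k):          # kernel rows
                for j in range(k):      # kernel columns
                    acc += image[r + i][c + j] * kernel[i][j]
            acc = max(acc, 0)           # ReLU
            out[r][c] = saturate(acc >> shift, bits)  # drop LSBs, then clamp
    return out


if __name__ == "__main__":
    image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    kernel = [[1, 0], [0, 1]]
    print(conv2d_relu_fixed(image, kernel, bias=0))
    print(conv2d_relu_fixed(image, kernel, bias=0, shift=1))
```

Keeping the accumulator wide and narrowing only at the output is what makes the choice of bit-width at each stage visible in the model.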

Block Diagram & Datapath Structure

We created a block diagram for the chip based on the MATLAB model. Because of the limits on die area and I/O count, we built the architecture around four interconnected processing elements, which lets us perform multiple computations in parallel while reducing the number of inputs. The processing elements are followed by a quantization module and a ReLU module: quantization reduces the outputs, ReLU provides the activation, and comparators implement pooling. The design has three layers in total. The first two are convolutional layers and the third is the fully connected layer.
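Pooling with comparators can be illustrated with a short behavioral sketch. It assumes 2x2 max pooling with stride 2, where each window is reduced by three two-input comparisons. The window size, the stride, and the function names are assumptions for illustration and are not taken from the block diagram.

```python
def compare_max(a, b):
    """One two-input comparator: pass the larger operand through."""
    return a if a >= b else b


def max_pool_2x2(fmap):
    """2x2, stride-2 max pooling built from three comparators per window."""
    pooled = []
    for r in range(0, len(fmap) - 1, 2):
        row = []
        for c in range(0, len(fmap[0]) - 1, 2):
            top = compare_max(fmap[r][c], fmap[r][c + 1])
            bottom = compare_max(fmap[r + 1][c], fmap[r + 1][c + 1])
            row.append(compare_max(top, bottom))
        pooled.append(row)
    return pooled


if __name__ == "__main__":
    feature_map = [
        [1, 5, 2, 0],
        [3, 4, 8, 1],
        [0, 0, 9, 9],
        [7, 6, 2, 3],
    ]
    print(max_pool_2x2(feature_map))
```

Because a comparator needs no multiplier or adder, this form of pooling adds very little area next to the processing elements.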

RTL Coding

In our NPU core, we implemented all the functions in the block diagram using Verilog, with four modes in total. Timing plays a crucial role in this process, as the design must ensure that every block receives the correct signal at the appropriate cycle while maximizing the utilization of all clock windows to achieve the highest throughput. In manual mode, functions are implemented by manually controlling the timing. In automatic mode, an additional FSM (Finite State Machine) module is created to control the chip's timing automatically. In configuration mode, external signals are used to control the execution of operations for different layers. Finally, in test mode, the chip's operation can be halted, allowing us to inspect the data in each register.
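The role of the FSM in automatic mode can be sketched behaviorally as a sequencer that steps through the three layers and can be frozen for inspection, as in test mode. The state names, the start and layer_done signals, and the halt input are hypothetical. The actual FSM is written in Verilog and its states are not described here.

```python
from enum import Enum


class State(Enum):
    IDLE = 0
    CONV1 = 1
    CONV2 = 2
    FC = 3
    DONE = 4


# Hypothetical layer order: two convolutional layers, then the fully connected layer.
NEXT_STATE = {
    State.CONV1: State.CONV2,
    State.CONV2: State.FC,
    State.FC: State.DONE,
}


def step(state, start=False, layer_done=False, halt=False):
    """Return the next state after one clock edge of the assumed sequencer."""
    if halt:                      # frozen so register contents can be inspected
        return state
    if state is State.IDLE:
        return State.CONV1 if start else State.IDLE
    if state is State.DONE:
        return State.DONE
    return NEXT_STATE[state] if layer_done else state


if __name__ == "__main__":
    s = State.IDLE
    for inputs in [dict(start=True), dict(), dict(layer_done=True),
                   dict(halt=True), dict(layer_done=True), dict(layer_done=True)]:
        s = step(s, **inputs)
        print(inputs, "->", s.name)
```

In manual mode the same transitions would be driven from outside the chip instead of by the sequencer.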

TB Verification

Our Verilog testbench supports the behavioral model, synthesized netlist, placed-and-routed netlist, and extracted netlist. To ensure the reliability of our design, we adopted a statistical approach, utilizing a large dataset of 10,000 test vectors (e.g., images). The testbench automatically compares the results against our MATLAB-generated golden model. Since our system operates in multiple computing modes, we thoroughly tested all possible paths for each mode within the testbench. Detailed verification of each specific test case will be addressed in the verification chapter.
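The golden-model comparison amounts to checking every output vector against its reference and summarizing the result over the whole dataset. The sketch below expresses that check in Python for illustration only. The testbench itself is written in Verilog, and the function name and return format are assumptions.

```python
def compare_with_golden(dut_outputs, golden_outputs):
    """Return (match_count, mismatches); mismatches lists (index, dut, golden)."""
    if len(dut_outputs) != len(golden_outputs):
        raise ValueError("vector count mismatch")
    mismatches = [
        (i, d, g)
        for i, (d, g) in enumerate(zip(dut_outputs, golden_outputs))
        if d != g
    ]
    return len(dut_outputs) - len(mismatches), mismatches


if __name__ == "__main__":
    # Toy data standing in for per-image classification results.
    golden = [3, 1, 4, 1, 5]
    dut = [3, 1, 4, 1, 6]
    matches, mismatches = compare_with_golden(dut, golden)
    print(f"{matches}/{len(golden)} vectors match")
    for index, got, expected in mismatches:
        print(f"vector {index}: got {got}, expected {expected}")
```

Recording the index of each mismatch, and not only the total, makes it possible to replay a failing vector in isolation.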

SYN and PNR

We used Genus to synthesize the RTL behavioral model into a gate-level netlist, and checked that setup and hold times remained valid after back-annotation. Synthesis used existing scripts from Prof. Kinget's lab and targeted a 100 MHz clock frequency with a single clock domain. During clock tree synthesis, we set the target setup slack to 1 ns and the target hold slack to 2 ns. Because timing closure was straightforward, no custom datapath structures were introduced, and the synthesizer's netlist was accepted without modification.

For the digital core, we used Innovus for automatic place-and-route of the synthesized netlist, again using Prof. Kinget's scripts. The design was compiled into a single layout block. Placement density was kept under 30% to leave room for decoupling capacitors, resulting in a final chip core density of 16.77%. The decoupling capacitance was 52 pF for VDD_core and 177 pF for VDD_bf. The layout, which was not constrained by area, is described in more detail in the Layout chapter.
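The relationship between the 100 MHz target and the slack margins can be written out with the standard setup and hold checks for a register-to-register path. The path delays in the example are made-up numbers for illustration, not values from our timing reports, and the function names are hypothetical. Times are kept in integer picoseconds to avoid floating-point rounding.

```python
def clock_period_ps(freq_mhz):
    """Clock period in picoseconds for a frequency given in MHz."""
    return round(1_000_000 / freq_mhz)


def setup_slack_ps(period_ps, clk_to_q_ps, logic_ps, setup_ps, skew_ps=0):
    """Setup slack: required time minus arrival time.

    skew_ps is capture-clock arrival minus launch-clock arrival.
    """
    required = period_ps + skew_ps - setup_ps
    arrival = clk_to_q_ps + logic_ps
    return required - arrival


def hold_slack_ps(clk_to_q_ps, min_logic_ps, hold_ps, skew_ps=0):
    """Hold slack: earliest data arrival minus the hold requirement."""
    return (clk_to_q_ps + min_logic_ps) - (hold_ps + skew_ps)


if __name__ == "__main__":
    period = clock_period_ps(100)  # 100 MHz target clock
    # Hypothetical path delays in picoseconds.
    setup = setup_slack_ps(period, clk_to_q_ps=200, logic_ps=8000, setup_ps=100)
    hold = hold_slack_ps(clk_to_q_ps=200, min_logic_ps=300, hold_ps=50)
    print(f"period {period} ps, setup slack {setup} ps, hold slack {hold} ps")
    print("meets 1 ns setup target:", setup >= 1000)
```

A positive slack target leaves margin for delays that only appear after layout extraction.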

AMS

Cadence AMS (Analog Mixed-Signal) is a design and verification solution for analog, digital, and mixed-signal designs. It integrates simulation, layout, and verification tools, enabling engineers to model, simulate, and verify complex circuits. We used Cadence AMS to run mixed schematic/layout simulations. The synthesized netlist of the NPU core was connected to the schematics and layouts of the buffers and pad frame for verification. This confirmed that the entire circuit could still operate at the target frequency after accounting for layout-induced delays.

Custom Circuits

We designed several custom circuits for our system, including buffers and a Schmitt trigger. The buffers, implemented as typical inverter chains with varying numbers of stages, buffer signals entering and exiting the chip. The Schmitt trigger was designed specifically for the clock signal: its voltage hysteresis prevents glitches in the external clock source from causing multiple triggers in the internal clock tree. These fully custom circuits were designed manually in Virtuoso, for both schematic and layout, to meet our requirements for drive strength, slew rate, voltage hysteresis, and robustness against PVT variations. Further details about these designs are provided in the Custom Circuits section.
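The effect of hysteresis on a glitchy clock can be shown with a small behavioral model. It compares a Schmitt trigger against a single-threshold comparator on the same noisy edge. The threshold voltages and the sample waveform are made-up values for illustration, not the design values of our circuit.

```python
def schmitt_trigger(samples, v_low, v_high, initial=0):
    """Digital output per input sample; switches high at v_high and low at v_low."""
    out = []
    state = initial
    for v in samples:
        if state == 0 and v >= v_high:
            state = 1
        elif state == 1 and v <= v_low:
            state = 0
        out.append(state)
    return out


def count_rising_edges(bits, initial=0):
    """Number of 0-to-1 transitions, i.e. clock triggers seen downstream."""
    edges, prev = 0, initial
    for b in bits:
        if b == 1 and prev == 0:
            edges += 1
        prev = b
    return edges


if __name__ == "__main__":
    # One clock pulse with noise around mid-rail on both edges (hypothetical volts).
    wave = [0.0, 0.4, 0.55, 0.45, 0.6, 0.9, 1.0, 0.6, 0.45, 0.55, 0.2, 0.0]
    with_hysteresis = schmitt_trigger(wave, v_low=0.3, v_high=0.7)
    single_threshold = schmitt_trigger(wave, v_low=0.5, v_high=0.5)
    print("Schmitt trigger edges:  ", count_rising_edges(with_hysteresis))
    print("single threshold edges: ", count_rising_edges(single_threshold))
```

Setting both thresholds to the same value collapses the model into an ordinary comparator, which makes the benefit of the gap between the two thresholds easy to see.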


