ARCHITECTURE

We went through numerous architectural iterations before we were satisfied with all the blocks. The top-level block diagram of the final tape-out version is shown in Figure 1, and the expanded block diagram of the NPU Core is shown in Figure 2. For the purposes of this website tab, these two simplified diagrams are enough to introduce each functional block. We have also included the detailed version of our architecture diagrams at the end of this tab for readers who want to dig deeper. For a detailed description of each block, please refer to the user manual attached at the end of this tab.

Figure 1. Top-level block diagram

Figure 2. Block diagram of the NPU Core

The top-level diagram in Figure 1 contains three modules: FSM, NPU Core, and DFT. The NPU Core is the main compute block that we designed for CNN inference; its expanded version is shown in Figure 2. FSM stands for finite state machine, a control block that we designed to supply the correct control signals to the NPU Core according to the timing requirements. The flowchart of the FSM module is shown in Figure 3. DFT stands for design for testability, a set of techniques and methodologies applied during the design phase to make chip testing more efficient and effective. Our chip uses a scan chain for DFT.

Figure 3. Flowchart of the FSM module

Inside the NPU Core, several blocks work together to complete the inference. These blocks are introduced below:

PE (Processing Element)

The PE is responsible for performing convolution operations and primarily consists of three multipliers, one adder, and a multiplexer (MUX).
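As a rough behavioral sketch of such a datapath, the Python model below multiplies three pixel/weight pairs, sums the products, and uses a selector to decide whether a previous partial sum is added. The role given to the MUX here (choosing between zero and the running partial sum), the 3x3 window, and the names `pe_step` and `conv3x3` are assumptions made for illustration; they are not taken from the actual design.

```python
from typing import Sequence


def pe_step(pixels: Sequence[int], weights: Sequence[int],
            partial_sum: int, accumulate: bool) -> int:
    """One hypothetical PE step: three multiplies, one add, one MUX."""
    if len(pixels) != 3 or len(weights) != 3:
        raise ValueError("this sketch assumes exactly three multipliers")
    # Three multipliers working side by side.
    products = [p * w for p, w in zip(pixels, weights)]
    # MUX (assumed role): pick the running partial sum or zero.
    addend = partial_sum if accumulate else 0
    # Adder: combine the three products with the selected addend.
    return sum(products) + addend


def conv3x3(window: Sequence[Sequence[int]],
            kernel: Sequence[Sequence[int]]) -> int:
    """Convolve one 3x3 window by feeding the PE one row per step."""
    acc = 0
    for row_index, (pixel_row, weight_row) in enumerate(zip(window, kernel)):
        # The first row starts a fresh sum; later rows accumulate.
        acc = pe_step(pixel_row, weight_row, acc, accumulate=row_index > 0)
    return acc


window = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernel = [[1, 0, -1], [1, 0, -1], [1, 0, -1]]
result = conv3x3(window, kernel)
```

In this model, a 3x3 convolution takes three steps, one per kernel row, which is one plausible reason to pair three multipliers with a single adder.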

ReLU

ReLU stands for Rectified Linear Unit. It is the hardware realization of the standard ReLU function, which outputs the maximum of the block input and zero. If the input is greater than zero, the block passes it through; otherwise, it outputs zero.
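The function itself is small enough to state directly. The following Python sketch models the behavior on signed integers; the name `relu` is chosen here for illustration only.

```python
def relu(x: int) -> int:
    """Pass positive inputs through unchanged; map everything else to zero."""
    return x if x > 0 else 0


# Apply the function to a few signed sample values.
samples = [-7, -1, 0, 3, 120]
outputs = [relu(v) for v in samples]
```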

Quantization

The quantization module is responsible for quantizing the 16-bit output from the PE to 8 bits, thereby reducing the number of ports required for output data.
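The text does not say which quantization scheme the module uses. One common way to reduce a signed 16-bit value to 8 bits is an arithmetic right shift followed by saturation, sketched below in Python. The shift amount, the saturation step, and the name `quantize_16_to_8` are assumptions for illustration and may differ from the taped-out design.

```python
INT16_MIN, INT16_MAX = -32768, 32767
INT8_MIN, INT8_MAX = -128, 127


def quantize_16_to_8(value: int, shift: int = 8) -> int:
    """Reduce a signed 16-bit value to signed 8-bit (assumed scheme).

    `shift` is a hypothetical scaling parameter, not a value from the design.
    """
    if not INT16_MIN <= value <= INT16_MAX:
        raise ValueError("input must fit in a signed 16-bit word")
    if shift < 0:
        raise ValueError("shift must be non-negative")
    # Arithmetic right shift: discards the low bits and rounds toward -inf.
    shifted = value >> shift
    # Saturate so that small shift amounts cannot overflow the 8-bit range.
    return max(INT8_MIN, min(INT8_MAX, shifted))
```

With the default shift of 8, every 16-bit input already lands in the 8-bit range, so the saturation step only takes effect for smaller shifts.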

COMP

COMP stands for comparator. It takes two signed inputs, selects the larger, and compares it with the largest value seen so far. If the new value is greater, the block stores it as the new maximum and records its index. We use this block to perform max pooling.

Notably, we adopted four Processing Elements (PEs) because our neural network model has four channels. The same input must be convolved with pixels from different channels, so we store the pixels of each channel in a separate PE to minimize data exchange, which is time-consuming. The four PEs operate in parallel in a pipelined mode, which maximizes efficiency.
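The running-maximum behavior described for COMP can be modeled in a few lines of Python. The class name `Comparator`, the indexing convention (inputs numbered in arrival order, two per step), and the tie-breaking rules are assumptions made for this sketch rather than details of the actual block.

```python
from typing import Optional


class Comparator:
    """Behavioral sketch of a running max-with-index unit."""

    def __init__(self) -> None:
        self.max_value: Optional[int] = None  # largest value seen so far
        self.max_index: Optional[int] = None  # position of that value
        self._next_index = 0                  # index of the next first input

    def step(self, a: int, b: int) -> None:
        """Take two signed inputs and update the stored maximum."""
        # Stage 1: pick the larger of the two new inputs (a wins ties).
        if a >= b:
            candidate, index = a, self._next_index
        else:
            candidate, index = b, self._next_index + 1
        # Stage 2: compare with the previous maximum; update only if
        # the candidate is strictly greater.
        if self.max_value is None or candidate > self.max_value:
            self.max_value = candidate
            self.max_index = index
        self._next_index += 2


# Max pooling over a 2x2 window delivered as two input pairs.
comp = Comparator()
comp.step(3, -1)
comp.step(2, 7)
```

After the two steps, `comp.max_value` holds the pooled result for the window and `comp.max_index` holds the position at which it arrived.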

For more detailed information about our project, Columbia University users can access the datasheet using their UNI credentials at the following link: Project Datasheet.

EE6350 VLSI Design Lab
Spring 2024

NPU: Neural Processing Unit
