ARCHITECTURE
We went through numerous architectural iterations before we were satisfied with every block. The top-level block diagram of the final tape-out version is shown in Figure 1, and the expanded block diagram of the NPU Core is shown in Figure 2. For the purposes of this website tab, these two simplified diagrams are enough to introduce each functional block. We have also included the detailed versions of our architecture diagrams at the end of this tab for readers who want to dig deeper. For a detailed description of each block, please refer to the user manual attached at the end of this tab.
The top-level diagram in Figure 1 contains three modules: FSM, NPU Core, and DFT. The NPU Core is the main computing unit that we designed for CNN inference; its expanded version is shown in Figure 2. FSM stands for finite state machine, the control module that we designed to provide the correct control signals to the NPU Core according to the timing requirements. The flowchart of the FSM module is shown in Figure 3. DFT stands for design for testability, a set of techniques and methodologies applied during the design phase to make chip testing more efficient and effective. Our chip uses a scan chain for DFT.
Inside the NPU Core, several blocks work together to complete the entire inference. These blocks are introduced as follows:
PE (Processing Element)
The PE is responsible for performing convolution operations and primarily consists of three multipliers, one adder, and a multiplexer (MUX).
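As a rough behavioral illustration of the PE described above, the following Python sketch models three multipliers feeding one adder, with a MUX on the adder's feedback input. The role of the MUX (choosing between zero and the previous partial sum), the 3x3 kernel size, and the names `pe_step` and `conv3x3` are assumptions made for this example; the user manual describes the actual datapath.

```python
def pe_step(pixels, weights, partial_sum=0, accumulate=False):
    """One PE cycle: three products summed by a single adder.

    Assumption: the MUX selects either the previous partial sum
    (accumulate=True) or zero (accumulate=False) as the extra adder input.
    """
    if len(pixels) != 3 or len(weights) != 3:
        raise ValueError("the PE model takes exactly three pixel/weight pairs")
    # Three multipliers operating side by side.
    products = [p * w for p, w in zip(pixels, weights)]
    # MUX: feed back the running sum or start a new accumulation.
    feedback = partial_sum if accumulate else 0
    # Single adder combining the three products and the MUX output.
    return sum(products) + feedback


def conv3x3(window, kernel):
    """Hypothetical usage: one 3x3 convolution output from three PE cycles."""
    acc = 0
    for row_index, (pixel_row, weight_row) in enumerate(zip(window, kernel)):
        acc = pe_step(pixel_row, weight_row, acc, accumulate=row_index > 0)
    return acc
```

Under these assumptions, one output pixel of a 3x3 convolution takes three PE cycles, one per kernel row.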
ReLU
ReLU stands for Rectified Linear Unit. This block is the hardware realization of the standard ReLU function, which outputs the maximum of the block input and zero. If the input is greater than zero, the block passes the input through, whereas if the input is zero or negative, it outputs zero.
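The ReLU behavior described above can be written as a one-line software model. This is only a behavioral sketch on Python integers; the function name `relu` is chosen for illustration and says nothing about how the hardware block is implemented.

```python
def relu(x: int) -> int:
    """Return the input if it is positive, otherwise zero (max(x, 0))."""
    return x if x > 0 else 0
```

In hardware, this kind of function is often realized by checking the sign bit of a signed input, although the source does not state which approach this chip uses.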
Quantization
The quantization module is responsible for quantizing the 16-bit output from the PE to 8 bits, thereby reducing the number of ports required for output data.
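The text above does not specify how the 16-bit value is mapped to 8 bits, so the sketch below shows one common scheme as an assumption: an arithmetic right shift followed by saturation to the signed 8-bit range. The function name `quantize_16_to_8` and the `shift` parameter are hypothetical, and the actual module may use a different scaling or rounding rule.

```python
INT16_MIN, INT16_MAX = -32768, 32767
INT8_MIN, INT8_MAX = -128, 127


def quantize_16_to_8(value: int, shift: int = 8) -> int:
    """Reduce a signed 16-bit value to a signed 8-bit value.

    Assumed scheme (not taken from the chip documentation):
    arithmetic right shift by `shift` bits, then saturate to int8.
    """
    if not INT16_MIN <= value <= INT16_MAX:
        raise ValueError("input must fit in a signed 16-bit word")
    # Python's >> on integers is an arithmetic shift that rounds toward -inf.
    shifted = value >> shift
    # Clamp to the int8 range so large values saturate instead of wrapping.
    return max(INT8_MIN, min(INT8_MAX, shifted))
```

With `shift=8` the result always fits in 8 bits without clamping; smaller shift values keep more low-order precision at the cost of saturating large inputs.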
COMP
COMP stands for comparator. It takes two signed inputs and compares the larger of the two with the largest value encountered so far. If it is greater, the block stores it as the new largest value and records its index. We use this block to perform max pooling.
Notably, we adopted four Processing Elements (PEs) because our neural network model has four channels. The same input needs to be convolved with pixels from different channels. We store the pixels of different channels in separate PEs to minimize data exchange, which is more time-consuming. Meanwhile, these four PEs operate in parallel in a pipelined mode, thereby maximizing efficiency.
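As an illustration of the COMP block described above, the following Python sketch keeps a running maximum and its index while consuming two signed inputs per step. The class name `Comp`, the index numbering (inputs counted in arrival order, two per step), and the tie-breaking rule (the earlier input wins) are assumptions made for this example.

```python
class Comp:
    """Behavioral model of a running-maximum comparator."""

    def __init__(self):
        self.max_value = None   # largest value seen so far
        self.max_index = None   # position of that value in the input stream
        self._next_index = 0    # index of the first input of the next step

    def step(self, a: int, b: int):
        """Consume two signed inputs and update the stored maximum."""
        # First comparison: pick the larger of the two new inputs.
        if a >= b:
            larger, index = a, self._next_index
        else:
            larger, index = b, self._next_index + 1
        # Second comparison: larger input against the stored maximum.
        if self.max_value is None or larger > self.max_value:
            self.max_value, self.max_index = larger, index
        self._next_index += 2
        return self.max_value, self.max_index
```

Under these assumptions, a 2x2 max-pooling window would take two calls to `step` on a fresh `Comp` instance, one per pair of pixels.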
For more detailed information about our project, Columbia University users can access the datasheet using their UNI credentials at the following link: Project Datasheet.

