Architecture

Simplified Architecture

We went through many iterations of the architecture before we were satisfied with every block. The final taped-out version is V7.1. The top-level block diagram is shown in Figure 1, and the expanded block diagram of the NPU Core is shown in Figure 2. For the purpose of this website tab, these two simplified diagrams are enough to introduce each functional block. We also include the detailed version of our architecture diagrams at the end of this tab, in case someone wants to dig deeper. For a detailed description of each block and the timing mechanism, please refer to the user manual attached at the end of this tab.

[Diagram: on-chip top level containing three blocks: NPU Core, Registers, and FSM]

Figure 1. Top level block diagram


[Diagram: NPU Core containing two MAC blocks (multiplier and adder), each with a ReLU, plus PISO, COMP, FIFO, and Debug blocks]

Figure 2. NPU Core block diagram


There are three modules in the top-level diagram in Figure 1: FSM, NPU Core, and Registers. NPU Core is the main computing block that we designed for NN inference; its expanded version is shown in Figure 2. FSM stands for finite state machine. It is the control block that we designed to provide the correct control signals to NPU Core according to the timing requirements. Registers, as the name implies, is a set of registers that store the configuration options of NPU Core. We call these registers SSFR (Stationary Special Function Registers), since they remain stationary during one inference frame.

Inside NPU Core, several blocks work together to complete the entire inference. These blocks are introduced as follows:

MAC: MAC stands for Multiply and Accumulate, and it is the core computation block for matrix multiplication. It consists of a multiplier, an adder, and a flop. Each clock cycle it takes two new inputs, multiplies them together, and adds the product to the previously accumulated result.
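The multiply-then-accumulate behaviour can be sketched with a small Python model. This is only a behavioural illustration, not the RTL on the chip. The 8-bit inputs and the 16-bit saturating adder are taken from the complete NPU Core diagram (Figure 4). Treating the inputs as signed, clearing the accumulator to zero on reset, and the names Mac, step, and reset are assumptions made for this sketch.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    if value & (1 << (bits - 1)):
        return value - (1 << bits)
    return value


def saturating_add16(a, b):
    """Add two signed values and clamp the sum to the signed 16-bit range."""
    return max(-32768, min(32767, a + b))


class Mac:
    """Behavioural MAC model: multiplier, adder, and one accumulator register."""

    def __init__(self):
        self.acc = 0  # the "flop" holding the accumulated result

    def reset(self):
        # Assumption: reset clears the accumulator to zero.
        self.acc = 0

    def step(self, a, b):
        """One cycle: multiply two 8-bit inputs and accumulate the product."""
        product = to_signed(a, 8) * to_signed(b, 8)  # assumed signed inputs
        self.acc = saturating_add16(self.acc, product)
        return self.acc


if __name__ == "__main__":
    mac = Mac()
    for a, b in zip([1, 2, 3], [4, 5, 6]):
        mac.step(a, b)
    print(mac.acc)  # dot product 1*4 + 2*5 + 3*6 = 32
```

Feeding one row of a weight matrix and one input vector through step, element by element, yields one output of a matrix-vector product, which is why a single MAC is enough to describe the core computation.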

ReLU: ReLU stands for Rectified Linear Unit. It is the hardware realization of the standard ReLU function, which takes the maximum of the block input and zero. If the input is greater than zero, the block passes the input through; otherwise, it outputs zero.
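On a 16-bit two's-complement word, the ReLU decision reduces to checking the sign bit. The sketch below illustrates that idea in Python. The bypass input mirrors the BYPASS_ReLU signal in the complete NPU Core diagram (Figure 4), where the zero output is selected by Data_IN[15] & (!BYPASS_ReLU). The function name relu16 is hypothetical, and the enable register is left out.

```python
def relu16(data_in, bypass=False):
    """ReLU on a 16-bit two's-complement word.

    Returns 0 when the sign bit (bit 15) is set and bypass is off,
    otherwise passes the 16-bit word through unchanged.
    """
    data_in &= 0xFFFF
    sign_bit = (data_in >> 15) & 1
    if sign_bit and not bypass:
        return 0
    return data_in


if __name__ == "__main__":
    print(hex(relu16(0x0005)))               # positive input passes through
    print(hex(relu16(0xFFFB)))               # -5 is clamped to zero
    print(hex(relu16(0xFFFB, bypass=True)))  # bypass leaves -5 untouched
```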

PISO: PISO stands for Parallel-Input-Serial-Output. It latches several parallel inputs and shifts them out one at a time. We used this block to reduce the number of output ports.
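A parallel-in, serial-out register can be modelled as a small list that is loaded in one step and drained one entry per shift. The sketch below assumes four 8-bit stages, matching the two 16-bit ReLU outputs split into bytes in Figure 4. The byte order (low byte first), the zero fill behind shifted data, and the names Piso and split_bytes are assumptions for illustration.

```python
def split_bytes(word16):
    """Split a 16-bit word into (low byte, high byte)."""
    word16 &= 0xFFFF
    return word16 & 0xFF, (word16 >> 8) & 0xFF


class Piso:
    """Behavioural parallel-in, serial-out shift register of 8-bit stages."""

    def __init__(self, stages=4):
        self.regs = [0] * stages

    def clear(self):
        self.regs = [0] * len(self.regs)

    def load(self, values):
        """Latch all parallel inputs at once."""
        if len(values) != len(self.regs):
            raise ValueError("expected %d parallel inputs" % len(self.regs))
        self.regs = [v & 0xFF for v in values]

    def shift(self):
        """Emit the first stage and move the rest forward, filling with zero."""
        out = self.regs[0]
        self.regs = self.regs[1:] + [0]
        return out


if __name__ == "__main__":
    piso = Piso()
    # Assumed order: low byte before high byte for each 16-bit result.
    piso.load([*split_bytes(0x1234), *split_bytes(0x5678)])
    print([hex(piso.shift()) for _ in range(4)])
```

With this arrangement, four result bytes leave the block through a single 8-bit port over four shifts, instead of requiring a 32-bit output bus.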

FIFO: FIFO stands for First-in-First-Out. It acts as a temporary results pool for the NPU Core so that the user can choose to read the results later. This is a standard synchronous FIFO design, meaning that a single clock is used for both reading and writing.
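The single-clock behaviour can be illustrated with a short Python model in which one call represents one clock edge. The depth of 128, the 8-bit data width, and the gating of writes when full and reads when empty come from the notes in Figure 4. Holding the last value on data_out and the names SyncFifo and clock are assumptions of this sketch.

```python
from collections import deque


class SyncFifo:
    """Behavioural synchronous FIFO: reads and writes share one clock."""

    def __init__(self, depth=128):
        self.depth = depth
        self.mem = deque()
        self.data_out = 0  # assumption: output holds its last value

    @property
    def full(self):
        return len(self.mem) == self.depth

    @property
    def empty(self):
        return len(self.mem) == 0

    def clock(self, wr_en=False, rd_en=False, data_in=0):
        """Apply one clock edge using the flags as they were before the edge."""
        do_write = wr_en and not self.full   # write disabled if full
        do_read = rd_en and not self.empty   # read disabled if empty
        if do_read:
            self.data_out = self.mem.popleft()
        if do_write:
            self.mem.append(data_in & 0xFF)
        return self.data_out


if __name__ == "__main__":
    fifo = SyncFifo()
    for value in (10, 20, 30):
        fifo.clock(wr_en=True, data_in=value)
    print([fifo.clock(rd_en=True) for _ in range(3)])  # first in, first out
```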

COMP: COMP stands for comparator. It takes two signed inputs, picks the larger of the two, and compares it with the largest value encountered so far. If it is greater than the previous largest, the block stores it as the new largest value and records its index. We used this block to further accelerate the inference process of the final NN layer.
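This running-maximum search is effectively an argmax over the outputs of the last layer, and can be sketched as follows. The 16-bit signed inputs and the largest and index outputs follow the Auto Comparator in Figure 4. How the hardware numbers its inputs is not stated on this tab, so the index scheme here (first input of each pair gets the even index, second gets the odd index), the tie handling, the reset value, and the class name AutoComparator are all assumptions.

```python
def to_signed16(value):
    """Interpret the low 16 bits of value as a two's-complement integer."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


class AutoComparator:
    """Tracks the largest signed value seen so far and where it occurred."""

    def __init__(self):
        self.reset()

    def reset(self):
        self.largest = -32768  # assumption: start from the most negative value
        self.index = 0
        self._count = 0        # inputs consumed so far (hypothetical numbering)

    def step(self, in1, in2):
        """Compare one pair, then compare the winner with the stored maximum."""
        a, b = to_signed16(in1), to_signed16(in2)
        if a >= b:
            candidate, candidate_index = a, self._count
        else:
            candidate, candidate_index = b, self._count + 1
        if candidate > self.largest:  # strictly greater, as in the description
            self.largest, self.index = candidate, candidate_index
        self._count += 2
        return self.largest, self.index


if __name__ == "__main__":
    comp = AutoComparator()
    for pair in [(3, 0xFFFF), (7, 10), (10, 2)]:
        comp.step(*pair)
    print(comp.largest, comp.index)  # largest is 10, first seen at index 3
```

Used this way, the host only needs to read back one index for a classification result rather than every output of the final layer.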

MUX: MUX stands for multiplexer. It allows the user to map the results of different blocks to the output.

Debug: Debug stands for debugging engine. It has probes on many different blocks in the system, and it uses these probes to sample signals and shift the samples out one by one. We designed this engine for post-silicon debugging so that we can read the chip's internal signals.


Complete Architecture

There is a complex timing mechanism behind this architecture, and all of it is documented in the user manual attached at the end of this tab. The following are the complete top-level block diagram and the complete NPU Core diagram. They are good references when going through the timing diagram in Figure 5.

[Diagram: complete on-chip top level. Blocks and labels shown: FSM, FSM control, 16-bit down counter, FSM_ACC, FSM_OUT, External control, Stationary Special Function Registers (SSFR[15:0]), and NPU Core. Signals labelled: DA[7:0], DB[7:0], DC[7:0], DD[7:0], D_OUT[7:0], CLKEXT, RST_GLO, EN_FSM, EN_CONFIG, SEL_CON, RST_CTR, RST_MAC, EN_ReLU, CTR_OUT, OUT_DONE, WR_EN (for FIFO), RD_EN, FULL, EMPTY, EN_PISO_DEB, CLR_PISO_DEB, SHIFT_DEB.]

Figure 3. Complete top level block diagram


[Diagram: NPU Core Architecture V7.1. Created by CircuitLab https://www.circuitlab.com/]

Blocks shown in the diagram:

Input Buffer: 8-bit inputs DA[7:0], DB[7:0], DC[7:0], DD[7:0]; controls EN_BUF_IN and CLR_BUF_IN; clocked by CLKEXT.
MAC module (two instances, with BIAS_IN1 and BIAS_IN2): an 8-bit multiplier (A[7:0], B[7:0], OUT[15:0]) feeding a 16-bit 2CASC (A[15:0], B[15:0], OUT[15:0]); controls EN_MAC and RST_MAC.
16-bit 2's Complement Adder with Saturation Control (16-bit 2CASC): a 17-bit adder on the sign-extended inputs {a[15],a[15:0]} and {b[15],b[15:0]}, with "7FFF" and "8000" as the saturation values, selected using OUT[16] and OUT[15]; output ADD_OUT[15:0].
ReLU module (two instances): IN[15:0], OUT[15:0], EN_ReLU, BYPASS_ReLU (BYPASS_ReLU1 and BYPASS_ReLU2); the zero output is selected by Data_IN[15] & (!BYPASS_ReLU).
Parallel-in, Serial-out Shift Register (PISO), output instance PISO_OUT: inputs ReLU1_OUT[7:0], ReLU1_OUT[15:8], ReLU2_OUT[7:0], ReLU2_OUT[15:8]; controls SHIFT_OUT, EN_PISO_OUT, CLR_PISO_OUT; clocked by CLKEXT.
PISO, debug instance PISO_DEB: controls SHIFT_DEB, EN_PISO_DEB, CLR_PISO_DEB; clocked by CLKEXT.
Auto Comparator: in1[15:0], in2[15:0], trig, enable, reset, index[7:0], largest[15:0]; controls EN_COMP and RST_COMP.
Output Synchronous FIFO, depth 128: wr_en, rd_en, enable, rst, data_in[7:0], data_out[7:0], empty, full; controls EN_FIFO and RST_FIFO; wr_en is driven by (!FULL) & WR_EN and rd_en by (!EMPTY) & RD_EN.
MUX3 and the output selection SEL_OUT[2:0], driving D_OUT[7:0].

Notes: red numbers mean bus width. Green signals come from SSFR.

Notes for FIFO:
WR_EN comes from FSM
RD_EN comes from user input (external source)
Write operation disabled if full
Read operation disabled if empty

Stationary Special Function Registers (SSFR[15:0])
SSFR[15]: SEL_OUT[2]
SSFR[14]: SEL_OUT[1]
SSFR[13]: SEL_OUT[0]
SSFR[12]: BYPASS_ReLU1
SSFR[11]: BYPASS_ReLU2
SSFR[10]: EN_COMP
SSFR[9]: RST_COMP
SSFR[8]: EN_FIFO
SSFR[7]: RST_FIFO
SSFR[6:0]: unused, default as 0

Control Signals (CON_SIG[15:0])
CON_SIG[15]: EN_BUF_IN
CON_SIG[14]: CLR_BUF_IN
CON_SIG[13]: EN_MAC
CON_SIG[12]: RST_MAC
CON_SIG[11]: EN_ReLU
CON_SIG[10]: SHIFT_OUT
CON_SIG[9]: EN_PISO_OUT
CON_SIG[8]: CLR_PISO_OUT
CON_SIG[7]: WR_EN
CON_SIG[6]: CTR_OUT (FSM debugging)
CON_SIG[5]: OUT_DONE (FSM debugging)
CON_SIG[4:0]: unused, connected to ground

Figure 4. Complete NPU Core block diagram
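The SSFR bit assignments listed with Figure 4 can be turned into a small helper for building and inspecting a 16-bit configuration word. The bit positions below are taken from that list. The function names encode_ssfr and decode_ssfr and the keyword arguments are hypothetical and exist only for this sketch; how the word is actually written to the chip is described in the user manual, not here.

```python
# Bit positions of the single-bit fields, from the SSFR list in Figure 4.
SSFR_BITS = {
    "BYPASS_ReLU1": 12,
    "BYPASS_ReLU2": 11,
    "EN_COMP": 10,
    "RST_COMP": 9,
    "EN_FIFO": 8,
    "RST_FIFO": 7,
}
SEL_OUT_SHIFT = 13  # SEL_OUT[2:0] occupies SSFR[15:13]


def encode_ssfr(sel_out=0, **flags):
    """Pack SEL_OUT and the single-bit flags into a 16-bit SSFR word."""
    word = (sel_out & 0b111) << SEL_OUT_SHIFT
    for name, value in flags.items():
        if value:
            word |= 1 << SSFR_BITS[name]  # KeyError on an unknown field name
    return word  # SSFR[6:0] are unused and stay 0


def decode_ssfr(word):
    """Unpack a 16-bit SSFR word into a dictionary of named fields."""
    word &= 0xFFFF
    fields = {"SEL_OUT": (word >> SEL_OUT_SHIFT) & 0b111}
    for name, bit in SSFR_BITS.items():
        fields[name] = (word >> bit) & 1
    return fields


if __name__ == "__main__":
    word = encode_ssfr(sel_out=5, EN_COMP=1, EN_FIFO=1)
    print(hex(word))  # 0xa500
    print(decode_ssfr(word))
```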



Figure 5. Timing diagram for different NPU operation modes


Click to download

NPU V7.1 Datasheet + User Manual
