Demo 2: FPGA+NPU

In demo 2, we configured the FPGA to serve as the host of NPU. A detailed block diagram is shown in Figure 1. The motivation for designing a second demo is to verify that NPU is able to operate under a much higher clock frequency(the fastest we could get is 6.25 MHz), compared with the 2 kHz frequency scenario in Demo 1: RPI+NPU.

As mentioned in Verification, we reprogrammed the Verilog testbench(TEST 5) as synthesizable RTL code, which we called the scheduler. The scheduler keeps an internal counter, and we specify the NPU's input/control signals for each counter value.

We also hard programmed the weights and biases of the neural network(param reg) and four images(image reg) into the FPGA, and connected a monitor to display the current image under inference. A 7-segment display on the FPGA board prints the recognition result of the current image.


figure1

Figure 1. Demo 2 block diagram


The FPGA host is connected to the NPU through the male headers on the right of Figure 2. As mentioned in PCB Design, the PCB board for the NPU has matching female headers, allowing for a direct plug in.


figure2

Figure 2. FPGA board programmed switches and buttons


We programmed 3 buttons and 1 switch on the FPGA board for demo 2.

The first button is called clk_rst. As the FPGA board runs at 50 MHz and the NPU runs at a much lower frequency, a clock divider module is included in the FPGA program to help provide clock for the NPU, which depends on the clk_rst signal to initiate and synchronize.

The rst_button resets the internal counter of the scheduler. It restarts the inference process on the current image. This button comes in handy when we try to probe signals out of the system with a logic analyzer.

The img_button helps switch to and start the inference process on the next image. When the four images are exhausted, it returns to the first image in image reg.

The start_switch enables the inference process. If start_switch is 0 the internal counter of the scheduler resets to and stays at 0.

Typical sequence of demo 2 is to push clk_rst first, rst_button second, and then flip the start_switch. This is when we will see the seven-segment display showing the first recognition result. Then pressing img_button would help us verify the prediction of the following images.

The actual connection of the FPGA board and NPU PCB is shown in Figure 3. The entire demo view is shown in Figure 4. In Figure 4, the monitor is displaying the image under processing, and the 7-segment LED is showing the recognition result. In this case the image is a hand-written seven and the LED is showing the current result.


figure3

Figure 3. NPU PCB - FPGA board connection through direct plug in



figure4

Figure 4. Entire demo 2 setup


We also find it interesting(and to exonerate ourselves from cheating) to look at the logic analyzer results, with which we probed four signals. The results are shown in Figure 5.


figure5

Figure 5. Logic analyzer results


The four signals, from top to bottom, are the NPU clock, DA0(one of the input), DOUT0(one of the output), and RD_EN, which is a control signal. RD_EN is used to control the output FIFO, and in the scheduler we pull RD_EN high each time two neurons are ready(as we have two pipelines). Therefore, we can easily see that computing two neurons takes a longer time in the beginning, and becomes very fast towards the end. This is due to the structure of our neural network. To compute the 16 first hidden layer neurons, as we have 256 input neurons, it would require 256 MAC operations, and therefore around 256 cycles between each RD_EN high. For the second hidden layer, the inputs are the 16 first hidden layer neurons, and therefore around 16 cycles between each RD_EN high, and the same goes for the output layer.

Comments:

We spent some time debugging the scheduler Verilog code for synthesis.

As we managed the posedge and negedge blocks separately, we included some signals both in the posedge and negedge, resulting in a multiple driver issue. The multiple driver issue is considered a warning instead of an error in Genus, and our code involves a lot of other operations that could potentially go wrong, like multiple array packing and unpacking, and array muxes. Therefore it took us days to identify this multiple driver issue.

When we tried to synthesize the code using Quartus, another problem got in the way. Apparently our scheduler module was too complex and could not be synthesized, so we had to split this module into two modules.

Although we synthesized the NPU core at 100 MHz, we were only able to clock it at 6.25 MHz(1/8 of the FPGA clock). We suspected that this problem is due to the very large VDD ringing on the PCB.



Back to top