Conclusions
In conclusion, we successfully translated the NPU concept into a chip design using a modern, systematic methodology. We deployed the NPU in two application-based demos to demonstrate hardware acceleration of matrix multiplication and its importance to AI computing. Along the way, we overcame the challenge of integrating multiple engineering disciplines and of translating between different levels of abstraction and hierarchy.
At the concept level, we started from an application problem we wanted to solve and chose an AI approach. We then abstracted the problem step by step: from image recognition to a neural network, to matrix multiplication, and finally to the basic MAC operation, which can be realized with standard hardware structures such as adders and multipliers. We designed the datapath bottom-up, starting from a basic MAC module that provides the essential computation and expanding it into the full architecture.
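The reduction from matrix multiplication to MAC operations described above can be illustrated in software. The following is a minimal sketch only, not the NPU datapath itself: the function names mac and matmul_via_mac are hypothetical, and the loop order is an arbitrary choice that says nothing about how the hardware schedules its MAC units.

```python
def mac(acc, a, b):
    """One multiply-accumulate step: returns acc + a * b."""
    return acc + a * b


def matmul_via_mac(A, B):
    """Multiply matrices A (rows x inner) and B (inner x cols) using only MAC steps.

    Illustrative software model; matrices are plain lists of lists.
    """
    rows, inner, cols = len(A), len(B), len(B[0])
    if any(len(row) != inner for row in A):
        raise ValueError("inner dimensions of A and B do not match")

    C = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = 0
            # Each output element is a dot product: a chain of `inner` MAC steps.
            for k in range(inner):
                acc = mac(acc, A[i][k], B[k][j])
            C[i][j] = acc
    return C


if __name__ == "__main__":
    print(matmul_via_mac([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

Each output element needs one MAC per element of the shared dimension, so a product of an M x K and a K x N matrix takes M * K * N MAC steps in this model. That count is what makes the MAC a natural unit to replicate in hardware.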
In the digital flow, we wrote the RTL, synthesized it, verified it, and took it through place and route to a final layout. We also designed full-custom circuits for our special needs and verified the integrated chip from multiple perspectives.
We designed a PCB as a system-level platform for the NPU and its peripheral circuits, and customized it to be compatible with both demos. We proposed several novel CV algorithms for image pre-processing and alignment, as well as a novel approach to reliable timing control on a serial CPU platform. We programmed a scheduler on an FPGA to enable a high-performance computing mode. Finally, we 3D printed an enclosure that houses the entire demo system as a single integrated unit.
All blocks within the NPU have been validated as fully functional. We did not have enough time to push the chip to a higher operating frequency, but this limitation can likely be addressed by optimizing the PCB design.

