Title: Memory-Aware LLM Inference Processor Designs
Abstract:
The rapid advancement and deployment of large language models (LLMs) have led to a dramatic surge in computational demands, placing pressure on semiconductor memory systems. In particular, LLM inference workloads require both large memory capacity and high memory bandwidth. In this talk, I will present recent research results from my lab that address these bottlenecks through memory-aware hardware-algorithm co-design. First, I will introduce a hardware accelerator architecture designed to efficiently support weight-only quantization, which effectively increases both memory capacity and bandwidth efficiency. Our design is based on a theoretical analysis showing that the same level of numerical accuracy can be achieved using only integer arithmetic for floating-point/integer (FP-INT) operations, a departure from prior methods that relied on integer units with numerical approximations for FP arithmetic results. Second, I will present a near-memory LLM inference architecture leveraging 3D DRAM-to-logic hybrid bonding (HB) technology. This architecture delivers DRAM PIM-level efficiency for GEMV computations and NPU-class performance for GEMM operations via a dynamically reconfigurable dataflow.
Bio:
Jae-Joon Kim is currently a professor at Seoul National University (SNU), Seoul, Korea. He is also a co-founder of SqueezeBits, a startup that specializes in AI inference optimization and model compression. Before joining SNU, he was a professor at POSTECH, Korea, from 2013 to 2021, and a Research Staff Member at the IBM T. J. Watson Research Center from 2004 to 2013. His current research interests include memory-aware AI processor design, AI inference optimization, workload-optimized semiconductor memories, and low-power VLSI design.
Host: Mingoo Seok