Past Event

EE Seminar: Dr. Shilong Liu

March 5, 2026
12:15 PM - 1:15 PM
CEPSR 750

Title: From Perception to Creation: Building Self-Evolving Multimodal Agents

Abstract: Multimodal foundation models and agents are rapidly improving in perception and language, yet they remain largely static and rarely self-evolve through experience. In this talk, I present a research journey toward self-evolving multimodal agents, organized along a progression from perception to creation. I begin with object-centric visual perception that turns pixels into actionable spatial understanding, from transformer-based detection in DAB-DETR and DINO to open-world language grounding in Grounding DINO. I then move beyond static training with test-time scaling that iteratively refines visual evidence. Building on these foundations, I introduce tool-using agents, from supervised tool learning in LLaVA-Plus to robustness on unseen interfaces via evaluation and feedback in CubeBench and CubeAgent, and further to experience-driven web planning in Avenir-Web. I conclude with Alita and Alita-G as steps toward tool and agent creation, outlining how reusable capabilities can be constructed and accumulated over time.

Bio: Dr. Shilong Liu is a Postdoctoral Research Fellow at the AI Lab, Princeton University. He received his Ph.D. from Tsinghua University. Prior to Princeton, he gained research experience at ByteDance Seed, NVIDIA, Microsoft Research, and IDEA Research. His research focuses on multimodal agents, multimodal learning, and computer vision. His work has accumulated over 15,000 Google Scholar citations and 30,000 GitHub stars.

He is the first author of Grounding DINO, the most-downloaded zero-shot object detector on Hugging Face. His contributions have been recognized with several prestigious honors, including the WAIC Yunfan Award Rising Star (2024), the KAUST AI Rising Star (2024), and the CCF-CV Academic Emerging Scholar Award (2023).

Host: Asaf Cidon