Graduate Student Xilin Jiang Wins Best Paper Award at IEEE WASPAA

The award-winning paper explores how artificial intelligence models “hear” and “see” the world, uncovering similarities—and blind spots—between human and machine perception.

By Xintian Tina Wang
November 5, 2025

EE graduate student Xilin Jiang, advised by associate professor of Electrical Engineering Nima Mesgarani, has received a Best Paper Award at the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). His study examines how artificial intelligence perceives sound and vision, and it suggests that large language models (LLMs) may share some of the same sensory blind spots as humans.

Jiang’s paper, “Bridging Ears and Eyes: Analyzing Audio and Visual Large Language Models to Humans in Visible Sound Recognition and Reducing Their Sensory Gap via Cross-Modal Distillation,” investigates how AI systems process the world through auditory and visual cues. Working with collaborators from the University of Washington and Columbia’s Mesgarani Lab, Jiang compared how humans and multimodal AI models identify real-world sounds—like a parrot talking or a person coughing—using audio-only, video-only, or combined audio-visual inputs.

The study found that both humans and AIs struggle with similar sensory ambiguities. “You may be fooled by a parrot mimicking human speech if you only listen,” Jiang explained. “But once you see the parrot, you immediately know it’s not a person.” The same goes for sight—without sound, subtle actions like coughing can easily go unnoticed.

Jiang and his team then took the research further, introducing a novel cross-modal distillation framework, in which an audio AI learns from a vision AI, and vice versa. This approach—essentially allowing the “audio AI” to imagine what it would see from the audio—led to a 20 percent increase in recognition accuracy, matching models that process both sound and vision together.
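For readers curious how one model can "learn from" another, the sketch below shows a generic knowledge-distillation objective of the kind often used for this purpose. It is an illustration under stated assumptions, not the paper's actual method: the article does not specify the loss, so the temperature-scaled KL term, the `alpha` weighting, the three class names, and all logit values are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert raw scores to probabilities, softened by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Generic distillation loss (an assumption, not the paper's formula).

    Combines cross-entropy against the true label with a KL term that
    pulls the student's softened predictions toward the teacher's.
    """
    hard = -math.log(softmax(student_logits)[label])
    soft = kl_divergence(softmax(teacher_logits, temperature),
                         softmax(student_logits, temperature))
    return alpha * hard + (1 - alpha) * temperature ** 2 * soft


# Hypothetical clip of a parrot mimicking speech (true class index 1).
CLASSES = ["human speech", "parrot talking", "coughing"]
audio_student_logits = [2.0, 1.5, -1.0]    # audio model leans toward speech
vision_teacher_logits = [-1.0, 4.0, -2.0]  # vision model clearly sees a parrot

loss = distillation_loss(audio_student_logits, vision_teacher_logits, label=1)
print(f"distillation loss: {loss:.3f}")
```

In this toy setup, the loss shrinks as the audio model's predictions move toward the vision model's, which is the sense in which the student is nudged to "imagine" what the teacher sees. Swapping the two roles gives the reverse direction the article mentions.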

“This paper is special to me because it’s more philosophical than my other work,” said Jiang. “It started from a simple question: do AIs experience the world like we do? The experiments were intuitive, and we weren’t sure if others would buy into the story. Winning this award is both surprising and deeply meaningful.”

Looking ahead, Jiang hopes this work will inspire deeper exploration into AI perception. “AI perception is still very understudied,” he said. “There’s so much more we can do to make AI not just more powerful—but more human.”

The IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) is a highly regarded biennial event that the Audio and Acoustic Signal Processing Committee of the IEEE Signal Processing Society has hosted since 1986. The two-and-a-half-day workshop is devoted to reviewing the current state of the art as well as recent advances in signal processing, with an emphasis on applications to audio and acoustics. It brings together researchers and practitioners from universities and industry.