A team of students from Columbia Electrical Engineering (Linyang He, Qiaolin Wang, and Xilin Jiang, advised by Associate Professor of Electrical Engineering Nima Mesgarani) has been selected for a Senior Area Chair Highlight at the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP). The distinction identifies their work as one of 30 papers selected from nearly 8,000 submissions, placing it in roughly the top 0.4 percent and among the most competitive work recognized at EMNLP 2025.
Their paper, "Layer-wise Minimal Pair Probing Reveals Contextual Grammatical-Conceptual Hierarchy in Speech Representations," introduces the first large-scale minimal-pair probing framework applied to speech models. Minimal-pair testing, long used in linguistics and natural language processing, allows researchers to isolate specific grammatical or semantic differences by comparing two nearly identical sentences that differ by only one linguistic feature. The Columbia team synthesized more than one hundred thousand such controlled sentence pairs and used them to examine how deeply and consistently speech models encode grammar and meaning.
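For readers curious what layer-wise minimal-pair probing can look like in practice, here is a minimal sketch in Python. It is not the team's pipeline: random vectors stand in for pooled hidden states from a speech model, a simple difference-of-means linear classifier stands in for whatever probe the paper uses, and all function names (`probe_layer`, `layerwise_probe`, `make_synthetic`) are hypothetical. The idea it illustrates is the general one: for each layer, train a small probe to tell the acceptable sentence in a pair from its altered twin, then compare held-out accuracy across layers.

```python
import numpy as np


def probe_layer(acc, unacc, train_frac=0.8):
    """Held-out accuracy of a difference-of-means linear probe for one layer.

    acc, unacc: arrays of shape (n_pairs, dim) holding features for the
    acceptable and unacceptable member of each minimal pair.
    """
    n_train = int(len(acc) * train_frac)
    # Fit: the probe direction is the difference between class means.
    mu_pos = acc[:n_train].mean(axis=0)
    mu_neg = unacc[:n_train].mean(axis=0)
    w = mu_pos - mu_neg
    b = -0.5 * (w @ (mu_pos + mu_neg))  # boundary midway between the means
    # Evaluate on pairs the probe has not seen.
    pos_correct = (acc[n_train:] @ w + b) > 0
    neg_correct = (unacc[n_train:] @ w + b) <= 0
    return float(np.concatenate([pos_correct, neg_correct]).mean())


def layerwise_probe(acc, unacc):
    """acc, unacc: (n_pairs, n_layers, dim). Returns one accuracy per layer."""
    return [probe_layer(acc[:, l], unacc[:, l]) for l in range(acc.shape[1])]


def make_synthetic(n_pairs=400, n_layers=5, dim=16, signal_layer=2, seed=0):
    """Toy stand-in for model features (an assumption, not real data).

    Only `signal_layer` carries a feature that separates the two classes.
    """
    rng = np.random.default_rng(seed)
    acc = rng.normal(size=(n_pairs, n_layers, dim))
    unacc = rng.normal(size=(n_pairs, n_layers, dim))
    acc[:, signal_layer, 0] += 4.0  # inject a linearly separable cue
    return acc, unacc


if __name__ == "__main__":
    accuracies = layerwise_probe(*make_synthetic())
    for layer, a in enumerate(accuracies):
        print(f"layer {layer}: held-out probe accuracy {a:.2f}")
```

In this toy setup the probe should score near chance on every layer except the one where the cue was injected. With a real model, the features would instead come from each transformer layer's output for the two spoken sentences in a pair.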
The findings challenge longstanding assumptions about what speech models can learn from audio alone. The researchers discovered that modern self-supervised speech models, despite receiving no textual supervision, encode hierarchical grammatical structure far more robustly than previously believed. In many cases, these audio-only models match or even surpass the performance of models trained with explicit text outputs. Their analysis also shows that grammatical information—such as syntax, agreement, and the interface between syntax and semantics—emerges reliably in mid-level transformer layers, while conceptual meaning remains significantly harder for models to capture.
The study also offers new insight into how training objectives shape linguistic representations. Self-supervised models tend to concentrate grammatical knowledge in the middle layers before shifting focus toward pretraining-specific features in the final layers. In contrast, automatic speech recognition systems and audio-language models preserve or deepen linguistic structure at later stages, reflecting their text-oriented objectives. These results open the door to understanding not just the capabilities of speech models, but the internal mechanisms that govern how they learn.
By bridging techniques from NLP with cutting-edge speech processing, the team provides one of the clearest maps to date of how linguistic structure is encoded across large-scale audio models. Their work arrives at a pivotal moment, as the field seeks more interpretable and reliable speech-language systems for real-world applications.
The full paper is available here.