I am a PhD student at Columbia University, working with Prof. Dan Ellis and Prof. Nima Mesgarani. I am interested in machine learning algorithms with applications in auditory source separation, speech enhancement, and automatic speech recognition. You can find my CV here


Deep attractor network


Together with Yi Luo and Prof. Nima Mesgarani, we developed a generalization of deep clustering that allows direct end-to-end optimization for multi-speaker separation. See the demo here

Deep music separation


Together with Yi Luo, Nima Mesgarani, Jonathan Le Roux, and John Hershey, we substantially advanced the state of the art in music separation with our model, the Chimera network, and achieved the best performance on the MIREX 2016 singing voice separation track. See the demo here

Neural decoding of attentional selection in multi-speaker separation


Together with James O'Sullivan, Sameer Sheth, Guy McKhann, Ashesh Mehta, and Nima Mesgarani, we created a device that lets a listener directly separate out a target speaker with high quality, using the listener's own neural attention as the guiding cue. We see this as the next step for the hearing aid industry.

Adaptation of neural networks constrained by prior statistics of node co-activations


Together with Tasha Nagamine and Nima Mesgarani, we created an unsupervised model for neural network adaptation, which substantially improves the robustness of ASR systems in noisy environments.

End-to-End Attention based Speaker Verification


Together with Shixiong Zhang, Yong Zhao, Jinyu Li, and Yifan Gong, we created an end-to-end speaker verification model, the first deep-learning-based model that can be used for both text-dependent and text-independent speaker verification.



Together with several friends, I founded this company, which specializes in speech enhancement and audio source separation. I left the company in 2016.

Deep clustering


Together with John Hershey, Jonathan Le Roux, Shinji Watanabe, and Yusuf Isik, we invented this technique, which improved the previous state-of-the-art performance threefold. It was the first method to achieve high-quality separation of overlapping unknown speakers, with an unknown number of speakers. See the demo here
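The core idea of deep clustering can be sketched as follows: a network maps each time-frequency (T-F) bin of the mixture spectrogram to an embedding vector, trained so that bins dominated by the same speaker land close together; at inference, clustering the embeddings (e.g. with k-means) yields binary separation masks. A minimal toy sketch of the clustering step, with the network omitted, synthetic embeddings standing in for its output, and all shapes and names purely illustrative:

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Minimal k-means on the rows of X; returns a cluster index per row."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Squared distance from every point to every center.
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dist.argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
    return labels

def masks_from_embeddings(emb, n_speakers):
    """emb: (T, F, D) per-bin embeddings -> list of (T, F) binary masks."""
    T, F, D = emb.shape
    labels = kmeans(emb.reshape(T * F, D), n_speakers).reshape(T, F)
    return [(labels == j).astype(float) for j in range(n_speakers)]

# Synthetic embeddings: two well-separated point clouds stand in for what a
# trained network would produce for a two-speaker mixture.
rng = np.random.default_rng(1)
sign = np.where(rng.random((4, 6, 1)) < 0.5, -1.0, 1.0)
emb = np.repeat(sign, 3, axis=2) + 0.1 * rng.standard_normal((4, 6, 3))

masks = masks_from_embeddings(emb, n_speakers=2)
# The masks partition the T-F plane: each bin is assigned to one speaker,
# and each mask is multiplied with the mixture spectrogram to recover a source.
```

Because the objective depends only on whether two bins share a cluster, not on cluster identity, the same trained network handles any number of speakers; at test time one simply runs k-means with the desired k.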

The 3rd Chime challenge


Together with researchers at MERL and SRI, we achieved the second-best performance in the 3rd CHiME challenge, an international challenge for automatic speech recognition in noisy environments.

The IARPA ASpIRE challenge


Together with researchers at MERL and BBN, we won the IARPA ASpIRE challenge, an international ASR evaluation under highly corrupted acoustic conditions.


Columbia University


PhD Candidate, Laboratory for the Recognition and Organization of Speech and Audio (LabROSA), GPA: 3.9

Focused on automatic speech recognition, speech enhancement, and source separation using deep learning and Bayesian statistical models.

Columbia University


M.S. in Electrical Engineering, GPA: 3.65.

Focused on advanced signal processing techniques and the foundations of statistical modeling and optimization tools for signal processing.

Xi’an Jiaotong University


B.S. in Electrical Engineering, GPA: 3.4 | Minor in Economics, GPA: 3.2

Focused on the foundations of signal processing and on power grid optimization.




Research Intern | Bellevue, USA

  • Focused on single- and multi-channel speech recognition in far-field and mismatched conditions.
  • Focused on text-independent, end-to-end speaker verification using attentional neural networks.

Jelinek Speech and Language Technologies Workshop


FFS Team Member | Seattle, USA

  • Focused on speech recognition in far-field and mismatched conditions.
  • Multi-channel auditory source separation using recurrent neural networks.

Mitsubishi Electric Research Laboratories


Intern Researcher | Boston, USA

  • Robust automatic speech recognition and enhancement in heavily noisy and mismatched conditions.
  • Deep learning with Long Short-Term Memory neural networks, with application to auditory source separation.
  • Sequence embedding with deep neural networks, with application to speaker recognition.

International Computer Science Institute


Research assistant | Berkeley, USA

  • Robust front end for multi-language automatic speech recognition.
  • Robust decomposition based speech enhancement system.
  • Automatic speech recognition for low-resource languages.


  1. Mesgarani, N., O’Sullivan, J., Chen, Z., “Neural decoding of attentional selection in multi-speaker environments without access to separated sources”, provisional patent filed June 2016.
  2. Hershey, J., Le Roux, J., Watanabe, S., Chen, Z., “Method for distinguishing components of an acoustic signal”, US Patent No. 9,368,110, 2016.


  1. Yi Luo, Zhuo Chen, Jonathan Le Roux, John Hershey, Nima Mesgarani, “Deep Clustering and Conventional Networks for Music Separation: Stronger Together”, submitted to ICASSP 2017.
  2. Zhuo Chen, Yan Huang, Jinyu Li, Yifan Gong, “Improving mask learning based speech enhancement system with restoration layers and residual connection”, submitted to ICASSP 2017.
  3. Zhuo Chen, Yi Luo, Nima Mesgarani, “Deep attractor network for single-microphone speaker separation”, submitted to ICASSP 2017.
  4. Yi Luo, Zhuo Chen, Jonathan Le Roux, John Hershey, Daniel P. W. Ellis, “Deep Clustering for Singing Voice Separation”, MIREX, task of Singing Voice Separation, 2016 (1st and 2nd place).
  5. Shi-Xiong Zhang, Zhuo Chen, Yong Zhao, Jinyu Li, Yifan Gong, “End-to-End Attention based Text-Dependent Speaker Verification”, in 2016 IEEE Workshop on Spoken Language Technology, Dec 2016.
  6. Yusuf Isik, Jonathan Le Roux, Zhuo Chen, Shinji Watanabe, John R. Hershey, “Single-Channel Multi-Speaker Separation Using Deep Clustering”, in Proc. Interspeech, San Francisco, Sep 2016.
  7. Tasha Nagamine, Zhuo Chen, Nima Mesgarani, “Adaptation of Neural Networks Constrained by Prior Statistics of Node Co-Activations”, in Proc. Interspeech, San Francisco, Sep 2016.
  8. Z. Chen, J. O'Sullivan, S. Sheth, G. McKhann, A. D. Mehta, N. Mesgarani, “Neural decoding of attentional selection in multi-speaker environments without access to separated sources”, in Society for Neuroscience 2016, San Diego, Nov 2016.
  9. John R. Hershey, Zhuo Chen, Jonathan Le Roux, Shinji Watanabe, Yusuf Isik, “Deep clustering: Discriminative embeddings for segmentation and separation”, in Proc. ICASSP, Shanghai, April 2016.
  10. T. Hori, Z. Chen, H. Erdogan, J. Hershey, J. Le Roux, V. Mitra, S. Watanabe, “The MERL/SRI System for the 3rd CHiME Challenge Using Beamforming, Robust Feature Extraction, and Advanced Speech Recognition”, in Proc. ASRU, Arizona, Dec 2015.
  11. R. Hsiao, J. Ma, W. Hartmann, M. Karafiat, F. Grezl, L. Burget, I. Szoke, J.H. Cernocky, S. Watanabe, Z. Chen, S. Mallidi, H. Hermansky, S. Tsakalidis, R. Schwartz, “Robust Speech Recognition in Unknown Reverberant and Noisy Conditions”, in Proc. ASRU, Arizona, Dec 2015.
  12. Z. Chen, S. Watanabe, H. Erdogan, J. Hershey, “Speech enhancement and recognition using multi-task learning of long short-term memory recurrent neural networks”, in Proc. Interspeech, Dresden, Sep 2015.
  13. Z. Chen, B. McFee, D. Ellis, “Speech enhancement by low-rank and convolutive dictionary spectrogram decomposition”, in Proc. Interspeech, Singapore, Sep 2014.
  14. D. Ellis, H. Satoh, and Z. Chen, “Detecting proximity from personal audio recordings”, in Proc. Interspeech, Singapore, Sep 2014.
  15. Z. Chen, H. Papadopoulos, D. Ellis, “Content-adaptive speech enhancement by a sparsely-activated dictionary plus low rank decomposition”, in Proc. HSCMA, Nancy, May 2014.
  16. Z. Chen, D. Ellis, “Speech enhancement by sparse, low-rank, and dictionary spectrogram decomposition”, in Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, 2013.
  17. Z. Chen, G. Grindlay, D. Ellis, “Transcribing multi-instrument polyphonic music with transformed eigeninstrument whole-note templates”, MIREX, task of Multiple Fundamental Frequency Estimation and Tracking, 2012.