About Qin Gao


Since February 2012, I have been working at Microsoft Research in Redmond.

I am a Ph.D. candidate in the Language Technologies Institute, School of Computer Science, Carnegie Mellon University. Currently I am working with Stephan Vogel on the GALE project, improving word alignment for phrase-based machine translation systems and parallelizing the machine translation training pipeline.

Also, I am working with Stephan Vogel and Noah Smith on the INCA project, building an open source distributed statistical machine translation system for clusters, especially for MapReduce frameworks such as Hadoop.
I worked at Microsoft Research in Redmond, WA as an intern from April 2011 to July 2011, working closely with Will Lewis, Chris Quirk and Mei-Yuh Hwang. The research topic was incremental training for word alignment.

My current work focuses on using Semantic Role information to assist machine translation.


My research interest is Natural Language Processing, especially Machine Translation and Speech Recognition. My thesis topic is the use of Semantic Role information for Machine Translation.

I worked on parallelizing the machine translation pipeline and released software packages such as Multi-Threaded GIZA++ (MGIZA++) and Parallel GIZA++ (PGIZA++). An internal version of the parallel phrase extraction tool Chaski is being developed and tested on Yahoo's M45 cluster. All of this work will be part of the INCA project, funded by NSF.

Also, I am interested in improving word alignment quality for machine translation, and I look forward to integrating syntactic information into phrase-based machine translation.

Before joining LTI in 2007, I worked on speech recognition, mainly on ASR decoders. Please refer to my CV for more details on this research.


  • Ph.D. Candidate
    Language Technologies Institute, Carnegie Mellon University
    August 2009 – Present
  • Master Student
    Language Technologies Institute, Carnegie Mellon University
    August 2007 – August 2009
    GPA: 3.91
  • Master of Engineering
    National Key Laboratory for Machine Perception, Peking University
    September, 2004 – June, 2007
    GPA: 3.82
    Diploma Thesis: Research and Implementation of Chinese Spoken Document Retrieval System
    Graduated with First Honor
  • Bachelor of Science
    School of Mathematical Sciences, Peking University
    Major: Mathematics, Scientific and Engineering Computing
    September, 2000 – June 2004
    GPA: 3.25
    Graduate Research: Automatic Spoken English Quality Evaluation System
    Second Major: Bachelor of Economics



Please visit http://www.kyloo.net/software for a list of my software.

  • Chaski: A software package for training phrase-based machine translation systems on Hadoop clusters. Together with MGIZA, it can train large-scale models in hours.
  • MGIZA: Multi-threaded GIZA. It is an extended and optimized version of GIZA++ that can run multi-threaded and provides additional functionality and optimizations such as:
    • Resuming training from previous models. You may restart training from any step, given a previous model.
    • Memory usage optimization. It eliminates duplicate tables in memory, which may save hundreds of megabytes. This is crucial for distributed alignment.
    • Integration with Chaski. This version is fully integrated with Chaski and can therefore be run on Hadoop clusters. (Currently it works only with Hadoop 0.20.1+.)



Book Chapter


  • Francisco Guzman, Qin Gao, Jan Niehues and Stephan Vogel, Word Alignment Revisited, in Handbook of Natural Language Processing and Machine Translation, pp. 164-175, edited by Joseph Olive, Caitlin Christianson, John McCary, Springer, ISBN 978-1-4419-7712-0, DOI 10.1007/978-1-4419-7713-7

Journal Papers


  • Qin Gao, Stephan Vogel. Training phrase-based machine translation models on the cloud: Open source machine translation toolkit Chaski. The Prague Bulletin of Mathematical Linguistics No. 93, 2010, pp. 37–46. ISBN 978-80-904175-4-0. doi: 10.2478/v10108-010-0004-8. pdf
  • Jonathan H. Clark, Jonathan Weese, Byung Gyu Ahn, Andreas Zollmann, Qin Gao, Kenneth Heafield, Alon Lavie. The Machine Translation Toolpack for LoonyBin: Automated Management of Experimental Machine Translation HyperWorkflows. The Prague Bulletin of Mathematical Linguistics No. 93, 2010, pp. 117–126. ISBN 978-80-904175-4-0. doi: 10.2478/v10108-010-0002-x.

Conference Papers


  • Qin Gao, Will Lewis, Chris Quirk and Mei-Yuh Hwang, Incremental Training and Intentional Over-fitting of Word Alignment, To Appear in MT Summit 13, September 2011, Xiamen, China
  • Nguyen Bach, Qin Gao and Stephan Vogel, TriS: A Statistical Sentence Simplifier with Log-linear Models and Margin-based Discriminative Training, To appear in The 5th International Joint Conference on Natural Language Processing (IJCNLP 2011), November 2011.
  • Sanjika Hewavitharana, Nguyen Bach, Qin Gao, Vamshi Ambati and Stephan Vogel, CMU Haitian Creole-English Translation System for WMT 2011, bib
  • Qin Gao and Stephan Vogel, Corpus Expansion for Statistical Machine Translation with Semantic Role Label Substitution Rules, Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL 2011), pp. 288-294. pdf bib
  • Qin Gao and Stephan Vogel, Utilizing Target-Side Semantic Role Labels to Assist Hierarchical Phrase-based Machine Translation, Proceedings of Fifth Workshop on Syntax, Semantics and Structure in Statistical Translation (SSST-5), pdf bib


  • Qin Gao and Stephan Vogel, A Multi-layer Chinese Word Segmentation System Optimized for Out-of-domain Tasks, CIPS-SIGHAN Joint Conference on Chinese Language Processing (CLP2010) pdf bib
  • Qin Gao, Francisco Guzman, Stephan Vogel, EMDC: A Semi-supervised Approach for Word Alignment, pp. 349-357, in Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), pdf bib
  • Qin Gao, Nguyen Bach, Stephan Vogel, A Semi-supervised Word Alignment Algorithm with Partial Manual Alignments, pp. 1-10, ACL 2010 Joint fifth workshop on statistical machine translation and metrics MATR. pdf bib
  • Qin Gao and Stephan Vogel, Consensus Versus Expertise: A Case Study of Word Alignment with Mechanical Turk, pp. 30-34, Workshop on Creating Speech and Language Data With Amazon’s Mechanical Turk, 2010 pdf bib


  • Nguyen Bach, Qin Gao, Stephan Vogel, Source-side Dependency Tree Reordering Models with Subtree Movements and Constraints, MT Summit XII, 2009 pdf
  • Francisco Guzman, Qin Gao, Stephan Vogel, Reassessment of the Role of Phrase Extraction in SMT, MT Summit XII, 2009 pdf


  • Qin Gao, Stephan Vogel, "Parallel Implementations of Word Alignment Tool", Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pp. 49-57, June, 2008 pdf bib
  • Nguyen Bach, Qin Gao, Stephan Vogel, "Improving Word Alignment with Language Model Based Confidence Scores", Proceedings of the Third Workshop on Statistical Machine Translation, pp. 151-154, June, 2008. pdf bib
  • Almut Silja Hildebrand, Kay Rottmann, Mohamed Noamany, Qin Gao, Sanjika Hewavitharana, Nguyen Bach, Stephan Vogel, "Recent Improvements in the CMU Large Scale Chinese-English SMT System", Proceedings of ACL-08: HLT, Short Papers, pp. 77-80, June, 2008 pdf bib

Before 2008

  • Qin Gao, Xiaojun Lin, Xihong Wu, "Just-in-time Latent Semantic Adaptation on Language Model for Chinese Speech Recognition Using Web Data", International Workshop on Spoken Language Technology (SLT), pp. 50-53, September, 2006. abstract & fulltext
  • Runqiang Han, Pei Zhao, Qin Gao, Zhiping Zhang, Hao Wu, Xihong Wu, "CASA Based Speech Separation for Robust Speech Recognition", International Conference on Speech and Language Processing (ICSLP), pp. 78-81, September, 2006. pdf


Slides & Reports

  • Large Scale Machine Translation Architectures, Reading report for Advanced Machine Translation Seminar, Spring 2008 ppt
  • Parallelizing the Training Procedure of Statistical Phrase-based Machine Translation, Student Research Symposium, 2008 pdf