I am a Ph.D. student at the Language Technologies Institute, School of Computer Science, Carnegie Mellon University. I am currently working with Stephan Vogel on the GALE project, improving word alignment for phrase-based machine translation systems and parallelizing the machine translation training pipeline.
I am also working with Stephan Vogel and Noah Smith on the INCA project, building an open source distributed statistical machine translation system for clusters, especially for MapReduce frameworks such as Hadoop.
I worked at Microsoft Research in Redmond, WA, as an intern from April 2011 to July 2011, in close collaboration with Will Lewis, Chris Quirk and Mei-Yuh Hwang. The research topic was incremental training for word alignment.
My current work focuses on using semantic role information to assist machine translation.
I am editing several pages on Chinese Wikipedia. If you know Chinese, please help me improve them: Statistical Machine Translation, Machine Translation
Research
I focus on parallelizing the machine translation pipeline and have released software packages such as Multi-Threaded GIZA++ (MGIZA++) and Parallel GIZA++ (PGIZA++). An internal version of the parallel phrase extraction tool Chaski is being developed and tested on Yahoo's M45 cluster. All of this work will be part of the INCA project, funded by NSF.
I am also interested in improving word alignment quality for machine translation, and I look forward to integrating syntactic information into phrase-based machine translation.
Before joining LTI in 2007, I worked on speech recognition, mainly on ASR decoders. Please refer to my CV for more detail on the research.
Education
- Ph.D Student
- Language Technologies Institute, Carnegie Mellon University
- August 2009 - Now
- Master Student
- Language Technologies Institute, Carnegie Mellon University
- August 2007 - August 2009
- GPA: 3.91
- Master of Engineering
- National Key Laboratory for Machine Perception, Peking University
- September 2004 - June 2007
- GPA: 3.82
- Diploma Thesis: Research and Implementation of Chinese Spoken Document Retrieval System
- Graduated with First Honor
- Bachelor of Science
- School of Mathematical Sciences, Peking University
- Major: Mathematics, Scientific and Engineering Computing
- September 2000 - June 2004
- GPA: 3.25
- Graduate Research: Automatic Spoken English Quality Evaluation System
- Second Major: Economics
Software
Please visit http://geek.kyloo.net/software for a list of my software.
- Chaski: A software package for training phrase-based machine translation systems on Hadoop clusters. Together with MGIZA, it can train large-scale models in hours.
- MGIZA: Multi-threaded GIZA. It is an extended and optimized version of GIZA++ that can run multi-threaded and provides additional functionality and optimizations such as:
- Resume training from previous models. You may restart training from any step, given a previous model.
- Memory usage optimization. Eliminate duplicated tables in memory, which may save hundreds of megabytes of memory. It is crucial for distributed alignment.
- Integrate with Chaski. This version is fully integrated with Chaski and can therefore be run on Hadoop clusters. (Currently it only works with Hadoop 0.20.1+.)
Want to use Word for an ACL submission but frustrated by the citations? I have a BibWord stylesheet for you: ACLStyleSheet
Publications
Book Chapter
2011
- Francisco Guzman, Qin Gao, Jan Niehues and Stephan Vogel, Word Alignment Revisited, in Handbook of Natural Language Processing and Machine Translation, pp. 164-175, edited by Joseph Olive, Caitlin Christianson, John McCary, Springer, ISBN 978-1-4419-7712-0, DOI 10.1007/978-1-4419-7713-7
Journal Papers
2010
- Qin Gao, Stephan Vogel. Training phrase-based machine translation models on the cloud: Open source machine translation toolkit Chaski. The Prague Bulletin of Mathematical Linguistics No. 93, 2010, pp. 37-46. ISBN 978-80-904175-4-0. doi: 10.2478/v10108-010-0004-8. pdf
- Jonathan H. Clark, Jonathan Weese, Byung Gyu Ahn, Andreas Zollmann, Qin Gao, Kenneth Heafield, Alon Lavie. The Machine Translation Toolpack for LoonyBin: Automated Management of Experimental Machine Translation HyperWorkflows. The Prague Bulletin of Mathematical Linguistics No. 93, 2010, pp. 117-126. ISBN 978-80-904175-4-0. doi: 10.2478/v10108-010-0002-x.
Conference Papers
2011
- Qin Gao, Will Lewis, Chris Quirk and Mei-Yuh Hwang, Incremental Training and Intentional Over-fitting of Word Alignment, To Appear in MT Summit 13, September 2011, Xiamen, China
- Nguyen Bach, Qin Gao and Stephan Vogel, TriS: A Statistical Sentence Simplifier with Log-linear Models and Margin-based Discriminative Training, To appear in The 5th International Joint Conference on Natural Language Processing (IJCNLP 2011), November 2011.
- Sanjika Hewavitharana, Nguyen Bach, Qin Gao, Vamshi Ambati and Stephan Vogel, CMU Haitian Creole-English Translation System for WMT 2011, bib
- Qin Gao and Stephan Vogel, Corpus Expansion for Statistical Machine Translation with Semantic Role Label Substitution Rules, Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL 2011), pp. 288-294. pdf bib
- Qin Gao and Stephan Vogel, Utilizing Target-Side Semantic Role Labels to Assist Hierarchical Phrase-based Machine Translation, Proceedings of Fifth Workshop on Syntax, Semantics and Structure in Statistical Translation (SSST-5), pdf bib
2010
- Qin Gao and Stephan Vogel, A Multi-layer Chinese Word Segmentation System Optimized for Out-of-domain Tasks, CIPS-SIGHAN Joint Conference on Chinese Language Processing (CLP2010) pdf bib
- Qin Gao, Francisco Guzman, Stephan Vogel, EMDC: A Semi-supervised Approach for Word Alignment, pp. 349-357, in Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), pdf bib
- Qin Gao, Nguyen Bach, Stephan Vogel, A Semi-supervised Word Alignment Algorithm with Partial Manual Alignments, pp. 1-10, ACL 2010 Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR. pdf bib
- Qin Gao and Stephan Vogel, Consensus Versus Expertise: A Case Study of Word Alignment with Mechanical Turk, pp. 30-34, Workshop on Creating Speech and Language Data With Amazon's Mechanical Turk, 2010 pdf bib
2009
- Nguyen Bach, Qin Gao, Stephan Vogel, Source-side Dependency Tree Reordering Models with Subtree Movements and Constraints, MT Summit XII, 2009 pdf
- Francisco Guzman, Qin Gao, Stephan Vogel, Reassessment of the Role of Phrase Extraction in SMT, MT Summit XII, 2009 pdf
2008
- Qin Gao, Stephan Vogel, "Parallel Implementations of Word Alignment Tool", Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pp. 49-57, June, 2008 pdf bib
- Nguyen Bach, Qin Gao, Stephan Vogel, "Improving Word Alignment with Language Model Based Confidence Scores", Proceedings of the Third Workshop on Statistical Machine Translation, pp. 151-154, June, 2008. pdf bib
- Almut Silja Hildebrand, Kay Rottmann, Mohamed Noamany, Qin Gao, Sanjika Hewavitharana, Nguyen Bach, Stephan Vogel, "Recent Improvements in the CMU Large Scale Chinese-English SMT System", Proceedings of ACL-08: HLT, Short Papers, pp. 77-80, June, 2008 pdf bib
Before 2008
- Qin Gao, Xiaojun Lin, Xihong Wu, "Just-in-time Latent Semantic Adaptation on Language Model for Chinese Speech Recognition Using Web Data", International Workshop on Spoken Language Technology (SLT), pp. 50-53, September, 2006. abstract & fulltext
- Runqiang Han, Pei Zhao, Qin Gao, Zhiping Zhang, Hao Wu, Xihong Wu, "CASA Based Speech Separation for Robust Speech Recognition", International Conference on Speech and Language Processing (ICSLP), pp. 78-81, September, 2006. pdf
Patents
- Method and system for automatic subtitling, with Huisheng Chi, Xihong Wu, Songfang Huang, Chunxia Lv, Hao Wu and Hao Tian http://ip.com/patent/CN100536532C
Slides & Reports
- Large Scale Machine Translation Architectures, Reading report for Advanced Machine Translation Seminar, Spring 2008 ppt
- Parallelizing the Training Procedure of Statistical Phrase-based Machine Translation, Student Research Symposium, 2008 pdf
My Gadgets
(From oldest to newest)
- Palm Treo650 (Used to be my phone, my notebook and my Gameboy simulator)
- N800 (My favorite. Although I do not carry it around, I use it to read Wikipedia every night, and it killed the Treo with GarnetVM)
- MIO 520 (PNA, running MIO Mapper as well as iGo 8, hacked and being used as Skype phone when I forget to bring N800. SDIO Wifi card required, of coz)
- EeePC 1000h (It is a laptop, but small enough to replace most gadgets, I am now playing around with it. Less fun with Windoz installed anyway)
Some links about me
Well, I have to say that information from the media is not always accurate. I must clarify that although I work on cloud computing, I do not work directly on the TransTac project, which does English-Iraqi translation.
The Tartan Online: CMU does research on cloud computing
Pittsburgh TRIBUNE-REVIEW: CMU pushes into frontiers of 'cloud computing'
CMU works on processing remote computer data
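Pursuing Excellence, Achieving Great Results: Peking University Wins the Winner's Cup at the 9th Challenge Cup Academic Science and Technology Works Competition (an article in Chinese about my work on TV caption alignment)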
Private Zone
| CodeSnipplet | MySchedule | ConfigurationFiles |
