Chaski is a distributed toolkit for machine translation. It contains the following tools:
On Yahoo!'s M45 cluster, Chaski performed full training (starting from raw parallel data and producing a Moses-compatible phrase table and reordering table) on a corpus of 6 million sentence pairs in 8 hours. On a single machine, this usually takes one week 1).
Using Chaski is easy. A typical training run requires just two commands:
setup-chaski-full Source-corpus Target-corpus HDFSRoot > chaski.config
will set up a configuration file that you can fine-tune, and then
train-full chaski.config
will run through the pipeline and get the phrase table ready.
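The two steps above can be wrapped in a small shell function. This is only a sketch: the function name chaski_train, the default config name, and the choice to reuse an existing config file (so that hand-tuned settings are not overwritten) are assumptions made for illustration, not part of Chaski itself. Only the two commands and their argument order are taken from the text above.

```shell
# Hypothetical helper, not shipped with Chaski.
# Usage: chaski_train SOURCE_CORPUS TARGET_CORPUS HDFS_ROOT [CONFIG_FILE]
chaski_train() {
    if [ "$#" -lt 3 ]; then
        echo "usage: chaski_train SOURCE_CORPUS TARGET_CORPUS HDFS_ROOT [CONFIG_FILE]" >&2
        return 2
    fi
    src=$1
    tgt=$2
    hdfs_root=$3
    config=${4:-chaski.config}

    # Fail early if either Chaski command is missing from PATH.
    for tool in setup-chaski-full train-full; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "chaski_train: $tool not found on PATH" >&2
            return 127
        fi
    done

    # Step 1: generate the config only if there is none yet, so that a
    # config you have already fine-tuned is reused rather than overwritten.
    if [ ! -s "$config" ]; then
        if ! setup-chaski-full "$src" "$tgt" "$hdfs_root" > "$config"; then
            rm -f "$config"   # do not leave a partial config behind
            return 1
        fi
    fi

    # Step 2: run the full training pipeline with that config.
    train-full "$config"
}
```

For example, with placeholder file names, chaski_train corpus.fr corpus.en /user/me/chaski would generate chaski.config on the first call and start training. To fine-tune first, run setup-chaski-full by hand, edit the file, and then call the function, which will pick up the edited config.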
Chaski is also flexible: the config file contains many options that you can adjust to maximize speed for your cluster's setup.
Currently Chaski is mainly developed on Yahoo!'s M45 cluster, and we would appreciate it if you could use and test it on other clusters. Bug reports are also appreciated!
Please visit the download page for the most up-to-date release of Chaski.
Please see install for details about installing Chaski.
tutorial provides a simple tutorial on how to run Chaski, along with an explanation of its configuration options.
If you encounter problems, please let me know and I will add them to the FAQ.