====== Chaski ======

Chaski is a distributed toolkit for machine translation. It contains the following tools:

  - Distributed word clustering. It can build word classes for billion-word corpora.
  - Distributed word alignment. Using the newest version of [[MGIZA:overview]], it can train word alignment models on the cluster in hours instead of days.
  - Distributed phrase extraction. Phrase extraction on a large corpus is slow and requires huge amounts of disk space or memory. Chaski extracts phrases at very high speed and uses HDFS to store intermediate files, which eases the disk usage.

On Yahoo!'s M45 cluster, Chaski performed full training (starting from raw parallel data and producing a Moses-compatible phrase table and reordering table) on a corpus of 6 million sentence pairs in 8 hours. On a single machine, this usually takes one week((With MGIZA++ on a quad-core machine, you may be able to get it done in four or five days.)).

Using Chaski is easy. A typical training run requires just two commands:

  setup-chaski-full Source-corpus Target-corpus HDFSRoot > chaski.config

sets up a configuration file that you can fine-tune, and then

  train-full chaski.config

runs through the pipeline and gets the phrase table ready.

Chaski is also flexible: the config file contains many options that you can adjust to maximize speed for your cluster's setup.

Currently Chaski is mainly developed on Yahoo!'s M45 cluster, and we would appreciate it if you could use and test it on other clusters. Bug reports are also appreciated!

===== Download =====

Please visit the [[chaski:download]] page for the most up-to-date release of Chaski.

===== Installation =====

Please see [[chaski:install]] for details about installing Chaski.

===== HOWTO =====

[[chaski:tutorial]] provides a simple tutorial on how to run Chaski, along with an explanation of its configuration options. If you encounter problems, please let me know and I will add them to [[chaski:FAQ]].
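
As a rough illustration of the two-command workflow from the overview, the steps could be wrapped in a small shell script like the sketch below. Only ''setup-chaski-full'' and ''train-full'' and their argument order come from this page. The function name ''chaski_full_training'', the ''DRY_RUN'' switch, the corpus file names and the HDFS path are hypothetical placeholders. With ''DRY_RUN=1'' (the default in this sketch) the script only prints the commands it would run, so it can be tried on a machine without Chaski installed.

```shell
#!/bin/sh
# Hypothetical wrapper around the two Chaski commands from the overview.
# DRY_RUN=1 (the default here) only prints the commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"

chaski_full_training() {
    src="$1"; tgt="$2"; hdfs_root="$3"; config="$4"
    if [ "$DRY_RUN" = "1" ]; then
        printf 'setup-chaski-full %s %s %s > %s\n' "$src" "$tgt" "$hdfs_root" "$config"
        printf 'train-full %s\n' "$config"
        return 0
    fi
    # Step 1: generate a configuration file that can be fine-tuned by hand.
    setup-chaski-full "$src" "$tgt" "$hdfs_root" > "$config" || return 1
    # Step 2: run the full pipeline described by that configuration.
    train-full "$config"
}

# Placeholder file names and HDFS path; replace them with real ones.
chaski_full_training corpus.src corpus.tgt /user/me/chaski chaski.config
```

In a real run you would probably execute the two steps separately, so that the generated configuration file can be adjusted for your cluster before training starts.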