The most common uses of Chaski are word alignment and phrase extraction. This tutorial covers both.
Download Chaski here.
Follow the instructions here to install Chaski.
The rest of this tutorial explains two alternative pipelines: running full training starting from word alignment, and running only the phrase extraction.
To run full training, only the parallel corpus files are needed. You need to preprocess the corpus so that it does not contain the special character sequences used as separators, which may break the pipeline. The special sequences are:
|||
{##}
We also recommend that you remove '|' and '#', or map them to other characters or markers, because they may later affect the Moses decoder.
As an example, if you want to build a Chinese-to-English translation model, you need to collect a Chinese-English parallel corpus and store it in two plain text files in which corresponding lines are translations of each other. We assume the two files are:
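A minimal sketch of such a preprocessing step, assuming it is acceptable to map the two characters to placeholder tokens. The function name `sanitize_corpus` and the tokens `_PIPE_` and `_HASH_` are illustrative choices, not part of Chaski; pick markers that do not otherwise occur in your data.

```shell
# Hypothetical helper (not part of Chaski): replace every '|' and '#'
# so that neither "|||" nor "{##}" can appear in the output.
# Reads the corpus on stdin and writes the cleaned text to stdout.
sanitize_corpus() {
    sed -e 's/|/_PIPE_/g' -e 's/#/_HASH_/g'
}

# Example usage (the input file names are placeholders):
#   sanitize_corpus < Train.raw.ch > Train.ch
#   sanitize_corpus < Train.raw.en > Train.en
```

If you map the characters this way, the same mapping presumably has to be applied to any text you later pass to the decoder, so that training and test data stay consistent.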
Train.ch Train.en
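Because corresponding lines must be translations of each other, the two files need the same number of lines. A small sanity check could look like the following; `check_parallel` is a hypothetical helper name, not a Chaski script.

```shell
# Hypothetical helper: succeed only if both files have the same number of lines.
# Returns 2 if a file cannot be read, 1 on a mismatch, 0 otherwise.
check_parallel() {
    src_lines=$(wc -l < "$1") || return 2
    tgt_lines=$(wc -l < "$2") || return 2
    if [ "$src_lines" -ne "$tgt_lines" ]; then
        echo "line count mismatch: $1=$src_lines $2=$tgt_lines" >&2
        return 1
    fi
}

# Example usage:
#   check_parallel Train.ch Train.en && echo "corpus looks parallel"
```

Equal line counts do not prove that the files are correctly aligned, but a mismatch is a sure sign that something went wrong during preprocessing.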
Create a directory on a machine that has access to Hadoop, say /home/goodboy/chaski-train/, and copy or link the files into it.
Suppose now in /home/goodboy/chaski-train/ you can see the two files stated above.
Then choose an HDFS directory to which you have write access. It will be used as temporary storage, and the final output will also be stored there. Suppose you chose /user/goodboy/Chaski-HDFS/.
Run the following command:
cd /home/goodboy/chaski-train/
$Chaski_HOME/scripts/setup-chaski-full \
    Train.ch \
    Train.en \
    /user/goodboy/Chaski-HDFS/ \
    > training.config
This will generate a default training configuration file
/home/goodboy/chaski-train/training.config
If you want, you may edit the file and modify some parameters; most parameters have an explanation above them. Pay special attention to the last two parameters:
# Output directory of Moses Phrase table
moses-p=/user/goodboy/Chaski-HDFS/moses-phrase
# Output directory of Moses lexiconized reorder table
moses-r=/user/goodboy/Chaski-HDFS/moses-reorder
The Moses-compatible phrase table and reorder table will be written to these directories.
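If you want to reuse these output locations in your own scripts, you can read them back from the configuration file. The sketch below assumes only the `key=value` layout with `#` comment lines shown above; `config_get` is a hypothetical helper, not part of Chaski.

```shell
# Hypothetical helper: print the value of KEY from a key=value config file.
# Lines starting with '#' are ignored. If KEY is assigned more than once,
# this helper prints the last assignment. Fails if KEY is not present.
config_get() {
    awk -F= -v key="$1" '
        /^[[:space:]]*#/ { next }
        $1 == key { value = substr($0, index($0, "=") + 1); found = 1 }
        END { if (found) print value; else exit 1 }
    ' "$2"
}

# Example usage:
#   phrase_dir=$(config_get moses-p training.config)
#   reorder_dir=$(config_get moses-r training.config)
```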
Finally you are ready to fire up the training. Just run:
$Chaski_HOME/scripts/train-full /home/goodboy/chaski-train/training.config
And wait for its completion.
If training completes successfully, you may want to fetch the Moses phrase table and reorder table to the local machine. You can do so by running:
hadoop fs -cat /user/goodboy/Chaski-HDFS/moses-phrase/* | gzip -c > phrase-table.0-0.gz
hadoop fs -cat /user/goodboy/Chaski-HDFS/moses-reorder/* | gzip -c > reorder-table.0-0.gz
This step is not included in the script because the decoder can generally be run on HDFS, and fetching the files by hand is trivial.
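After fetching, it can be worth checking that the compressed files are intact before handing them to the decoder. The sketch below relies on the fact that Moses phrase and reorder tables separate their fields with ` ||| `; the helper name `check_table` and the choice to inspect only the first 100 lines are assumptions made for illustration.

```shell
# Hypothetical helper: verify that FILE is a valid gzip archive and that
# its first lines have at least three " ||| "-separated fields, as Moses
# phrase and reorder tables do. Fails on an empty or malformed file.
check_table() {
    gzip -t "$1" || return 1
    gzip -dc "$1" | head -n 100 | awk -F' [|][|][|] ' '
        NF < 3 { bad = 1 }
        END { if (NR == 0 || bad) exit 1 }
    '
}

# Example usage:
#   check_table phrase-table.0-0.gz && check_table reorder-table.0-0.gz
```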
Chaski requires Moses-style input; it replaces Moses' training steps from step 3 on. You can therefore run Moses training up to step 3. Look into the Moses training directory: in its model/ directory there should be three files:
aligned.*.$src aligned.*.$tar aligned.$method
where src and tar are the source and target language symbols and method is the alignment combination heuristic. For example, if you train a model to translate from Chinese (ch) to English (en) and use the grow-diag-and heuristic, the three files should be:
aligned.0.ch aligned.0.en aligned.grow-diag-and
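The naming pattern can be checked mechanically before copying the files. In this sketch, `find_aligned` is a hypothetical helper that takes the model directory, the two language symbols and the heuristic name; it is not part of Chaski or Moses.

```shell
# Hypothetical helper: check that the three extraction inputs exist in DIR
# and print their file names, one per line. Fails on the first missing file.
find_aligned() {
    fa_dir=$1 fa_src=$2 fa_tar=$3 fa_method=$4
    for fa_file in "$fa_dir"/aligned.*."$fa_src" \
                   "$fa_dir"/aligned.*."$fa_tar" \
                   "$fa_dir/aligned.$fa_method"; do
        # An unmatched glob stays literal, so the -f test also catches it.
        if [ ! -f "$fa_file" ]; then
            echo "missing input: $fa_file" >&2
            return 1
        fi
        basename "$fa_file"
    done
}

# Example usage:
#   find_aligned model ch en grow-diag-and
```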
Create a directory on a machine that has access to Hadoop, say /home/goodboy/chaski-train/, and copy or link the files into it.
Suppose now in /home/goodboy/chaski-train/ you can see the three files stated above.
Then choose an HDFS directory to which you have write access. It will be used as temporary storage, and the final output will also be stored there. Suppose you chose /user/goodboy/Chaski-HDFS/.
Run the following command:
cd /home/goodboy/chaski-train/
$Chaski_HOME/scripts/setup-chaski \
    aligned.0.ch \
    aligned.0.en \
    aligned.grow-diag-and \
    /user/goodboy/Chaski-HDFS/ \
    > training.config
This will generate a default training configuration file
/home/goodboy/chaski-train/training.config
If you want, you may edit the file and modify some parameters; most parameters have an explanation above them. Pay special attention to the last two parameters:
# Output directory of Moses Phrase table
moses-p=/user/goodboy/Chaski-HDFS/moses-phrase
# Output directory of Moses lexiconized reorder table
moses-r=/user/goodboy/Chaski-HDFS/moses-reorder
The Moses-compatible phrase table and reorder table will be written to these directories.
Finally you are ready to fire up the training. Just run:
$Chaski_HOME/scripts/extract /home/goodboy/chaski-train/training.config
And wait for its completion.
If extraction completes successfully, you may want to fetch the Moses phrase table and reorder table to the local machine. You can do so by running:
hadoop fs -cat /user/goodboy/Chaski-HDFS/moses-phrase/* | gzip -c > phrase-table.0-0.gz
hadoop fs -cat /user/goodboy/Chaski-HDFS/moses-reorder/* | gzip -c > reorder-table.0-0.gz
This step is not included in the script because the decoder can generally be run on HDFS, and fetching the files by hand is trivial.
The most frequent errors are memory errors. If you get an error like “OutOfMemoryError” or “GC overhead limit exceeded”, there are two ways to address it. First, you may try to increase the heap size of each mapper/reducer. These errors usually happen in the extract stage or the score stage, and you can modify the following parameters:
extract.hp=1200m
score.hp=1300m
However, if the heap size is too large, the map/reduce tasks will not start. If you get an error like “IOError” during initialization of the mappers, you should give up on this method and reduce the heap size.
Second, you can turn to other parameters for help. The two most useful ones are:
score.mb=2000000
score.cc=true
You may decrease the first and set the second to true. Both will reduce the speed, but if they work you can increase the number of mappers/reducers to compensate.
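When experimenting with these settings it is convenient to change them from a script rather than by hand. The sketch below assumes only the `key=value` layout of training.config; `config_set` is a hypothetical helper, and the values in the usage example are illustrative, not recommendations.

```shell
# Hypothetical helper: set KEY to VALUE in a key=value config file,
# replacing an existing assignment or appending a new one at the end.
config_set() {
    cs_key=$1 cs_value=$2 cs_file=$3
    cs_tmp=$(mktemp) || return 1
    awk -F= -v key="$cs_key" -v value="$cs_value" '
        $1 == key { print key "=" value; done = 1; next }
        { print }
        END { if (!done) print key "=" value }
    ' "$cs_file" > "$cs_tmp" && mv "$cs_tmp" "$cs_file"
}

# Example usage (illustrative values):
#   config_set score.hp 1500m training.config
#   config_set score.mb 1000000 training.config
#   config_set score.cc true training.config
```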
You may want to run a pipeline step by step. The extraction pipeline has five steps and the full training pipeline has seven; you can run a single step, or a range of steps, by specifying the first and last step by number.
The steps in full training pipeline:
Usage : /home/qing/ChaskiV3/ChaskiV3/scripts/train-full CONFIG-FILE [FIRST-STEP] [LAST-STEP]
Steps :
    1 - Make word class
    2 - Prepare corpus for word alignment
    3 - Word alignment
    4 - Extract phrases
    5 - Build lexicon tables
    6 - Score phrase table
    7 - Postprocess and output Moses phrase table
The steps in extraction only pipeline:
Usage : /home/qing/ChaskiV3/ChaskiV3/scripts/extract CONFIG-FILE [FIRST-STEP] [LAST-STEP]
Steps :
    1 - Preprocess and put files to HDFS
    2 - Extract phrases
    3 - Build lexicon tables
    4 - Score phrase table
    5 - Postprocess and output Moses phrase table
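Based on the usage lines above, a range of steps can be run one step at a time, which makes it easier to see where a failure happens. The wrapper below is a sketch: `run_steps` is a hypothetical name, and it assumes that passing the same number as FIRST-STEP and LAST-STEP runs exactly that step and that the scripts return a non-zero exit status on failure.

```shell
# Hypothetical wrapper: run steps FIRST..LAST of a pipeline script one at a
# time by passing the same number as FIRST-STEP and LAST-STEP, and stop at
# the first step that fails.
run_steps() {
    rs_script=$1 rs_config=$2 rs_first=$3 rs_last=$4
    rs_step=$rs_first
    while [ "$rs_step" -le "$rs_last" ]; do
        echo "running step $rs_step" >&2
        "$rs_script" "$rs_config" "$rs_step" "$rs_step" || {
            echo "step $rs_step failed" >&2
            return 1
        }
        rs_step=$((rs_step + 1))
    done
}

# Example usage: rerun only scoring and postprocessing of the extraction
# pipeline (steps 4 and 5 in the list above).
#   run_steps "$Chaski_HOME/scripts/extract" training.config 4 5
```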
Chaski actually has a lower-level interface, accessible through
$CHASKI_HOME/script/chaski
You may type it to see help information.
All parameters can be specified either in the configuration file or as command-line parameters. You may find that a parameter has multiple aliases, such as
--corpus          String(R)  The input corpus file. We assume it is
--extract.corpus             on HDFS, and it should be generated by
                             preprocessing stage
--num-reducer     int        Specify the number of reducers
--extract.nr
Both --corpus and --extract.corpus point to the same variable, and you can use either of them in the configuration file, such as:
corpus=abc
extract.corpus=def
The reason for this is to allow modules to share parameter names. For example, --corpus is used by both the extract and buildlex modules, and the alias gives you a way to override the parameter. Just remember that the alias overrides the default name.
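The override rule can be pictured with a small lookup function. This is only a mental model of the behaviour described above, not Chaski's implementation: `resolve_param` is a hypothetical helper that returns the value of the alias if it is set and otherwise falls back to the default name.

```shell
# Illustrative model only (not Chaski's code): look up ALIAS first and fall
# back to the shared DEFAULT name when the alias is not set in the file.
resolve_param() {
    awk -F= -v alt="$1" -v name="$2" '
        /^[[:space:]]*#/ { next }
        $1 == alt  { a = substr($0, index($0, "=") + 1); has_a = 1 }
        $1 == name { d = substr($0, index($0, "=") + 1); has_d = 1 }
        END {
            if (has_a) print a
            else if (has_d) print d
            else exit 1
        }
    ' "$3"
}

# Example usage, with the two-line configuration shown above:
#   resolve_param extract.corpus corpus training.config
```

Under this model, with both corpus=abc and extract.corpus=def present, the lookup for the alias extract.corpus yields def, while a parameter for which only the default name is set keeps the shared value.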