Chaski Tutorial

The most common uses of Chaski are word alignment and phrase extraction. This tutorial covers both.

0. Prepare Chaski

Download Chaski here.

Follow the instructions here to install Chaski.

The rest of the tutorial explains two alternative pipelines: running full training starting from word alignment, and running only the phrase extraction.

1. Full training with word alignment

1.1 Prepare corpus

To run full training you only need the parallel corpus files. Preprocess the corpus so that it does not contain the special character sequences that are used as separators, since these may break the pipeline. The special sequences are:

   |||
   {##}

We also recommend that you remove '|' and '#', or map them to other characters or markers, because they may later affect the Moses decoder.
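The mapping can be done with a one-line sed filter. This is only a minimal sketch: the function name sanitize_corpus and the replacement markers _PIPE_ and _HASH_ are made up for illustration, so pick any tokens that cannot occur in your own corpus. Because every '|' and '#' is mapped, the two separator sequences above disappear as well.

```shell
# Hypothetical replacement markers (not defined by Chaski);
# choose tokens that never occur in your corpus.
PIPE_MARK='_PIPE_'
HASH_MARK='_HASH_'

# Read corpus text on stdin and write the sanitized text to stdout.
# Every '|' and every '#' is replaced, which also removes the
# separator sequences ||| and {##}.
sanitize_corpus() {
    sed -e "s/|/${PIPE_MARK}/g" -e "s/#/${HASH_MARK}/g"
}
```

You would apply it to each side of the corpus separately, for example by feeding the raw Chinese file to sanitize_corpus on stdin and redirecting the output to Train.ch.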

As an example, if you want to build a Chinese-to-English translation model, you need to collect a Chinese-English parallel corpus and store it in two plain text files whose corresponding lines are translations of each other. We assume the two files are:

Train.ch
Train.en

Create a directory on a machine that has access to Hadoop, say /home/goodboy/chaski-train/, and copy or link the files into it.

Suppose that you can now see the two files listed above in /home/goodboy/chaski-train/.
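Since corresponding lines must be translations of each other, the two files must have the same number of lines. A quick sanity check could look like the sketch below; check_parallel is a hypothetical helper, not part of Chaski.

```shell
# Compare the line counts of two parallel corpus files.
# Prints a short summary and returns 0 if they match, 1 otherwise.
check_parallel() {
    # $(( ... )) strips the padding that some wc implementations add
    src_lines=$(( $(wc -l < "$1") ))
    tgt_lines=$(( $(wc -l < "$2") ))
    if [ "$src_lines" -ne "$tgt_lines" ]; then
        echo "line count mismatch: $1 has $src_lines, $2 has $tgt_lines" >&2
        return 1
    fi
    echo "ok: $src_lines sentence pairs"
}
```

For the files above you would run check_parallel Train.ch Train.en inside /home/goodboy/chaski-train/.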

1.2 Choose an HDFS directory

Next, choose an HDFS directory to which you have write access. It is used as temporary storage, and the final output also stays in this directory. Let's suppose you chose /user/goodboy/Chaski-HDFS/.

1.3 Generate default config file

Run the following command:

   cd /home/goodboy/chaski-train/
   $Chaski_HOME/scripts/setup-chaski-full \
              Train.ch \
              Train.en \
              /user/goodboy/Chaski-HDFS/ \
              > training.config

This generates a default training configuration file:

/home/goodboy/chaski-train/training.config

If you want, you may edit the file and modify some parameters; most parameters have an explanation above them. Pay special attention to the last two parameters:

# Output directory of Moses Phrase table
moses-p=/user/goodboy/Chaski-HDFS/moses-phrase
 
# Output directory of Moses lexiconized reorder table
moses-r=/user/goodboy/Chaski-HDFS/moses-reorder

The Moses-compatible phrase table and reorder table will be written to these directories.

1.4 Run the training

Finally, you are ready to start the training. Just run:

        $Chaski_HOME/scripts/train-full /home/goodboy/chaski-train/training.config

and wait for it to complete.

If it completes successfully, you may want to fetch the Moses phrase table and reorder table to the local machine. You may run:

   hadoop fs -cat /user/goodboy/Chaski-HDFS/moses-phrase/* | gzip -c > phrase-table.0-0.gz
   hadoop fs -cat /user/goodboy/Chaski-HDFS/moses-reorder/* | gzip -c > reorder-table.0-0.gz

to get them. I did not include this step in the script because we can generally run the decoder on HDFS, and it is trivial for you to do it yourself.
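After fetching, you may want to check that the tables look sane. The sketch below assumes the usual Moses table layout, in which the fields of each line are separated by " ||| " and there are at least three of them (source phrase, target phrase, scores). count_malformed is a hypothetical helper, not part of Chaski or Moses.

```shell
# Print the number of lines in a gzipped table that have fewer than
# three " ||| "-separated fields.
count_malformed() {
    gzip -dc "$1" |
        awk -F' [|][|][|] ' 'NF < 3 { bad++ } END { print bad + 0 }'
}
```

Running count_malformed phrase-table.0-0.gz should print 0 if every line has the expected shape.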

2. Phrase extraction only

2.1 Prepare input

Chaski requires Moses-style input; in fact, it replaces the Moses training steps after step 3. Therefore you may run Moses up to step 3 1). Look in the Moses training directory; in the model/ directory there should be three files:

 aligned.*.$src
 aligned.*.$tar
 aligned.$method

where src and tar are the source and target language symbols, and method is the alignment combination heuristic. For example, if you train a model to translate from Chinese (ch) to English (en) and use the grow-diag-and heuristic, then the three files should be 2):

 aligned.0.ch
 aligned.0.en
 aligned.grow-diag-and

Create a directory on a machine that has access to Hadoop, say /home/goodboy/chaski-train/, and copy or link the files into it.

Suppose that you can now see the three files listed above in /home/goodboy/chaski-train/.
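Before uploading, it can help to check the alignment file. The sketch below assumes the usual Moses alignment format, where each line is a space-separated list of index pairs such as 0-0 1-1 2-3 (an empty line means a sentence pair without alignment points). count_bad_alignments is a hypothetical helper, not part of Chaski.

```shell
# Print the number of lines in an alignment file that are not a
# space-separated list of "i-j" index pairs. Empty lines are accepted.
count_bad_alignments() {
    # grep -c exits with status 1 when the count is 0, hence "|| true"
    grep -cvE '^([0-9]+-[0-9]+)?( [0-9]+-[0-9]+)*$' "$1" || true
}
```

For the example above, count_bad_alignments aligned.grow-diag-and should print 0.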

2.2 Choose an HDFS directory

Next, choose an HDFS directory to which you have write access. It is used as temporary storage, and the final output also stays in this directory. Let's suppose you chose /user/goodboy/Chaski-HDFS/.

2.3 Generate default config file

Run the following command:

   cd /home/goodboy/chaski-train/
   $Chaski_HOME/scripts/setup-chaski \
              aligned.0.ch \
              aligned.0.en \
              aligned.grow-diag-and \
              /user/goodboy/Chaski-HDFS/ \
              > training.config

This generates a default training configuration file:

/home/goodboy/chaski-train/training.config

If you want, you may edit the file and modify some parameters; most parameters have an explanation above them. Pay special attention to the last two parameters:

# Output directory of Moses Phrase table
moses-p=/user/goodboy/Chaski-HDFS/moses-phrase
 
# Output directory of Moses lexiconized reorder table
moses-r=/user/goodboy/Chaski-HDFS/moses-reorder

The Moses-compatible phrase table and reorder table will be written to these directories.

2.4 Run the training

Finally, you are ready to start the training. Just run:

        $Chaski_HOME/scripts/extract /home/goodboy/chaski-train/training.config

and wait for it to complete.

If it completes successfully, you may want to fetch the Moses phrase table and reorder table to the local machine. You may run:

   hadoop fs -cat /user/goodboy/Chaski-HDFS/moses-phrase/* | gzip -c > phrase-table.0-0.gz
   hadoop fs -cat /user/goodboy/Chaski-HDFS/moses-reorder/* | gzip -c > reorder-table.0-0.gz

to get them. I did not include this step in the script because we can generally run the decoder on HDFS, and it is trivial for you to do it yourself.

3. Troubleshooting

3.1 Memory issue

The most frequent errors are memory errors. If you get an error like “OutOfMemoryError” or “GC overhead limit exceeded”, there are two ways to address it. First, you may try to increase the heap size of each mapper/reducer. These errors usually happen in the extract stage or the score stage, and you can modify the following parameters:

extract.hp=1200m
score.hp=1300m

However, if the heap size is too large, the map/reduce tasks will not start. If you get an error like “IOError” during initialization of the mappers 3), you should give up this method and reduce the heap size.

Then you can turn to other parameters for help. The two most useful ones are:

   score.mb=2000000
   score.cc=true

You may decrease the first and set the second to true. Both will reduce the speed, but if this works, you can increase the number of mappers/reducers to compensate.

3.2 Break the procedure down

You may want to run the procedure step by step. There are seven steps in full training and five steps in extraction only, and you may run a single step or a range of steps by specifying the first and last step by number.

The steps in the full training pipeline:

Usage :  /home/qing/ChaskiV3/ChaskiV3/scripts/train-full  CONFIG-FILE [FIRST-STEP]  [LAST-STEP]
Steps :
   1 - Make word class
   2 - Prepare corpus for word alignment
   3 - Word alignment
   4 - Extract phrases
   5 - Build lexicon tables
   6 - Score phrase table
   7 - Postprocess and output Moses phrase table

The steps in the extraction-only pipeline:

Usage :  /home/qing/ChaskiV3/ChaskiV3/scripts/extract  CONFIG-FILE [FIRST-STEP]  [LAST-STEP]
Steps :
   1 - Preprocess and put files to HDFS
   2 - Extract phrases
   3 - Build lexicon tables
   4 - Score phrase table
   5 - Postprocess and output Moses phrase table
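Based on the usage lines above, a range of steps is selected by passing the first and last step number after the configuration file. The wrapper below is only a sketch: run_steps and the DRY_RUN variable are hypothetical, and the wrapper merely checks the range before calling the script you give it.

```shell
# Run SCRIPT on CONFIG for steps FIRST..LAST.
# If DRY_RUN is set to a non-empty value, the command is printed instead
# of being executed.
run_steps() {
    script=$1; config=$2; first=$3; last=$4
    if [ "$first" -gt "$last" ]; then
        echo "first step ($first) must not exceed last step ($last)" >&2
        return 1
    fi
    # ${DRY_RUN:+echo} expands to "echo" only when DRY_RUN is non-empty
    ${DRY_RUN:+echo} "$script" "$config" "$first" "$last"
}
```

For example, run_steps $Chaski_HOME/scripts/extract training.config 4 5 would rerun only the scoring and postprocessing steps of the extraction-only pipeline.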

3.3 Find more parameters

Chaski actually has a lower-level interface, accessible through

 $Chaski_HOME/scripts/chaski

You may run it to see the help information.

All parameters can be specified either in the configuration file or as command-line parameters. You may also find that a parameter can have multiple aliases, such as

--corpus              String(R)        The input corpus file. We assume it is
  --extract.corpus                       on HDFS, and it should be generated by
                                         preprocessing stage

--num-reducer         int              Specify the number of reducers
  --extract.nr

Both --corpus and --extract.corpus point to the same variable, and you can use either of them in the configuration file, such as:

corpus=abc
extract.corpus=def

The reason for this is to enable sharing parameter names among modules. For example, --corpus is used in both the extract and buildlex modules, and the alias gives you a way to override it. Just remember that the alias overrides the default name.
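To make the override rule concrete, here is a small sketch that mimics it on a key=value file: if the alias is present its value wins, otherwise the default name is used. This only models the behavior described above and is not Chaski's own code; resolve_param is a hypothetical helper.

```shell
# Print the effective value of a parameter, given its alias and its
# default name. The alias takes precedence when both are present.
resolve_param() {
    file=$1; alias_name=$2; default_name=$3
    awk -F= -v a="$alias_name" -v d="$default_name" '
        $1 == a { sub(/^[^=]*=/, ""); av = $0; has_a = 1; next }
        $1 == d { sub(/^[^=]*=/, ""); dv = $0; has_d = 1 }
        END {
            if (has_a) print av
            else if (has_d) print dv
            else exit 1        # neither name is set
        }
    ' "$file"
}
```

With the two configuration lines shown above, resolve_param training.config extract.corpus corpus prints def, while a file that contains only corpus=abc yields abc.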

1) Specify --last-step 3 to stop it.
2) If the first two files are not there, the original corpus should work just fine.
3) Usually the mappers won't start, so the percentage will always be 0%.
chaski/tutorial.txt · Last modified: 2009/11/28 12:01 by edwardgao