Force Alignment Using MGIZA++

This is a mini-HOWTO on force aligning unseen test data using MGIZA++ with existing models. To force align unseen test data, you need the following:

  1. A set of models trained by MGIZA++. If you are using Model 1/2/3, you may use models output by GIZA++. However, if you are force aligning with IBM Model 4 or HMM, GIZA++ does not output models that MGIZA++ can load correctly.
  2. The vocabulary of the previous training. The vocabulary files are generated by the plain2snt executable, and in Moses they are located at root/corpus/*.vcb. If you want to force align with HMM or IBM Model 4, you also need root/corpus/*.vcb.classes. If you trained the model using Chaski, the vocabulary files are stored on HDFS under the names root/dict/[src|tgt] and root/dict/[src|tgt].classes.
  3. The preprocessed test corpus. Make sure the preprocessing is identical to that of the training data, otherwise you may not get correct output.
  4. The training parameters of the model. It is very important that you get the correct parameters: mismatched parameters can either result in inaccurate alignments or crash MGIZA. The best way of getting the correct parameters is to search the model output directory for a file named prefix.gizacfg, which contains all the parameters used in the previous training.
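The search described in item 4 could be done with a small shell function along the following lines. This is only a sketch: the helper name find_gizacfg and the directory giza.en-ch in the usage comment are assumptions for illustration and are not part of MGIZA++.

```shell
# Sketch: list candidate *.gizacfg files under a model output directory.
# find_gizacfg is a hypothetical helper name, not an MGIZA++ tool.
find_gizacfg() {
    model_dir=$1
    if [ ! -d "$model_dir" ]; then
        echo "no such directory: $model_dir" >&2
        return 1
    fi
    # Print every regular file ending in .gizacfg, one per line, sorted.
    find "$model_dir" -type f -name '*.gizacfg' | sort
}

# Usage (assumed Moses layout):
#   find_gizacfg giza.en-ch
```

If more than one file is printed, pick the one whose prefix matches the model files you intend to load.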

Step 1. Prepare Corpus

Force alignment requires mapping the words to ids according to the existing vocabulary files and regenerating the co-occurrence table with the snt2cooc executable. To process the vocabulary and generate the new vocabulary files, you can use the script plain2snt-hasvcb.py, which will also be included in a future release of MGIZA 1).

Assuming we are running Chinese-English alignment, the following command generates the new vocabularies and .snt files:

./plain2snt-hasvcb.py ch.vcb en.vcb mt06.ch mt06.en en-ch.snt ch-en.snt ch.pk.vcb en.pk.vcb

The first two parameters are the vocabulary files from the previous training, and the next two are the source and target corpus files you want to force align. The following two, en-ch.snt and ch-en.snt, are the output corpus files with words mapped to ids and UNKs properly handled. Note that en-ch comes before ch-en; this is for compatibility with the Moses naming convention. So if you have a Moses training, you should use en-ch.snt with the models in giza.en-ch/ and ch-en.snt with the models in giza.ch-en/.

The last two parameters are the processed vocabulary files, in which the UNKs are appended to the end of the vocabulary.

If you are going to do force alignment with HMM or Model 4, MGIZA will search for the word class files, which in this case are expected to be ch.pk.vcb.classes and en.pk.vcb.classes. If you want to handle UNK words, you can either re-run mkcls, which takes hours, or randomly assign classes to the UNK words. A better solution, if the number of UNKs is small, may be to simply use the original class files. In that case you need to create symbolic links from the original .classes files to the expected file names.
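The symbolic-link option could look like the sketch below. The helper name link_classes is hypothetical, and the paths in the usage comment assume the Moses layout (corpus/*.vcb.classes) described earlier; adjust them to wherever your original class files live.

```shell
# Sketch: reuse an original word class file under the name MGIZA expects.
# link_classes is a hypothetical helper; it does not assign classes to UNK words.
link_classes() {
    original=$1   # e.g. corpus/ch.vcb.classes from the previous training
    expected=$2   # e.g. ch.pk.vcb.classes, next to the generated ch.pk.vcb
    if [ ! -f "$original" ]; then
        echo "missing class file: $original" >&2
        return 1
    fi
    # Use an absolute target so the link stays valid wherever it is created.
    abs_dir=$(cd "$(dirname "$original")" && pwd)
    ln -sf "$abs_dir/$(basename "$original")" "$expected"
}

# Usage (assumed paths):
#   link_classes corpus/ch.vcb.classes ch.pk.vcb.classes
#   link_classes corpus/en.vcb.classes en.pk.vcb.classes
```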

The next step is simple: run the snt2cooc executable to generate the co-occurrence file. Note that the usage of the snt2cooc binary in MGIZA differs from that in GIZA: it does not write to stdout, so you need to specify an output filename instead.

./snt2cooc en-ch.cooc ch.pk.vcb en.pk.vcb en-ch.snt

This generates the co-occurrence file en-ch.cooc.

Note that you cannot use the co-occurrence file generated for the original corpus, because some word pairs in the new data do not occur in that corpus; using it may cause a segmentation fault.

Step 2. Run Alignment

Running the alignment is simple; you just need to run

./mgiza PREVIOUS_GIZACFG [options]

where PREVIOUS_GIZACFG is the GIZA configuration file (prefix.gizacfg) mentioned in the list of requirements above. For example, in a Moses training the name would be

giza.en-ch/en-ch.gizacfg

The key, however, is to set the options correctly. You need to specify the input data and, according to which model you want to load, the restart level and the required model files. The table below shows the necessary files and parameter settings for each model stage. M1 means that you load a previously trained Model 1 and force align the data with it. To read the table, take the -restart row and the M1 column: the cell contains 1, which means that for Model 1 force alignment you should specify -restart 1 on the command line.

Parameter      Meaning                File name            Example (Moses)   M1   HMM  M2   M3   M4
-c             input corpus           en-ch.snt            (generated)       Required
-s             source vocabulary      ch.pk.vcb            (generated)       Required
-t             target vocabulary      en.pk.vcb            (generated)       Required
-o             output prefix          anypath+prefix       (you specify)     Required
-coocurrence   co-occurrence file     en-ch.cooc           (generated)       Required

-restart       restart level          (number)             (you specify)     1    6    3    9    11
-m1            model 1 iterations     (number)             (you specify)     1    0    0    0    0
-mh            hmm iterations         (number)             (you specify)     0    1    0    0    0
-m2            model 2 iterations     (number)             (you specify)     0    0    1    0    0
-m3            model 3 iterations     (number)             (you specify)     0    0    0    1    0
-m4            model 4 iterations     (number)             (you specify)     0    0    0    0    1

-previoust     previous 't' table     *.t*.[1-9|final]     en-ch.t3.final    Required
-previousa     previous 'a' table     *.a*.[1-9|final]     en-ch.a3.final    Required
-previousd     previous 'd' table     *.d*.[1-9|final]     en-ch.d3.final    Required
-previousn     previous 'n' table     *.n*.[1-9|final]     en-ch.n3.final    Required
-previousd4    previous 'd4' table    *.d4.[1-9|final]     en-ch.d4.final    R…
-previousd42   previous 'd4' table    *.D4.[1-9|final]     en-ch.D4.final    R…
-previoushmm   previous 'hmm' table   *.hhmm.[1-9]         en-ch.hhmm.5      R… Optional
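The -restart and iteration rows of the table can be captured in a small lookup, which may help avoid mixing up the columns when assembling a command line. The sketch below only encodes those rows; the helper name stage_flags is hypothetical, and the input files and -previous* options still have to be added by hand.

```shell
# Sketch: print the -restart and iteration flags for one model stage,
# using the values from the table. stage_flags is a hypothetical helper.
stage_flags() {
    stage=$1
    m1=0; mh=0; m2=0; m3=0; m4=0
    case $stage in
        M1)  restart=1;  m1=1 ;;
        HMM) restart=6;  mh=1 ;;
        M2)  restart=3;  m2=1 ;;
        M3)  restart=9;  m3=1 ;;
        M4)  restart=11; m4=1 ;;
        *)   echo "unknown stage: $stage" >&2; return 1 ;;
    esac
    echo "-restart $restart -m1 $m1 -mh $mh -m2 $m2 -m3 $m3 -m4 $m4"
}

stage_flags M4   # prints: -restart 11 -m1 0 -mh 0 -m2 0 -m3 0 -m4 1
```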

Below is an example of running Model 4 alignment:

./mgiza giza.en-ch/en-ch.gizacfg -c en-ch.snt -o output/en-ch \
-s ch.pk.vcb -t en.pk.vcb -m1 0 -m2 0 -mh 0 -coocurrence en-ch.cooc \
-restart 11 -previoust giza.en-ch/en-ch.t3.final \
-previousa giza.en-ch/en-ch.a3.final -previousd giza.en-ch/en-ch.d3.final \
-previousn giza.en-ch/en-ch.n3.final -previousd4 giza.en-ch/en-ch.d4.final \
-previousd42 giza.en-ch/en-ch.D4.final -m3 0 -m4 1

Resuming Model 3 and Model 4 Training

As mentioned in the table above, loading the HMM model for Model 3 and Model 4 is optional. The HMM is used in Model 3 and Model 4 to obtain the seed alignment, and MGIZA falls back to Model 2 if no HMM is present. Another reason is that people usually do not dump HMM models. However, if the HMM model is not loaded, the Model 3 and Model 4 results after resuming are not exactly the same as those of a single-run training. On large data the difference is usually quite small, but on small data we do notice unexpected errors.

Moses-express

If you just want to align a test set using an existing Moses training, you can simply use this script:

force-align-moses.sh.

This script will also be included in a future release of MGIZA.

Before running the script, you need to build and install MGIZA and set the environment variable ${QMT_HOME} to the installation directory of MGIZA 2).

Then go to the Moses training directory and link the corpus you want to align to ${PREFIX}.${src} and ${PREFIX}.${tgt}, where src and tgt are the source and target tags used in the Moses training. After that, just type:

${QMT_HOME}/scripts/force-align-moses.sh ${PREFIX} ${src} ${tgt} OUTPUT-DIR

Using the previous Chinese-English training as an example:

${QMT_HOME}/scripts/force-align-moses.sh mt06 ch en force-align

After running, you will find the force alignments at:

force-align/giza.ch-en/ch-en.A3.final.part?
force-align/giza.en-ch/en-ch.A3.final.part?

Resume Training

In addition to force alignment, you can also resume training from an existing model. To do that, just set more iterations on parameters such as -m1, -m2, -mh, -m3, and -m4. But remember: if you add more data to the corpus, you need to run Step 1 on the new corpus.

1) As of version 0.2, it is not included.
2) So that ${QMT_HOME}/bin/mgiza can be accessed.
mgiza/forcealignment.txt · Last modified: 2010/05/10 15:36 by edwardgao