This is a mini-HOWTO on force-aligning unseen test data using MGIZA++ with existing models. To force-align unseen test data, you need the following:
Force alignment requires mapping the words to IDs according to the existing vocabulary files and regenerating the co-occurrence table with the snt2cooc executable. To process the existing vocabulary and generate the new one, you can use the script plain2snt-hasvcb.py; the script will also be included in a future release of MGIZA.
Assuming we are running Chinese-English alignment, the following command generates the new vocabularies and .snt files:
./plain2snt-hasvcb.py ch.vcb en.vcb mt06.ch mt06.en en-ch.snt ch-en.snt ch.pk.vcb en.pk.vcb
The first two parameters are the vocabulary files from the previous training, and the next two are the source and target corpus files you want to force-align. The following two, en-ch.snt and ch-en.snt, are the numberized corpus files with UNKs properly handled. Note that en-ch comes before ch-en; this is for compatibility with the Moses naming convention. So if you have a Moses training run, you should use en-ch.snt with the models in giza.en-ch/ and vice versa.
The last two parameters are the processed vocabulary files, with UNKs added to the end of the vocabulary.
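The mapping step can be pictured with a minimal Python sketch. This is not the code of plain2snt-hasvcb.py; it assumes the usual GIZA++ vocabulary layout of one `id word count` entry per line, and the function names `load_vcb` and `numberize` are hypothetical.

```python
def load_vcb(lines):
    """Parse GIZA-style vocabulary lines ("id word count") into a word -> id map."""
    vocab = {}
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:
            vocab[fields[1]] = int(fields[0])
    return vocab


def numberize(sentence, vocab):
    """Map each token to its id; unseen tokens get fresh ids after the largest one."""
    next_id = max(vocab.values(), default=1) + 1
    ids = []
    for token in sentence.split():
        if token not in vocab:
            vocab[token] = next_id  # UNK appended to the end of the vocabulary
            next_id += 1
        ids.append(vocab[token])
    return ids


vocab = load_vcb(["2 the 10", "3 cat 4"])
print(numberize("the black cat", vocab))  # [2, 4, 3]
```

The unseen word "black" receives the next free id, so the ids of the words known from training stay unchanged and the existing model tables remain valid.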
If you are going to do force alignment with HMM or Model 4, MGIZA will look for word class files, which in this case are expected to be named ch.pk.vcb.classes and en.pk.vcb.classes. To handle UNK words, you can either re-run mkcls, which takes hours, or randomly assign classes to the UNK words. A better solution, if the number of UNKs is small, may be to just use the original class files. In that case you need to create symbolic links from the original .classes files to the expected file names.
The next step is simple: run the snt2cooc executable to generate the co-occurrence file. Note that the usage of the snt2cooc binary in MGIZA differs from GIZA: it does not write to stdout; instead you need to specify an output filename.
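Creating the links can be scripted. The sketch below is only an illustration: the original file name `ch.vcb.classes` is an assumption about how the training run named its class file, the file contents are a placeholder, and `link_classes` is a hypothetical helper.

```python
import os
import tempfile


def link_classes(original, expected):
    """Point the expected .classes name at the original file, replacing a stale link."""
    if os.path.islink(expected):
        os.remove(expected)
    os.symlink(os.path.abspath(original), expected)


# Demonstration in a scratch directory; "ch.vcb.classes" is an assumed original name.
workdir = tempfile.mkdtemp()
original = os.path.join(workdir, "ch.vcb.classes")
with open(original, "w") as handle:
    handle.write("word\t1\n")  # placeholder content, not real mkcls output
expected = os.path.join(workdir, "ch.pk.vcb.classes")
link_classes(original, expected)
print(os.path.realpath(expected) == os.path.realpath(original))  # True
```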
./snt2cooc en-ch.cooc ch.pk.vcb en.pk.vcb en-ch.snt
This generates the co-occurrence file en-ch.cooc.
Note that you cannot use the co-occurrence file generated for the original corpus: some word pairs in the test data do not exist in that corpus, and using it may cause a segmentation fault.
Running the alignment is simple; you just need to run
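To see why the old co-occurrence file is not enough, consider this small Python sketch. It is a conceptual illustration rather than a reimplementation of snt2cooc (it ignores the NULL word, for instance), and the ids and the `cooc_pairs` helper are made up for the example.

```python
def cooc_pairs(sentence_pairs):
    """Collect every (source id, target id) pair that co-occurs in some sentence pair."""
    pairs = set()
    for source_ids, target_ids in sentence_pairs:
        for s in source_ids:
            for t in target_ids:
                pairs.add((s, t))
    return pairs


training = [([2, 3], [5, 6])]
test = [([2, 4], [5, 7])]  # ids 4 and 7 stand for words unseen in training
missing = cooc_pairs(test) - cooc_pairs(training)
print(sorted(missing))  # [(2, 7), (4, 5), (4, 7)]
```

Any pair in `missing` would be looked up during alignment of the test data but is absent from a table built on the training corpus alone, which is why the table has to be regenerated.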
./mgiza PREVIOUS_GIZACFG [options]
where PREVIOUS_GIZACFG is the GIZA configuration file mentioned in the previous section. For example, in a Moses training run the name would be
giza.en-ch/en-ch.gizacfg
The key, however, is to set the options correctly. You need to specify the input data and, depending on which model you want to load, the matching restart level and the required model files. The table below shows the necessary files and parameter settings for each model stage. M1 means you load a previously trained Model 1 and force-align the data with it. For example, the cell in the -restart row and the M1 column reads 1, which means that to perform Model 1 force alignment you should specify -restart 1 on the command line.
| Parameter | Meaning | File name | Example (Moses) | M1 | HMM | M2 | M3 | M4 |
|---|---|---|---|---|---|---|---|---|
| -c | input corpus | en-ch.snt | (generated) | Required | ||||
| -s | source vocabulary | ch.pk.vcb | (generated) | Required | ||||
| -t | target vocabulary | en.pk.vcb | (generated) | Required | ||||
| -o | Output prefix | anypath+prefix | (you specify) | Required | ||||
| -coocurrence | co-occurrence file | en-ch.cooc | (generated) | Required | ||||
| -restart | restart level | (number) | (you specify) | 1 | 6 | 3 | 9 | 11 |
| -m1 | model 1 iterations | (number) | (you specify) | 1 | 0 | 0 | 0 | 0 |
| -mh | hmm iterations | (number) | (you specify) | 0 | 1 | 0 | 0 | 0 |
| -m2 | model 2 iterations | (number) | (you specify) | 0 | 0 | 1 | 0 | 0 |
| -m3 | model 3 iterations | (number) | (you specify) | 0 | 0 | 0 | 1 | 0 |
| -m4 | model 4 iterations | (number) | (you specify) | 0 | 0 | 0 | 0 | 1 |
| -previoust | previous 't' table | *.t*.[1-9\|final] | en-ch.t3.final | Required | ||||
| -previousa | previous 'a' table | *.a*.[1-9\|final] | en-ch.a3.final | Required | ||||
| -previousd | previous 'd' table | *.d*.[1-9\|final] | en-ch.d3.final | Required | ||||
| -previousn | previous 'n' table | *.n*.[1-9\|final] | en-ch.n3.final | Required | ||||
| -previousd4 | previous 'd4' table | *.d4.[1-9\|final] | en-ch.d4.final | R… | ||||
| -previousd42 | previous 'd4' table | *.D4.[1-9\|final] | en-ch.D4.final | R… | ||||
| -previoushmm | previous 'hmm' table | *.hhmm.[1-9] | en-ch.hhmm.5 | R… | Optional | |||
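The restart levels and iteration counts in the table follow a simple pattern: one iteration of the model being loaded and zero of everything else. A small Python sketch, with the hypothetical helper `stage_options`, can assemble that part of the command line. It covers only -restart and the iteration flags; the input files and -previous* model files still have to be supplied separately.

```python
# Restart level and the single active iteration flag per stage, copied from the table.
STAGES = {
    "M1": {"restart": 1, "iter_flag": "-m1"},
    "HMM": {"restart": 6, "iter_flag": "-mh"},
    "M2": {"restart": 3, "iter_flag": "-m2"},
    "M3": {"restart": 9, "iter_flag": "-m3"},
    "M4": {"restart": 11, "iter_flag": "-m4"},
}
ITER_FLAGS = ["-m1", "-mh", "-m2", "-m3", "-m4"]


def stage_options(stage):
    """Return the -restart and iteration options for force alignment at one stage."""
    opts = ["-restart", str(STAGES[stage]["restart"])]
    for flag in ITER_FLAGS:
        opts += [flag, "1" if flag == STAGES[stage]["iter_flag"] else "0"]
    return opts


print(" ".join(stage_options("M4")))
# -restart 11 -m1 0 -mh 0 -m2 0 -m3 0 -m4 1
```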
Below is an example of running Model 4 alignment:
./mgiza giza.en-ch/en-ch.gizacfg -c en-ch.snt -o output/en-ch \
  -s ch.pk.vcb -t en.pk.vcb -m1 0 -m2 0 -mh 0 -coocurrence en-ch.cooc \
  -restart 11 -previoust giza.en-ch/en-ch.t3.final \
  -previousa giza.en-ch/en-ch.a3.final -previousd giza.en-ch/en-ch.d3.final \
  -previousn giza.en-ch/en-ch.n3.final -previousd4 giza.en-ch/en-ch.d4.final \
  -previousd42 giza.en-ch/en-ch.D4.final -m3 0 -m4 1
As I mentioned above, loading the HMM model for Model 3 and Model 4 is optional. One reason is that the HMM is only used in Model 3 and Model 4 to get the seed alignment, and we fall back to Model 2 if no HMM is present. Another reason is that people usually don't dump HMM models. However, if the HMM model is not loaded, Models 3 and 4 after resuming are not exactly the same as in a single-run training. On large data the difference is usually quite small, but on small data we do notice unexpected errors.
If you just want to align a test set using an existing Moses training run, you may just use the script here:
This script will also be included in a future release of MGIZA.
Before running the script, you need to build and install MGIZA and set the environment variable ${QMT_HOME} to the MGIZA installation directory.
Then go to the Moses training directory and link the corpus you want to align to ${PREFIX}.${src} and ${PREFIX}.${tgt}, where src and tgt are the source and target tags used in the Moses training. After that, just type:
${QMT_HOME}/scripts/force-align-moses.sh ${PREFIX} ${src} ${tgt} OUTPUT-DIR
Using the previous Chinese-English training as an example:
${QMT_HOME}/scripts/force-align-moses.sh mt06 ch en force-align
After running, you will find the force alignment at:
force-align/giza.ch-en/ch-en.A3.final.part?
force-align/giza.en-ch/en-ch.A3.final.part?
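To inspect the result, the alignment lines can be parsed with a few lines of Python. The sketch assumes the usual GIZA++ A3 layout, in which every sentence pair takes three lines (a comment line, one sentence, and a line of the form `NULL ({ }) word ({ 1 }) ...` whose numbers are 1-based positions in the sentence on the line before); please check this against your own output files. The example line and the helper `parse_a3_line` are made up for illustration.

```python
import re

# One token followed by its "({ ... })" group of 1-based positions.
ALIGNED_TOKEN = re.compile(r"(\S+) \(\{([\d ]*)\}\)")


def parse_a3_line(line):
    """Return (token, [1-based positions]) for each token of an A3 alignment line."""
    return [(tok, [int(i) for i in idx.split()])
            for tok, idx in ALIGNED_TOKEN.findall(line)]


example = "NULL ({ }) le ({ 1 }) chat ({ 2 })"
print(parse_a3_line(example))
# [('NULL', []), ('le', [1]), ('chat', [2])]
```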
In addition to force alignment, you can also resume training using an existing model. To do that, just set more iterations on parameters such as -m1, -m2, -mh, -m3, and -m4. But remember that if you add more data to the corpus, you need to run Step 1 on the new corpus.