MGIZA++ Configuration

This section explains all the parameters used in MGIZA++, grouped by their original categories (input, output, EM, and so on).

A parameter may have several aliases. If more than one alias of the same parameter appears on the command line or in the configuration file, the one that appears last is kept, and a value given on the command line overrides the one in the configuration file.
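
The precedence rule above can be illustrated with a small sketch. This is not MGIZA++ code: the function resolve_parameters and the (name, value) pair representation are assumptions made only for illustration, and the alias table is a subset copied from the iteration parameters listed in this document.

```python
def resolve_parameters(config_pairs, cli_pairs, aliases):
    """Resolve (name, value) pairs into one value per canonical parameter.

    config_pairs / cli_pairs: lists of (name_or_alias, value) in order of appearance.
    aliases: dict mapping each alias to its canonical parameter name.
    This is a hypothetical helper, not part of MGIZA++.
    """
    resolved = {}
    # Configuration-file entries are applied first so that command-line entries
    # overwrite them; within each source, a later entry overwrites an earlier one.
    for name, value in list(config_pairs) + list(cli_pairs):
        canonical = aliases.get(name, name)
        resolved[canonical] = value
    return resolved


# Subset of the alias table from the "Number of iterations" section.
ALIASES = {
    "m1": "model1iterations",
    "noiterationsmodel1": "model1iterations",
    "mh": "hmmiterations",
}

config = [("model1iterations", "5"), ("mh", "5")]
cli = [("m1", "3"), ("noiterationsmodel1", "4")]
print(resolve_parameters(config, cli, ALIASES))
```

Here model1iterations ends up as 4 (the last alias on the command line wins), while hmmiterations keeps the value 5 from the configuration file.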

Number of iterations

These parameters define the training sequence, i.e., how many iterations are performed for each type of model. The simpler models are used to initialize the more complex ones, so a number of iterations of the simpler models is necessary. HMM is considered a replacement for model 2; the most widely used training sequence is:

-m1 5 -m2 0 -mh 5 -m3 3 -m4 3 -m5 0 -m6 0

| Name | Aliases | Type | Meaning |
| model1iterations | m1, noiterationsmodel1 | INT | Number of iterations when training model 1 |
| model2iterations | m2, noiterationsmodel2 | INT | Number of iterations when training model 2 |
| model3iterations | m3, noiterationsmodel3 | INT | Number of iterations when training model 3 |
| model4iterations | m4, noiterationsmodel4 | INT | Number of iterations when training model 4 |
| model5iterations | m5, noiterationsmodel5 | INT | Number of iterations when training model 5 |
| model6iterations | m6, noiterationsmodel6 | INT | Number of iterations when training model 6 |
| hmmiterations | mh, numberofiterationsforhmmalignmentmodel | INT | Number of iterations when training HMM |
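
As a minimal sketch, the training sequence can be assembled from a dictionary of short aliases. The helper training_sequence_args is hypothetical; it only builds the option tokens and makes no assumption about the name of the executable.

```python
def training_sequence_args(iterations):
    """Turn {alias: iteration count} into a flat list of command-line tokens.

    Hypothetical helper for illustration; the aliases (m1, mh, ...) come from
    the table in this section.
    """
    args = []
    for alias, count in iterations.items():
        if count < 0:
            raise ValueError(f"iteration count for {alias} must be >= 0")
        args += [f"-{alias}", str(count)]
    return args


# The widely used sequence quoted in this section: HMM replaces model 2.
sequence = {"m1": 5, "m2": 0, "mh": 5, "m3": 3, "m4": 3, "m5": 0, "m6": 0}
print(" ".join(training_sequence_args(sequence)))
```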

Output files and format

| Name | Aliases | Type | Meaning |
| compactalignmentformat | | BOOL | Whether to output the compact alignment format. The compact format omits the text of the source/target sentences and contains only alignment links, such as 1 1 2 5. |
| countoutputprefix | | STRING | Mostly used in distributed training. Similar to outputfileprefix, but the count tables before normalization are dumped instead of the normalized model. |
| dumpcount | | BOOL | A trigger for dumping counts in addition to the normalized models. Note that the counts are ONLY dumped in the final step. |
| dumpcountusingwordstring | | BOOL | By default, word ids appear in the count tables. If this parameter is set to true, the word surface forms are used instead. |
| logfile | l | STRING | The path of the log file |
| outputfileprefix | o | STRING | The prefix for output files |
| outputpath | | STRING | (Not used and will be removed in the next version) |
| model1dumpfrequency | t1 | INT | Dump the model/alignment every t1 iterations: 0 means never dump, 1 means dump after every iteration. Setting t1 equal to m1 dumps only the last iteration. |
| model2dumpfrequency | t2 | INT | Dump frequency of model 2 |
| transferdumpfrequency | t2to3 | INT | If model 2 is used instead of HMM, an additional iteration is needed to transfer from model 2 to model 3; this parameter triggers dumping of that iteration. |
| model345dumpfrequency | t345 | INT | Dump frequency of model 3/4/5/6 |
| hmmdumpfrequency | th | INT | Dump frequency of HMM |
| nbestalignments | | BOOL | Whether to dump N-best alignments instead of the Viterbi alignment |
| nodumps | | BOOL | If true, no dumps are made for model files and alignment files |
| onlyaldumps | | BOOL | If true, only alignments are dumped. (This forces models 3/4/5 to output the alignment of the last step, i.e. *.A3.final.) |
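
The dump-frequency semantics described for t1 (0 means no dump, 1 means every iteration, t1=m1 means only the last iteration) are consistent with a simple modulo rule. The sketch below assumes that rule; it is an illustration, not the actual MGIZA++ logic, which may additionally write final files regardless of the frequency.

```python
def dumped_iterations(total_iterations, dump_frequency):
    """Return the 1-based iterations at which a dump would be written.

    Assumes a dump happens whenever the iteration number is a multiple of
    dump_frequency. This modulo rule is an assumption for illustration.
    """
    if dump_frequency <= 0:
        return []  # 0 means no dump at all
    return [it for it in range(1, total_iterations + 1) if it % dump_frequency == 0]


print(dumped_iterations(5, 0))  # no dumps
print(dumped_iterations(5, 1))  # every iteration
print(dumped_iterations(5, 5))  # t1 = m1: only the last iteration
```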

The output filenames are generated as follows:

outputname = outputprefix + "." + modelType + trainingStage +  "."  + iteration;

where outputprefix is the value of the outputfileprefix parameter; modelType can be t, a, d, n, D or h; and trainingStage can be 1 (model 1), 2 (model 2), 3 (model 3/4/5, for the d model), 4, 5 or hmm.
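
The naming rule can be written as a one-line function. The function name output_name and the example prefix are assumptions chosen for illustration; the concatenation itself follows the formula above.

```python
def output_name(output_prefix, model_type, training_stage, iteration):
    """Build an output filename as prefix + "." + modelType + trainingStage + "." + iteration.

    Hypothetical helper mirroring the formula in this section; all arguments
    are converted to strings before concatenation.
    """
    return f"{output_prefix}.{model_type}{training_stage}.{iteration}"


# "prefix" is a placeholder output prefix.
print(output_name("prefix", "t", 1, 5))        # t table of model 1, iteration 5
print(output_name("prefix", "t", 3, "final"))  # final t table of the model 3 stage
```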

Input files and format

| Name | Aliases | Type | Meaning |
| corpusfile | c | STRING | Input corpus file (the snt file) |
| testcorpusfile | tc | STRING | Test corpus file (the snt file). This file is only aligned; its counts do not affect the models. |
| sourcevocabularyfile | s | STRING | Source vocabulary file (.vcb). Note that in Moses the definitions of source and target are reversed, so when training ch-en, ch is the target and en is the source. |
| targetvocabularyfile | t | STRING | Target vocabulary file (.vcb) |
| restart | | INT | (MGIZA only) Restart training from a certain level; see the restart levels below. |
| previousa/d/d4/d42/hmm/n/t | | STRING | Previous model files used to resume training |

MGIZA supports resuming training, which requires loading the previous models and setting the correct restart level. The table below lists the restart levels and the model files needed for each level.

Restart Level Previous table(s)
restart meaning t(lex) a(dist) hmm d(dist) n(fert) d4/d42
0 normal training
1 continue model 1 from model 1
2 initialize model 2 from model 1
3 continue model 2 from model 2
4 initialize HMM from model 1
5 initialize HMM from model 2
6 continue HMM from HMM
7 initialize model 3 from HMM
8 initialize model 3 from model 2
9 continue model 3 from model 3
10 initialize model 4 from model 3
11 continue model 4 from model 4

Cut-off and floorings

| Name | Aliases | Type | Meaning |
| countincreasecutoff | | FLOAT | When accumulating individual counts, an increment smaller than this amount is not added. (This variable appears to have NO EFFECT and is overwritten by mincountincrease.) |
| mincountincrease | | FLOAT | Same as countincreasecutoff, but this is the one that really takes effect. |
| countincreasecutoffal | | FLOAT | Similar in meaning to mincountincrease, but it does not apply to model 1, model 2 or HMM; it only applies to model 3 and later. |
| probcutoff | | FLOAT | When writing model files, all entries smaller than this value are omitted. |
| probsmooth | | FLOAT | When a probability entry cannot be found in a model, this value is used instead. |
| peggedcutoff | | FLOAT | When training models 3, 4 and 5, a different method is used: the Viterbi alignment is found first, and then several links are shuffled to "sample" new alignments. This value is the relative threshold that decides which alignments are accepted during sampling: an alignment a' is accepted if score(a') > score(a^*) \times peggedcutoff, where a^* is the Viterbi alignment. |
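
Two of the thresholds above can be sketched in a few lines. The function names, the dictionary-based count table and the numeric values are all hypothetical; only the comparison rules are taken from the descriptions of peggedcutoff and mincountincrease.

```python
def accept_sampled_alignment(score_candidate, score_viterbi, pegged_cutoff):
    """Accept a sampled alignment a' if score(a') > score(a*) * peggedcutoff."""
    return score_candidate > score_viterbi * pegged_cutoff


def add_count(table, key, increment, min_count_increase):
    """Add an increment to a count table unless it is below the cut-off.

    The dict-based table is an assumption for illustration only.
    """
    if increment < min_count_increase:
        return  # increment too small: it is not added
    table[key] = table.get(key, 0.0) + increment


# Made-up scores: the Viterbi alignment scores 0.5 and the threshold is 0.03.
print(accept_sampled_alignment(0.02, 0.5, 0.03))  # above 0.015, accepted
print(accept_sampled_alignment(0.01, 0.5, 0.03))  # below 0.015, rejected

counts = {}
add_count(counts, ("house", "maison"), 1e-9, 1e-7)  # ignored
add_count(counts, ("house", "maison"), 0.5, 1e-7)   # added
print(counts)
```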

Smoothing factors

| Name | Aliases | Type | Meaning |
| emalsmooth | | FLOAT | For HMM only. When enabled through the emSmoothHMM parameter, the jump probabilities are interpolated with a uniform distribution, and this value is the weight of the uniform distribution (1.0 means always use the uniform value, 0 means no interpolation). |
| emSmoothHMM | | SHORT | Flags selecting the smoothing method for the HMM model: the first bit toggles the "modified counts" method, and the second bit toggles smoothing with the emalsmooth parameter. |
| model23smoothfactor | | FLOAT | The smoothing factor for the distortion model. The probability is interpolated with a uniform distribution, and the smoothing factor is the weight of the uniform distribution (1.0 means always use the uniform value, 0 means no interpolation). |
| model4smoothfactor | | FLOAT | The smoothing factor for the distortion model (model 4). The probability is interpolated with a uniform distribution, and the smoothing factor is the weight of the uniform distribution (1.0 means always use the uniform value, 0 means no interpolation). |
| model5smoothfactor | | FLOAT | The smoothing factor for the distortion model (model 5). The probability is interpolated with a uniform distribution, and the smoothing factor is the weight of the uniform distribution (1.0 means always use the uniform value, 0 means no interpolation). |
| nsmooth | | FLOAT | Smoothing factor for the word-length-dependent fertility parameters |
| nsmoothgeneral | | FLOAT | Smoothing factor for the global fertility probability |
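
The interpolation with a uniform distribution described for emalsmooth and the model smoothing factors can be sketched as follows. The function smooth_with_uniform is a hypothetical illustration of linear interpolation, not the code used inside MGIZA++, and the example distribution is made up.

```python
def smooth_with_uniform(probs, factor):
    """Interpolate a distribution with the uniform distribution.

    factor is the weight of the uniform distribution: 1.0 yields the uniform
    distribution, 0.0 leaves the input unchanged.
    """
    if not 0.0 <= factor <= 1.0:
        raise ValueError("factor must lie in [0, 1]")
    uniform = 1.0 / len(probs)
    return [(1.0 - factor) * p + factor * uniform for p in probs]


distribution = [0.7, 0.2, 0.1]  # made-up jump/distortion probabilities
print(smooth_with_uniform(distribution, 0.0))  # unchanged
print(smooth_with_uniform(distribution, 0.5))  # halfway towards uniform
print(smooth_with_uniform(distribution, 1.0))  # fully uniform
```

Because both inputs sum to one, the interpolated values also sum to one for any factor in [0, 1].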

nsmooth and nsmoothgeneral are used when renormalizing the fertility table. The fertility table holds the probability P(N|e_i), where N is the number of words that e_i translates to. Some words appear only once or twice, so it is impossible to collect sufficient statistics for every possible N. Therefore, during renormalization, the probability is smoothed by interpolating it with two global probabilities: P(N|Len(e_i)), where Len(e_i) is the byte length of e_i, and P(N). Because the byte length depends on the encoding, DIFFERENT ENCODINGS will give DIFFERENT RESULTS when training word alignment models.
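
The effect of the encoding on byte length is easy to check. The snippet below is only an illustration of why Len(e_i) changes with the encoding; the helper byte_length and the example word are not taken from MGIZA++.

```python
def byte_length(word, encoding):
    """Return the number of bytes the word occupies in the given encoding."""
    return len(word.encode(encoding))


word = "\u4e2d\u6587"  # the two-character Chinese word for "Chinese"
print(byte_length(word, "utf-8"))  # 3 bytes per character
print(byte_length(word, "gbk"))    # 2 bytes per character
```

The same word therefore falls into different length classes depending on how the corpus is encoded, which changes the P(N|Len(e_i)) term used for smoothing.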

Model Definitions & Probabilities

| Name | Aliases | Type | Meaning |
| compactadtable | | BOOL | Whether to use a 3-dimensional (=1) or 4-dimensional (=0) distortion/alignment table |
| deficientdistortionforemptyword | | SHORT | |
| depm4 | | SHORT | Flags for dependencies in model 4 |
| depm5 | | SHORT | Flags for dependencies in model 5 |
| emalignmentdependencies | | SHORT | Flags for dependencies in the HMM alignment model |
| emprobforempty | | FLOAT | The probability of the empty word during forward-backward training in HMM; it affects the NULL-word alignments in HMM. |
| m5p0 | | FLOAT | The probability of the empty word for model 5. If set to -1, it is trained. |
| p0 | | FLOAT | The probability of the empty word for model 3/4. If set to -1, it is trained. Training it is generally a bad idea because models 3/4 are deficient. People usually use a magic number, typically 0.975 or so. |
| maxfertility | | INT | Maximum fertility allowed in the fertility models. It affects training in two ways: 1. The fertility table holds only maxfertility+1 probabilities for each word. 2. The input corpus is preprocessed: if a sentence pair exceeds the ratio, for example one word on the source side and maxfertility+1 words on the target side, the sentence pair is truncated. |

About compactadtable: the original distortion model has three dependencies, P(i|j,L,M), where i and j are positions in the source/target sentences and L and M are the source/target sentence lengths. Because this makes the table dimension too high, the dependency on M is dropped when compactadtable is 1.

Dependency of HMM model

Bit Meaning
1 sentence length
2 (target) previous class
3 (target) previous position
4 (source) previous position
5 (source) previous class

Dependency of Model 4/5

Bit Meaning
Dependencies for first word
1 source position
2 target position
3 source class
4 target class
Dependencies for second word and on
5 source position
6 target position
7 source class
8 target class
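
A flag value such as depm4 or depm5 can be composed from the bit numbers in the table above. The sketch assumes that bit k corresponds to the value 2^(k-1), i.e. bit 1 is the least significant bit; that numbering is an assumption and should be checked against the MGIZA++ source before relying on it. The helper names are hypothetical.

```python
def flags_from_bits(bits):
    """Combine 1-based bit numbers into an integer flag value.

    Assumes bit k has the value 2**(k-1); this mapping is an assumption.
    """
    value = 0
    for bit in bits:
        if bit < 1:
            raise ValueError("bit numbers start at 1")
        value |= 1 << (bit - 1)
    return value


def bits_from_flags(value):
    """Return the sorted list of 1-based bit numbers set in the flag value."""
    return [bit for bit in range(1, value.bit_length() + 1) if value & (1 << (bit - 1))]


# Example: source class and target class for the first word (bits 3 and 4)
# plus source class for the following words (bit 7).
value = flags_from_bits([3, 4, 7])
print(value)
print(bits_from_flags(value))
```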

The following parameters relate to sentence probability. In most pipelines the sentence occurrence is simply set to 1, so they have no effect and can be ignored; they are inherited from GIZA++, and their purpose is unclear:

manlexfactor1 = 0  (When sentence occurrence is set to -1.0, this value will be used)
manlexfactor2 = 0  (When sentence occurrence is set to -2.0, this value will be used)
manlexmaxmultiplicity = 20  ()

Multi-threading

The only parameter that controls multi-threading is ncpus. For example,

  -ncpus 5

enables 5 threads, and the final output is also split into 5 parts:

  prefix.A3.final.part0
...
  prefix.A3.final.part4
mgiza/configure.txt · Last modified: 2009/12/11 02:21 by edwardgao