\require{eqn-number}
Basic formula of n-gram language model:
Two-side class-based language model:
Where c(w_i) is the class of word i.
Alternatively we can use one-side class-base lm:
The difference is the second term, which depends on word instead of word class.
The optimization criterion is to find the best clustering of the words C^{*}, s.t. the likelihood of the corpus is maximized. As displayed at \ref{eqn:exchangeclustering:update}, which is simplified. The derivation can be found at [2]..
Here, N(w) and N(c ) are the number of occurrences of a word w or a class c in the training corpus and N(c,d) denotes the number of occurrences of some word in class c followed by a word in class d in the training corpus. (Simply it is class bigram frequency).
It is based on \ref{eqn:languagemodel:class:oneside}, and the maximum likelihood is given by:
Because in the last summaton of \ref{eqn:languagemodel:derivation} sums over all words and then it is the same as the first term. Finally \ref{eqn:languagemodel:derivation} becomes:
Based on \ref{eqn:languagemodel:simplied}, when exchanging a word w between two classes c and c', only two summands of the second summation of \ref{eqn:languagemodel:simplied} are affected. That is :
So, given a tentative movement c', we have the effect of the second summartion:
The first and third term can actually be computed for all the tentative classes.
Now consider the first part, we actually need to iterate over all the bigrams (v,w) that ends at w. When w is removed from class c, we need to deduct old entries:
and add new entries
Similarly, the first terms in \ref{eqn:updatefunction:part2} and \ref{eqn:updatefunction:part3} only need to be calculated once for every word.
The algorithm is simple:
while not interuppted do for word in vocabulary do for class c do Get new likelihood of exchanging w from c to c'. exchange w from c to c'' which has smallest likelihood.
Remove delta is :
Move delta is :
To update an exchange for word w, we need the following entries:
The first entry N(c ) is dynamic in every iteration, however if distributed, it can be updated only when the batch begins. We leave it that way, every worker will initialize the entries when it loads, but no communication afterwards.
The second entry is trivial, fixed.
The forth entry, while large, is also trivial and fixed.
The third one is somehow tricky. It is similar to first entry but much larger. It needs to be loaded for every worker and requires a lot of memory. This can be solved later, now as our data is relatively small, we just leave it that way.
| Mapper Key | File Position | Reducer Key | Word | Final Key | Word |
|---|---|---|---|---|---|
| Mapper Value | Sentence | Reducer Value | Word count | Final Value | word count |
Generate inital class and class count. N-1 most frequent words have their own classes, all other words have one class. Output should be w:c,N(w).
To save memory , consider replace word with IDs, so we can use trove.
Mapper 1.
Reducer 1.
Mapper 2.
Reducer 2.
Mapper
Reducer
Rerun T3
Therefore, from T3 to T4 is the iteration, we are done.
We also need a side MR, calculate likelihood, which needs two MR jobs
Mapper 1
Reducer 1
Sum all input, that's it
Mapper 2
Reducer 2
The necessary procedures are listed as below:
Part 1. Initialization
public static Submittable Initialization.buildLexicon(String inputCorpus, String workingRoot, String outputId, String outputMap, String outputCorpus, int nclass, String queues) throws IOException
Part 2. Prepare bigram count
public static Submittable BigramCount.buildBigram(String digitCorpus, String temp, String output, String clsMap, int nclass, int reducerCount, String queues) throws IOException
Part 3. Update class
public static Submittable UpdateClass.updateClass(String bigram, String classBigram, String output, int nclass, int reducerCount, String queues) throws IOException
Part 4. Update likelihood
public static Submittable Likelihood.getLikelihood(String bigram, String mapClass, String output, String workingDir, String queues) throws IOException
Here we list all the input/output file/dirs:
| File/Dir | Meaning | Generated By | Used By | Name pattern |
|---|---|---|---|---|
| buildLexicon | ||||
inputCorpus | Input Raw Corpus | External | buildLexicon | Specified externally (local:src, HDFS generalRoot/src-raw-corpus), so as tgt |
workingRoot | workingDir | Temporary directory | N/A | buildLexicon, getLikelihood | Temporary dir, generalRoot/mkcls/[src|tgt]/temp |
outputId | ID-to-word map | buildLexicon | postProcess | generalRoot/mkcls/[src|tgt].idmap |
outputMap | Initial word-class map | buildLexicon | buildBigram(clsMap), getLikelihood(mapClass) | generalRoot/mkcls/[src|tgt]/it0/classmap |
outputCorpus | Digitalized corpus | buildLexicon | buildBigram(digitCorpus) | generalRoot/mkcls/[src|tgt]/dcorpus |
| buildBigram | ||||
digitCorpus | Digitalized corpus | buildLexicon(outputCorpus) | N/A | generalRoot/mkcls/[src|tgt]/dcorpus |
temp | (v,w,c(w)):N(v,w), Bigram counts with class label | buildBigram | updateClass(bigram) | generalRoot/mkcls/[src|tgt]/itn/bigram |
output | (v,c):N(v,c), Word-to-Class bigram | buildBigram | updateClass(classBigram) | generalRoot/mkcls/[src|tgt]/itn/classbigram |
clsMap | Word-class map | buildLexicon(outputMap), updateClass(output) | buildBigram(clsMap), getLikelihood(mapClass) | generalRoot/mkcls/[src|tgt]/itn/classmap |
| updateClass | ||||
bigram | (v,w,c(w)):N(v,w), Bigram counts with class label | buildBigram(temp) | updateClass | generalRoot/mkcls/[src|tgt]/itn/bigram |
classBigram | (v,c):N(v,c), Word-to-Class bigram | buildBigram(output) | updateClass | generalRoot/mkcls/[src|tgt]/itn/classbigram |
output | Word-class map | updateClass(output) | buildBigram(clsMap), getLikelihood(mapClass) | generalRoot/mkcls/[src|tgt]/itn/classmap |
| getLikelihood | ||||
bigram | (v,w,c(w)):N(v,w), Bigram counts with class label | buildBigram(temp) | updateClass | generalRoot/mkcls/[src|tgt]/itn/bigram |
mapClass | Word-class map | updateClass(output) | buildBigram(clsMap), getLikelihood(mapClass) | generalRoot/mkcls/[src|tgt]/itn/classmap |
workingRoot | workingDir | Temporary directory | N/A | buildLexicon, getLikelihood | Temporary dir, generalRoot/mkcls/[src|tgt]/temp |
| PostProcessing (Initialization.mapCLassBackToWord | ||||
mapClass | Word-class map | updateClass(output) | buildBigram(clsMap), getLikelihood(mapClass) | generalRoot/mkcls/[src|tgt]/itn/classmap |
idMap | ID-to-word map | buildLexicon(outputId) | postProcess | generalRoot/mkcls/[src|tgt].idmap |
output | Final class output | postProcess | GIZA Alignment | External |
The convergence is not guaranteed, because each child optimizes on its on, and gradually the conflicts will increase. We are thinking: whether the changes can be propagate to other nodes, therefore even the children do miss some of the entries, they can do better because the bias are smaller on each step.
My expectation is to design a so called Hadoop Message queue, s.t. each child can post their changes to the queue, and children that start late can load all the changes, also the child can get the change as they run. All the changes are asynchronous, so the children will still miss some of the changes, but they will get more, so the bias can be smaller.
First we need to design HadoopMessageQueue.