Make word classes

\require{eqn-number}

1. Basic formuli

Basic formula of n-gram language model:

\label{eqn:languagemodel:basic} P(w_1^m) = \prod_{i=1}^{m}P(w_i|w_1^{i-1}) \approx \prod_{i=1}^{m}p(w_i|w_{i-n+1}^{i-1})

Two-side class-based language model:

\label{eqn:languagemodel:class:twoside} p(w_i|w_{i-n+1}^{i-1}) \approx p_0(w_i|c(w_i)) \cdot p_1(c(w_i)|c(w^{i-1}_{i-n+1}))

Where c(w_i) is the class of word i.

Alternatively we can use one-side class-base lm:

\label{eqn:languagemodel:class:oneside} p(w_i|w_{i-n+1}^{i-1}) \approx p_0(w_i|c(w_i)) \cdot p_1(c(w_i)|w^{i-1}_{i-n+1})

The difference is the second term, which depends on word instead of word class.

2. Exchange Clustering

The optimization criterion is to find the best clustering of the words C^{*}, s.t. the likelihood of the corpus is maximized. As displayed at \ref{eqn:exchangeclustering:update}, which is simplified. The derivation can be found at [2]..

\label{eqn:exchangeclustering:update} \begin{eqnarray} C^{\ast} & = & \arg \max_C \sum_{w\in V} N(w) \cdot \log N(w) + \\
~ & ~ & \sum_{c\in C, d\in C} N(c,d) \cdot \log N(c,d) - \\
~ & ~ & 2 \cdot \sum_{c\in C} N(c ) \cdot \log N(c ) \end{eqnarray}

Here, N(w) and N(c ) are the number of occurrences of a word w or a class c in the training corpus and N(c,d) denotes the number of occurrences of some word in class c followed by a word in class d in the training corpus. (Simply it is class bigram frequency).

2.1 Predictive Exchange Clustering

It is based on \ref{eqn:languagemodel:class:oneside}, and the maximum likelihood is given by:

\label{eqn:languagemodel:class:oneside:freq} \begin{eqnarray} p(w_i|w_{i-n+1}^{i-1}) &\approx & p_0(w_i|c(w_i)) \cdot p_1(c(w_i)|w^{i-1}_{i-n+1}) \\
~& = & \frac{N(w_i)}{N(c(w_i))} \cdot \frac{N(w_{i-1},c(w_i))}{N(w_{i-1})} \end{eqnarray}
And here N(v,c) means a word v followed by some word in class c. Therefore, if we let L(C) be log likelihood function of the predictive class bigram model given a clustering C, then:

\label{eqn:languagemodel:derivation} \begin{eqnarray} L(C) & = & \sum_{w\in V} N(w) \cdot \log p(w|c(w)) \\
~ & ~ & + \sum_{v\in V, c\in C} N(v,c) \cdot \log p(c|v) \\
~ & = & \sum_{w\in V} N(w) \cdot \log \frac{N(w)}{N(c(w))} \\
~ & ~ & + \sum_{v\in V, c\in C} N(v,c) \cdot \log \frac{N(v,c)}{N(v)} \\
~ & = & \sum_{w\in V}N(w) \cdot \log N(w) \\
~ & ~ & - \sum_{w\in V} N(w) \cdot \log N(c(w))\\
~ & ~ & + \sum_{v \in V, c \in C} N(v,c) \cdot \log N(v,c) \\
~ & ~ & - \sum_{v\in V,c \in C} N(v,c) \cdot \log N(v) \end{eqnarray}

Because in the last summaton of \ref{eqn:languagemodel:derivation} sums over all words and then it is the same as the first term. Finally \ref{eqn:languagemodel:derivation} becomes:

\label{eqn:languagemodel:simplied} \begin{eqnarray} L(C) & = & \sum_{v \in V, c \in suc(v)} N(v,c) \cdot \log N(v,c) \\
~ & ~ & - \sum_{w\in V} N(w) \cdot \log N(c(w))\\
\end{eqnarray}

2.2 Exchange Algorithm

Tentative movement and update function

Based on \ref{eqn:languagemodel:simplied}, when exchanging a word w between two classes c and c', only two summands of the second summation of \ref{eqn:languagemodel:simplied} are affected. That is :

  • For class c, N'( c) = N( c) - N(w)
  • For class c', N'(c') = N(c') + N(w)

So, given a tentative movement c', we have the effect of the second summartion:

\label{eqn:updatefunction:part1} \begin{eqnarray} L'(C') & = & L(C) - N( c) \cdot \log N( c) \\
~ & ~ & - N(c') \log N(c') \\
~ & ~ & + (N( c)-N(w)) \log (N( c)- N(w)) \\
~ & ~ & + (N(c')+N(w) \log (N(c')+ N(w')) \end{eqnarray}

The first and third term can actually be computed for all the tentative classes.

Now consider the first part, we actually need to iterate over all the bigrams (v,w) that ends at w. When w is removed from class c, we need to deduct old entries:

\label{eqn:updatefunction:part2} \sum_{(v,w) \in B} N(v,c)\cdot \log N(v,c) + \sum_{(v,w) \in B} N(v,c')\cdot \log N(v,c')

and add new entries

\label{eqn:updatefunction:part3} \begin{eqnarray} ~ & ~ & \sum_{(v,w) \in B} (N(v,c)- N(v,w))\cdot \log (N(v,c)-N(v,w)) \\
~& ~& + \sum_{(v,w) \in B} (N(v,c')+N(v,w))\cdot \log (N(v,c')+N(v,w)) \end{eqnarray}

Similarly, the first terms in \ref{eqn:updatefunction:part2} and \ref{eqn:updatefunction:part3} only need to be calculated once for every word.

Algorithm

The algorithm is simple:

while not interuppted do
   for word in vocabulary do
        for class c do
           Get new likelihood of exchanging w from c to c'.
   exchange w from c to c'' which has smallest likelihood.

Remove and Move

Remove delta is :

\label{eqn:updatefunction:remove} \begin{eqnarray} \Delta_R & = & + N( c) \cdot \log N( c) \\
~ & ~ & - (N( c)-N(w)) \log (N( c)- N(w)) \\
~ & ~ & - \sum_{(v,w) \in B} N(v,c)\cdot \log N(v,c) \\
~ & ~ & + \sum_{(v,w) \in B} (N(v,c)- N(v,w))\cdot \log (N(v,c)-N(v,w)) \\
\end{eqnarray}

Move delta is :

\label{eqn:updatefunction:move} \begin{eqnarray} \Delta_M(c') & = & N(c') \log N(c') \\
~ & ~ & - (N(c')+N(w) \log (N(c')+ N(w')) \\
~ & ~ & - \sum_{(v,w) \in B} N(v,c')\cdot \log N(v,c') \\
~ & ~ & + \sum_{(v,w) \in B} \sum_{(v,w) \in B} (N(v,c')+N(v,w))\cdot \log (N(v,c')+N(v,w)) \\
\end{eqnarray}

3. Distributed Algorithm

3.1 Analysis of data needed

To update an exchange for word w, we need the following entries:

  1. N( c), \forall c
  2. N(w)
  3. N(v,c) \forall v,c
  4. N(v,w) \forall v

The first entry N(c ) is dynamic in every iteration, however if distributed, it can be updated only when the batch begins. We leave it that way, every worker will initialize the entries when it loads, but no communication afterwards.

The second entry is trivial, fixed.

The forth entry, while large, is also trivial and fixed.

The third one is somehow tricky. It is similar to first entry but much larger. It needs to be loaded for every worker and requires a lot of memory. This can be solved later, now as our data is relatively small, we just leave it that way.

3.2 Definition of MapReduce tasks

T1. Intialization: Word Count

Mapper Key File Position Reducer Key Word Final Key Word
Mapper Value Sentence Reducer Value Word count Final Value word count

T2. (Sequential) Initialization: Class/Class count

Generate inital class and class count. N-1 most frequent words have their own classes, all other words have one class. Output should be w:c,N(w).

To save memory , consider replace word with IDs, so we can use trove.

T3. Generate bigram counts

Mapper 1.

  • Input the sentence
  • Output (v,w):N(v,w)

Reducer 1.

  • Preload class difinition
  • Input (v,w):N(v,w)
  • Output (v,w,c(w)):N(v,w)

Mapper 2.

  • Input (v,w,c(w)):N(v,w)
  • Output (v,c):N(v,w)

Reducer 2.

  • Input (v,c):N(v,w))
  • Output (v,c), N(v,c)

T4. Update

Mapper

  • Input (v,w,c(w)):N(v,w)
  • Output w:[v,c(w),N(v,w)]

Reducer

  • Preload N(v,c), N( c), N( c) can be generated as loading N(v,c)
  • Input: w:[v,c(w),N(v,w)]
  • Output: w:c',N(w), where c' is the new class. Also update N(v,c), N( c) as it goes on

T5. Regenerate N(v,c)

Rerun T3

Therefore, from T3 to T4 is the iteration, we are done.

TA. Calculate real likelihood

We also need a side MR, calculate likelihood, which needs two MR jobs

Mapper 1

  • Input: N(v,c)
  • Output: N(v,c)\cdot N(v,c)

Reducer 1

Sum all input, that's it

Mapper 2

  • Input : w : c, N(w)
  • Output: c: N(w)

Reducer 2

  • Input : c : N(w), \forall w
  • Output : \sum N(w) \cdot \log N(c(w))

3. Input/Output Structure

3.1 Procedures

The necessary procedures are listed as below:

Part 1. Initialization

public static Submittable  
       Initialization.buildLexicon(String inputCorpus, 
                       String workingRoot, 
                       String outputId, 
                       String outputMap, 
                       String outputCorpus, 
                       int nclass, 
                       String queues) throws IOException

Part 2. Prepare bigram count

public static Submittable  
       BigramCount.buildBigram(String digitCorpus, 
                       String temp, 
                       String output, 
                       String clsMap,
                       int nclass, 
                       int reducerCount, 
                       String queues) throws IOException

Part 3. Update class

public static Submittable  
       UpdateClass.updateClass(String bigram, 
                       String classBigram, 
                       String output, 
                       int nclass, 
                       int reducerCount, 
                       String queues) throws IOException

Part 4. Update likelihood

public static Submittable  
       Likelihood.getLikelihood(String bigram, 
                       String mapClass, 
                       String output, 
                       String workingDir, 
                       String queues) throws IOException

3.2 Relation table

Here we list all the input/output file/dirs:

File/Dir Meaning Generated By Used By Name pattern
buildLexicon
inputCorpus Input Raw Corpus External buildLexicon Specified externally (local:src, HDFS generalRoot/src-raw-corpus), so as tgt
workingRoot | workingDir Temporary directory N/A buildLexicon, getLikelihood Temporary dir, generalRoot/mkcls/[src|tgt]/temp
outputId ID-to-word map buildLexicon postProcess generalRoot/mkcls/[src|tgt].idmap
outputMap Initial word-class map buildLexicon buildBigram(clsMap), getLikelihood(mapClass) generalRoot/mkcls/[src|tgt]/it0/classmap
outputCorpus Digitalized corpus buildLexicon buildBigram(digitCorpus) generalRoot/mkcls/[src|tgt]/dcorpus
buildBigram
digitCorpus Digitalized corpus buildLexicon(outputCorpus) N/A generalRoot/mkcls/[src|tgt]/dcorpus
temp (v,w,c(w)):N(v,w), Bigram counts with class label buildBigram updateClass(bigram) generalRoot/mkcls/[src|tgt]/itn/bigram
output (v,c):N(v,c), Word-to-Class bigram buildBigram updateClass(classBigram) generalRoot/mkcls/[src|tgt]/itn/classbigram
clsMap Word-class map buildLexicon(outputMap), updateClass(output) buildBigram(clsMap), getLikelihood(mapClass) generalRoot/mkcls/[src|tgt]/itn/classmap
updateClass
bigram (v,w,c(w)):N(v,w), Bigram counts with class label buildBigram(temp) updateClass generalRoot/mkcls/[src|tgt]/itn/bigram
classBigram (v,c):N(v,c), Word-to-Class bigram buildBigram(output) updateClass generalRoot/mkcls/[src|tgt]/itn/classbigram
output Word-class map updateClass(output) buildBigram(clsMap), getLikelihood(mapClass) generalRoot/mkcls/[src|tgt]/itn/classmap
getLikelihood
bigram (v,w,c(w)):N(v,w), Bigram counts with class label buildBigram(temp) updateClass generalRoot/mkcls/[src|tgt]/itn/bigram
mapClass Word-class map updateClass(output) buildBigram(clsMap), getLikelihood(mapClass) generalRoot/mkcls/[src|tgt]/itn/classmap
workingRoot | workingDir Temporary directory N/A buildLexicon, getLikelihood Temporary dir, generalRoot/mkcls/[src|tgt]/temp
PostProcessing (Initialization.mapCLassBackToWord
mapClass Word-class map updateClass(output) buildBigram(clsMap), getLikelihood(mapClass) generalRoot/mkcls/[src|tgt]/itn/classmap
idMap ID-to-word map buildLexicon(outputId) postProcess generalRoot/mkcls/[src|tgt].idmap
output Final class output postProcess GIZA Alignment External

Problem: Convergence

The convergence is not guaranteed, because each child optimizes on its on, and gradually the conflicts will increase. We are thinking: whether the changes can be propagate to other nodes, therefore even the children do miss some of the entries, they can do better because the bias are smaller on each step.

My expectation is to design a so called Hadoop Message queue, s.t. each child can post their changes to the queue, and children that start late can load all the changes, also the child can get the change as they run. All the changes are asynchronous, so the children will still miss some of the changes, but they will get more, so the bias can be smaller.

First we need to design HadoopMessageQueue.

Reference

[1] Jakob Uszkoreit and Thorsten Brants, 2008, Distributed Word Clustering for Large Scale Class-Based Language Modeling in Machine Translation bib pdf

[2] Sven Martin, Jörg Liermann and Hermann Ney, 1998, Algorithms for bigram and trigram word clustering bib pdf

indepth/mkcls.txt · Last modified: 2010/01/24 14:14 by edwardgao
CC Attribution-Noncommercial-Share Alike 3.0 Unported www.chimeric.de Valid CSS Driven by DokuWiki do yourself a favour and use a real browser - get firefox!! Recent changes RSS feed Valid XHTML 1.0