MGIZA

MGIZA++ is a multi-threaded word alignment tool based on GIZA++. It extends GIZA++ in multiple ways:

Multi-threading

MGIZA++ makes efficient use of multi-core platforms. A quad-core machine typically achieves about a three-fold speedup over single-threaded GIZA++.

Memory optimization

By eliminating duplicated tables, MGIZA++ uses considerably less memory than GIZA++.

Resume training

MGIZA++ can resume training from any stage. For example, you can reuse previously trained models and continue training directly from IBM Model 4 instead of starting over from Model 1.

Integrated with Chaski

MGIZA++ can be integrated into Chaski and run on clusters, which gives an even larger speedup.

Download

The latest version of MGIZA++ can be downloaded here:

Version          Date        Link      Comment                                                         Release Note
Version 0.6.3.1  2010-01-23  Download  Minor code cleanup; downloads moved to SourceForge              Release Note
Version 0.6.3    2010-01-11  Download  Memory optimization and bug fix                                 Release Note
Version 0.6.2    2009-12-07  Download  Minor interface change to keep compatibility with Chaski 0.2.2  Release Note
Version 0.6.1    2009-11-17  Download  Unnecessary dependencies removed                                Release Note
Version 0.6      2009-11-10  Download

Installation

To compile MGIZA++ you need the following packages installed:

  1. Berkeley DB (libdb)
  2. Berkeley DB++ (libdb++)
  3. Boost library (regex, string)

As of version 0.6.1, Berkeley DB is no longer required, but the Boost library still is. Once the dependencies are installed, go to the source directory and run:

  ./configure --prefix=${QMT_HOME}
  make
  make install

If you want to use MGIZA++ with Chaski, you need to add the environment variable QMT_HOME to your .bashrc.

For the Boost library, you can either download it from http://www.boost.org or install the header package from your Linux distribution.

Usage

Basic usage of MGIZA++ is straightforward if you already know how to run GIZA++. MGIZA++ accepts GIZA++'s parameters, so you can run:

  ${QMT_HOME}/bin/mgiza  -ncpu 5 [ALL-YOUR-GIZA-PARAMETERS]

to run mgiza with five threads.
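
If you launch mgiza from a script, the command line above can be assembled programmatically. The following is a minimal Python sketch, not part of MGIZA++: the function names build_mgiza_command and run_mgiza are hypothetical, and the only things taken from this page are the ${QMT_HOME}/bin/mgiza path and the -ncpu option. Your usual GIZA++ parameters are passed through unchanged.

```python
import os
import subprocess


def build_mgiza_command(giza_args, ncpu=5, qmt_home=None):
    """Return the argument list for an mgiza run with `ncpu` threads.

    Hypothetical helper: only the binary location and -ncpu come from this page.
    """
    # QMT_HOME is the prefix passed to ./configure in the installation step.
    home = qmt_home or os.environ.get("QMT_HOME")
    if not home:
        raise ValueError("QMT_HOME is not set and no qmt_home was given")
    binary = os.path.join(home, "bin", "mgiza")
    # -ncpu goes first, followed by the unchanged GIZA++ parameters.
    return [binary, "-ncpu", str(ncpu)] + list(giza_args)


def run_mgiza(giza_args, ncpu=5, qmt_home=None):
    """Run mgiza and raise CalledProcessError on a non-zero exit status."""
    subprocess.run(build_mgiza_command(giza_args, ncpu, qmt_home), check=True)
```

Keeping the command construction separate from the call that executes it makes it easy to print or log the exact command before a long training run.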

The alignment output of MGIZA++ differs somewhat from that of GIZA++: with n threads, the alignment is written to n part files:

prefix.A3.final.part0
prefix.A3.final.part1
...
prefix.A3.final.part(n-1)
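
Before merging, it can be useful to confirm that every part file was actually written. The sketch below only encodes the naming pattern listed above; the helper names expected_part_files and missing_part_files are hypothetical and not part of MGIZA++.

```python
import os


def expected_part_files(prefix, n_threads):
    """List the part files an n-thread run writes: part0 .. part(n-1)."""
    return [f"{prefix}.A3.final.part{i}" for i in range(n_threads)]


def missing_part_files(prefix, n_threads):
    """Return the expected part files that do not exist on disk."""
    return [p for p in expected_part_files(prefix, n_threads)
            if not os.path.exists(p)]
```

An empty list from missing_part_files means all n parts are present.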

To combine the alignments you need to run:

${QMT_HOME}/scripts/merge_alignment.py ${prefix}.A3.final.part* > ${prefix}.A3.final
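
To illustrate what combining the parts involves, here is a minimal Python sketch. It is not the bundled merge_alignment.py and may differ from it. It assumes each alignment record is three lines whose first line starts with "# Sentence pair (N)", as in GIZA++ A3 files, and it restores the original order by sorting on N. All function names are hypothetical.

```python
import re
import sys

# Assumed header form of each 3-line record: "# Sentence pair (N) ..."
HEADER = re.compile(r"^# Sentence pair \((\d+)\)")


def read_records(lines):
    """Yield (sentence_number, [header, target, alignment]) per 3-line record."""
    lines = [ln.rstrip("\n") for ln in lines]
    # A trailing incomplete record (fewer than 3 lines) is ignored.
    for i in range(0, len(lines) - 2, 3):
        match = HEADER.match(lines[i])
        if match is None:
            raise ValueError(f"unexpected header line: {lines[i]!r}")
        yield int(match.group(1)), lines[i:i + 3]


def merge_records(parts):
    """Merge several iterables of lines into one list ordered by sentence number."""
    records = []
    for part in parts:
        records.extend(read_records(part))
    records.sort(key=lambda rec: rec[0])
    return [line for _, block in records for line in block]


def merge_part_files(paths):
    """Read each part file and return the merged lines."""
    parts = []
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            parts.append(handle.readlines())
    return merge_records(parts)


if __name__ == "__main__":
    for merged_line in merge_part_files(sys.argv[1:]):
        print(merged_line)
```

Because the records are sorted by sentence number, the order in which the part files are given does not matter, which avoids surprises when a shell glob lists part10 before part2.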

For advanced usage please refer to the following “HOWTOs”

mgiza/overview.txt · Last modified: 2010/01/23 06:15 by edwardgao