Configuration of Chaski

Chaski as a Jar

Chaski provides several useful scripts to make training easy; these will be described later. Leaving the scripts aside, Chaski is packed into a single jar file, which can only be run with Hadoop version 0.20.1 or later. The jar relies on two supporting files:

  1. log4j.properties: the configuration file for the Java logging system
  2. pkglist.txt: the list of available Chaski actions and their entry points

Chaski searches for these files first in the location given by the Chaski_HOME environment variable, and then in the current directory.
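
The lookup order described above can be sketched in Python. This is only an illustration, not Chaski's actual code: the function name find_support_file is hypothetical, and treating the value of Chaski_HOME as a directory path is an assumption.

```python
import os
from pathlib import Path


def find_support_file(name, env=None, cwd=None):
    """Return the first existing path for `name`, or None if it is not found.

    Hypothetical helper: checks the directory named by Chaski_HOME first
    (assumed to hold a directory path), then the current directory.
    """
    env = os.environ if env is None else env
    cwd = Path.cwd() if cwd is None else Path(cwd)

    candidates = []
    home = env.get("Chaski_HOME")
    if home:
        candidates.append(Path(home) / name)
    candidates.append(cwd / name)

    for path in candidates:
        if path.is_file():
            return path
    return None
```

Calling find_support_file("pkglist.txt") with no other arguments uses the real environment and working directory; the env and cwd arguments exist only to make the sketch easy to try out.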

Actions

The execution of Chaski is divided into several actions. Running

 hadoop jar chaski-0.0-latest.jar

will display a list of actions:

[extract]        Extract phrase and count occurance of each phrase
[buildlex]       Build lexicon table from alignment file
[mkcls]          Make word classes
[postproc]       Post process the phrase table
[preproc]        Merge moses alignment format and corpus into a single file
[score]          Score phrases using lexicon probability and extracted phrases
[walign]         Perform word alignment
[waprep]         Prepare corpus for word alignment

To run the mkcls action, type:

hadoop jar chaski-0.0-latest.jar mkcls [parameters]

Supplying parameters

There are two ways of supplying parameters to Chaski. You can write a configuration file and give its name after the -p switch, or specify the values directly on the command line.

The configuration file is in the following format:

....
# Output directory of Moses Phrase table
moses-p=Linh-Best-Retrain-Full/moses-phrase
....

Lines starting with # are comments, and parameters are given in key=value format.
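
As a rough illustration of this format, the following Python sketch reads key=value lines and skips comments. It is not Chaski's own parser: the function name parse_config is hypothetical, and skipping blank lines and lines without an equals sign is an assumption.

```python
def parse_config(text):
    """Parse key=value lines, skipping blank lines and # comments."""
    params = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # assumption: lines without '=' are ignored
        params[key.strip()] = value.strip()
    return params


sample = """\
# Output directory of Moses Phrase table
moses-p=Linh-Best-Retrain-Full/moses-phrase
"""
print(parse_config(sample))
# prints {'moses-p': 'Linh-Best-Retrain-Full/moses-phrase'}
```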

When specifying parameters on the command line, put two dashes before the key. For example:

hadoop jar chaski-0.0-latest.jar mkcls --moses-p Linh-Best-Retrain-Full/moses-phrase

has the same effect as the configuration file entry above.

As a general rule, parameters given on the command line override those in the configuration file.
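
The override rule can be sketched in Python as a simple merge in which command-line values are applied last. The function names parse_cli and merge_params are hypothetical, the sketch handles only double-dash keys (not single-dash switches such as -p or -h), and treating a key with no following value as a boolean switch is an assumption rather than documented Chaski behavior.

```python
def parse_cli(args):
    """Turn ['--key', 'value', '--flag'] into a dict.

    Assumption: a key followed by another --key (or by nothing) is a
    boolean switch and is recorded as 'true'.
    """
    params = {}
    i = 0
    while i < len(args):
        token = args[i]
        if not token.startswith("--"):
            raise ValueError(f"expected --key, got {token!r}")
        key = token[2:]
        if i + 1 < len(args) and not args[i + 1].startswith("--"):
            params[key] = args[i + 1]
            i += 2
        else:
            params[key] = "true"
            i += 1
    return params


def merge_params(file_params, cli_params):
    """Command-line values win over configuration-file values."""
    merged = dict(file_params)
    merged.update(cli_params)
    return merged


# Illustrative values only: heap and num-iter are mkcls parameters,
# but the numbers here are made up for the example.
from_file = {"heap": "1024m", "num-iter": "5"}
from_cli = parse_cli(["--heap", "2048m", "--verbose"])
print(merge_params(from_file, from_cli))
# prints {'heap': '2048m', 'num-iter': '5', 'verbose': 'true'}
```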

Get help

Typing

  hadoop jar chaski-0.0-latest.jar <action> -h

displays all available parameters together with their help information. For example, typing

  hadoop jar chaski-0.0-latest.jar mkcls -h

will show you the following help information.


===============================================================================
Package : mkcls
-------------------------------------------------------------------------------
Make word classes
===============================================================================
Parameter list:
-------------------------------------------------------------------------------
--src                  String(R)        The path of source courpus file
  --mkcls.source
  --source
--tgt                  String(R)        The path of target courpus file
  --mkcls.target
  --target
--root                 String(R)        The root directory for word alignment
  --mkcls.rootdir
--verbose              boolean          The detail information?
  --vbs
--overwrite            boolean          Whether the output director will be
  --mkcls.overwrite                      overwritten if exists, by default it
                                         is false, and program will die if it
                                         exists
--nocopy               boolean          If specified, we will re-use the input
  --mkcls.nocopy                         raw corpus
--num-reducer          int              Total number of available reducer slots
  --mkcls.nr
--num-iter             int              Total number of iterations
  --mkcls.ni
--num-classes          int              Total number of class
  --mkcls.nc
--queues,--qu          String           Specify the queue that the MR-Job will
  --mkcls.queue                          be submitted, if not specified, a
                                         warning will be displayed but the
                                         execution will continue.
--class-side           int              The direction of class, by default it
  --mkcls.side                           is 2 which means both s2t and t2s,
                                         specify 0 for s2t only and 1 for t2s
                                         only
--heap,--mkcls.hp      String           Specify how much memory should each
  --mkcls.heap                           children task (Map/Reduce) use,
                                         default is 1024m
--highcocu             boolean          Whether the driver will seek to
  --highcocurrency                       parallize as much as possible, for
                                         example, normalize different tables in
                                         parallel
-------------------------------------------------------------------------------

The first column lists the keys. One key may have multiple aliases, so that different actions can share parameters in the same configuration file while you can still specify different values for different packages. As a general rule, aliases have higher priority than names. For example, if both heap and mkcls.heap are present in a configuration file, the mkcls action uses the value of mkcls.heap rather than that of heap. Meanwhile, other actions may also have a parameter named heap but no alias mkcls.heap, and they will use the value of heap.
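
The alias rule can be illustrated with a short Python sketch. The function name resolve is hypothetical, and checking multiple aliases in the order they are listed is an assumption, since the text above only states that aliases take priority over names.

```python
def resolve(params, name, aliases=()):
    """Return the value for a parameter, preferring aliases over the name.

    Assumption: aliases are tried in the order given, then the plain name.
    """
    for key in (*aliases, name):
        if key in params:
            return params[key]
    return None


config = {"heap": "1024m", "mkcls.heap": "2048m"}

# The mkcls action knows the alias, so it picks the alias value.
print(resolve(config, "heap", aliases=("mkcls.heap",)))  # prints 2048m

# An action without that alias falls back to the shared name.
print(resolve(config, "heap"))  # prints 1024m
```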

Configuration for actions

TO BE WRITTEN

chaski/configure.txt · Last modified: 2009/11/28 11:41 by edwardgao