Chaski provides several useful scripts that make training easy; they are described later. Leaving the scripts aside, Chaski is packaged as a single jar file. It can only be executed with Hadoop version 0.20.1 or later.
log4j.properties: the configuration file for the Java logging system
pkglist.txt: lists the available actions of Chaski and their entry points
Chaski searches for these files first in the directory given by the Chaski_HOME environment variable, and then in the current directory.
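The lookup order can be sketched in shell. This is only an illustration of the rule stated above, not Chaski's actual code: the helper name find_support_file is hypothetical, and it assumes that Chaski_HOME names a directory.

```shell
# Sketch of the lookup order (an assumption about the details, not Chaski's
# own code): check the directory named by $Chaski_HOME first, then the
# current directory.
# find_support_file NAME: print the path of NAME, or return 1 if not found.
find_support_file() {
    name="$1"
    if [ -n "${Chaski_HOME:-}" ] && [ -f "$Chaski_HOME/$name" ]; then
        printf '%s\n' "$Chaski_HOME/$name"
    elif [ -f "./$name" ]; then
        printf '%s\n' "./$name"
    else
        return 1
    fi
}

# Usage: report where each of the two support files would be picked up.
for f in log4j.properties pkglist.txt; do
    find_support_file "$f" || echo "$f: not found" >&2
done
```

Running the loop before starting a job is a quick way to check which copy of each file would be used when both locations contain one.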
The execution of Chaski is divided into several actions. Running
hadoop jar chaski-0.0-latest.jar
will display a list of actions:
[extract] Extract phrase and count occurance of each phrase
[buildlex] Build lexicon table from alignment file
[mkcls] Make word classes
[postproc] Post process the phrase table
[preproc] Merge moses alignment format and corpus into a single file
[score] Score phrases using lexicon probability and extracted phrases
[walign] Perform word alignment
[waprep] Prepare corpus for word alignment
To run the mkcls action, type:
hadoop jar chaski-0.0-latest.jar mkcls [parameters]
There are two ways of supplying parameters to Chaski. You can write a configuration file and specify its name after the -p switch, or specify the values directly on the command line.
The configuration file is in the following format:
....
# Output directory of Moses Phrase table
moses-p=Linh-Best-Retrain-Full/moses-phrase
....
Lines starting with # are comments, and parameters are in key=value format.
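To make the file format concrete, here is a minimal shell sketch that reads such a file and lists its parameters. The function name list_params is hypothetical, and the handling of blank lines (skipped) and of whitespace around the equals sign (kept as-is) is an assumption, since the format description does not cover those cases.

```shell
# list_params FILE: print "key<TAB>value" for every parameter line in FILE.
list_params() {
    while IFS= read -r line || [ -n "$line" ]; do
        case "$line" in
            ''|'#'*) ;;   # skip blank lines (assumed) and comment lines
            *=*) printf '%s\t%s\n' "${line%%=*}" "${line#*=}" ;;
        esac
    done < "$1"
}

# Usage with a small sample file (the temporary file is only for illustration).
conf=$(mktemp)
cat > "$conf" <<'EOF'
# Output directory of Moses Phrase table
moses-p=Linh-Best-Retrain-Full/moses-phrase
EOF
list_params "$conf"
```

The key is everything before the first equals sign, so a value may itself contain further equals signs.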
When specifying parameters on the command line, put two dashes before the key. For example:
hadoop jar chaski-0.0-latest.jar mkcls --moses-p Linh-Best-Retrain-Full/moses-phrase
will have the same effect as the configuration file above.
As a general rule, parameters given on the command line override those in the configuration file.
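The override rule can be illustrated with a short shell sketch. It is not Chaski's implementation: resolve_param is a hypothetical helper, the sample values 80 and 50 are made up, and the sketch assumes every command-line key is followed by its value.

```shell
# resolve_param CONFIG_FILE KEY [ARGS...]: print the effective value of KEY.
# The value from the configuration file is used unless "--KEY value"
# appears among ARGS, in which case the command-line value wins.
resolve_param() {
    config="$1"; key="$2"; shift 2
    value=""
    # 1. Take the value from the configuration file, if present.
    while IFS= read -r line || [ -n "$line" ]; do
        case "$line" in
            '#'*) ;;                          # comment line
            "$key="*) value="${line#*=}" ;;
        esac
    done < "$config"
    # 2. A command-line occurrence (--key value) overrides it.
    while [ "$#" -gt 0 ]; do
        if [ "$1" = "--$key" ] && [ "$#" -ge 2 ]; then
            value="$2"
            shift
        fi
        shift
    done
    printf '%s\n' "$value"
}

# Usage: the file says 80 classes, the command line says 50.
conf=$(mktemp)
printf '# Total number of class\nnum-classes=80\n' > "$conf"
resolve_param "$conf" num-classes                    # value from the file
resolve_param "$conf" num-classes --num-classes 50   # command line wins
```

This makes it practical to keep one shared configuration file and change a single value for one run without editing the file.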
Typing
hadoop jar chaski-0.0-latest.jar <action> -h
will display all available parameters with help information. E.g. typing
hadoop jar chaski-0.0-latest.jar mkcls -h
will show you the following help information.
===============================================================================
Package : mkcls
-------------------------------------------------------------------------------
Make word classes
===============================================================================
Parameter list:
-------------------------------------------------------------------------------
--src               String(R)  The path of source courpus file
--mkcls.source
--source
--tgt               String(R)  The path of target courpus file
--mkcls.target
--target
--root              String(R)  The root directory for word alignment
--mkcls.rootdir
--verbose           boolean    The detail information?
--vbs
--overwrite         boolean    Whether the output director will be
--mkcls.overwrite              overwritten if exists, by default it
                               is false, and program will die if it
                               exists
--nocopy            boolean    If specified, we will re-use the input
--mkcls.nocopy                 raw corpus
--num-reducer       int        Total number of available reducer slots
--mkcls.nr
--num-iter          int        Total number of iterations
--mkcls.ni
--num-classes       int        Total number of class
--mkcls.nc
--queues,--qu       String     Specify the queue that the MR-Job will
--mkcls.queue                  be submitted, if not specified, a
                               warning will be displayed but the
                               execution will continue.
--class-side        int        The direction of class, by default it
--mkcls.side                   is 2 which means both s2t and t2s,
                               specify 0 for s2t only and 1 for t2s
                               only
--heap,--mkcls.hp   String     Specify how much memory should each
--mkcls.heap                   children task (Map/Reduce) use,
                               default is 1024m
--highcocu          boolean    Whether the driver will seek to
--highcocurrency               parallize as much as possible, for
                               example, normalize different tables in
                               parallel
-------------------------------------------------------------------------------
The first column lists the keys. One key may have multiple aliases, so that different actions can share parameters in the same configuration file while you can still specify different values for different packages. As a general rule, aliases have higher priority than names. For example, if both heap and mkcls.heap are present in a configuration file, the mkcls action uses the value of mkcls.heap rather than that of heap. Meanwhile, other actions may also have a parameter named heap but no alias mkcls.heap, and they use the value of heap.
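The alias rule can be sketched as a two-step lookup: try the action-specific alias first and fall back to the shared name. The helpers lookup and resolve_alias are hypothetical, the memory values are made up, and the sketch assumes the first matching line in the file is the one that counts.

```shell
# lookup FILE KEY: print the value of KEY from a key=value file;
# return 1 if the key is absent.
lookup() {
    file="$1"; key="$2"
    while IFS= read -r line || [ -n "$line" ]; do
        case "$line" in
            '#'*) ;;                                   # comment line
            "$key="*) printf '%s\n' "${line#*=}"; return 0 ;;
        esac
    done < "$file"
    return 1
}

# resolve_alias FILE ALIAS NAME: prefer the alias, fall back to the name.
resolve_alias() {
    conf_file="$1"; alias_key="$2"; name_key="$3"
    lookup "$conf_file" "$alias_key" || lookup "$conf_file" "$name_key"
}

# Usage: one shared configuration file with a general and a mkcls-specific value.
conf=$(mktemp)
printf 'heap=1024m\nmkcls.heap=2048m\n' > "$conf"
resolve_alias "$conf" mkcls.heap heap   # what the mkcls action would use
lookup "$conf" heap                     # what an action without the alias would use
```

With this file, the mkcls action would get the value under mkcls.heap, while an action that only knows heap would get the shared value.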
TO BE WRITTEN