Hadoop Daemon is a simple interface that helps you run ANY program using Hadoop. In other words, it makes Hadoop behave more like Condor or Maui/Torque, which may appear to be bad or even stupid. But sometimes you need it, because going through the MapReduce framework may just mess up your files. For example, if you need to preserve the line order of your files, you have to either sort the file after the first M-R task or add a line number to each line, both of which seem equally stupid to me.
Sometimes you have no other choice, because Hadoop is the only way to run your job; that is where Hadoop Daemon may help. Just flash back to the Condor age: you split the input, put it on NFS, AFS, or whatever shared file system, wrote your scripts plus the Condor configuration files, and then submitted everything to Condor. Condor would find a machine for you and run your job.
That is what Hadoop Daemon does. It uses HDFS as the shared file system (to use it, you can either wrap the HDFS library into your program or use functionality similar to the distributed cache, which prepares the data for you). It runs as a mapper on some node and waits for your command. You submit the command to the control server, and it takes care of finding a free node to finish your job.
I admit it is dirty, and most of the functionality it provides can also be achieved with streaming. But it saves the time of rewriting your code or scripts to make streaming happy, and it also saves your brain some computational effort.
Download it from the temporary SVN repository:
svn://128.2.53.45/home/qing/svnroot/trunk/HadoopDaemon
This SVN repository is only temporary; if it breaks, please let me know. Also, because the code is still under development, no release has been made, so the current version may be broken. If so, please let me know.
After downloading it, make sure you have ant installed, and then run
ant
in the checkout directory.
If nothing goes wrong, the jar file will be in dist/lib/hadoopdaemon-2.0-latest.jar.
You first have to allocate a cluster using HOD, or find out where your cluster configuration file resides.
To allocate a cluster with HOD, run:
hod allocate -d ${WORKDIR} -n $SIZE
After that, you need to modify run-server.sh:
#!/usr/bin/env bash
# Before running, put your HADOOPDAEMON directory here
HADOOPDAEMON=
# If you are going to run only one server, the following ports are OK,
# but if you are going to run more than one on a single gateway, you can
# either change the following parameters or pass them as command-line parameters
CONTROLPORT=9999
SERVICEPORT=9998
#CONTROLPORT=$3
#SERVICEPORT=$4
hadoop --config "$1" jar "${HADOOPDAEMON}/dist/lib/hadoopdaemon-2.0-latest.jar" "$2" "$(hostname)" "${CONTROLPORT}" "${SERVICEPORT}"
Set the correct directory for HADOOPDAEMON and choose the ports; just make sure the ports are not already in use. The client only cares about the CONTROLPORT.
Also, modify run-client.sh in the same way.
After that, start the server by calling
./run-server.sh ${WORKDIR} ${NUMDAEMON} [CONTROLPORT] [SERVICEPORT]
NUMDAEMON can be anywhere from 1 to (SIZE-1)*2.
After the server starts, you can call
./run-client.sh localhost ${CONTROLPORT} command-file
to submit the command file. The command file format is explained below.
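The NUMDAEMON limit is easy to get wrong by one, so a small guard before starting the server may help. This is only a sketch: max_daemons and check_numdaemon are hypothetical helper names, not part of Hadoop Daemon, and the formula is simply the (SIZE-1)*2 limit stated above.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of Hadoop Daemon) that encode the limit
# described in the text: 1 <= NUMDAEMON <= (SIZE-1)*2.

# Print the largest NUMDAEMON allowed for a HOD allocation of SIZE nodes.
max_daemons() {
    local size=$1
    echo $(( (size - 1) * 2 ))
}

# Return 0 if NUMDAEMON is within range for SIZE, non-zero otherwise.
check_numdaemon() {
    local size=$1 num=$2
    local max
    max=$(max_daemons "$size")
    [ "$num" -ge 1 ] && [ "$num" -le "$max" ]
}

# Possible usage, assuming SIZE, NUMDAEMON and WORKDIR are already set:
#   if check_numdaemon "$SIZE" "$NUMDAEMON"; then
#       ./run-server.sh "$WORKDIR" "$NUMDAEMON"
#   else
#       echo "NUMDAEMON must be between 1 and $(max_daemons "$SIZE")" >&2
#   fi
```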
The command file consists of three parts: the environment preparation, the actual command sequence, and the outputs.
Below is a sample command file that runs the Stanford Chinese segmenter:
< /user/qing/Software/StanfordParser/stanford-chinese-segmenter-2008-05-21.tar.gz tmp/stanford-chinese-segmenter-2008-05-21.tar.gz
< /user/qing/Chinese_Segementer_Hadoop_Daemon_Test/ch.1 tmp/ch.1
! tar xzf tmp/stanford-chinese-segmenter-2008-05-21.tar.gz -C tmp
! tmp/stanford-chinese-segmenter-2008-05-21/segment.sh ctb tmp/ch.1 UTF-8 0 {#} {#} {#} tmp/ch.1.out
> tmp/ch.1.out /user/qing/Chinese_Segementer_Hadoop_Daemon_Test/ch.1.out
COMMIT
As you can see, every line except the final COMMIT line starts with a symbol that tells which of the three parts it belongs to.
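To make the role of the leading symbol concrete, here is a minimal sketch that labels each line of a command file by the part it belongs to. The function name classify_line and the labels it prints are made up for illustration; they are not part of Hadoop Daemon.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Hadoop Daemon): report which part of a
# command file a line belongs to, based only on its leading symbol.
classify_line() {
    case "$1" in
        "COMMIT") echo "commit" ;;    # final line of the file
        "<"*)     echo "prepare" ;;   # environment preparation: HDFS -> local disk
        "!"*)     echo "command" ;;   # a command to run
        ">"*)     echo "output" ;;    # outputs: local disk -> HDFS
        *)        echo "unknown" ;;
    esac
}

# Possible usage: label every line of a file named command-file.
#   while IFS= read -r line; do
#       printf '%s\t%s\n' "$(classify_line "$line")" "$line"
#   done < command-file
```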
The environment preparation commands start with "<", the less-than symbol, and take two parameters: the first is a file or directory on HDFS, and the second is the corresponding path on the local disk.
For example,
< /user/qing/Software/StanfordParser/stanford-chinese-segmenter-2008-05-21.tar.gz tmp/stanford-chinese-segmenter-2008-05-21.tar.gz
will cause the child to copy
/user/qing/Software/StanfordParser/stanford-chinese-segmenter-2008-05-21.tar.gz
from HDFS and put it at
tmp/stanford-chinese-segmenter-2008-05-21.tar.gz
We STRONGLY SUGGEST that you always copy files to the tmp directory, or at least to the current directory, because it will be cleaned up after your jobs are done. Files placed anywhere else may sit there forever and eat up your disk space.
Lines starting with "!" define a command to run. You can specify the command line and the environment, and you can also redirect STDIN and STDOUT. For example:
! tmp/stanford-chinese-segmenter-2008-05-21/segment.sh ctb tmp/ch.1 UTF-8 0 {#} {#} {#} tmp/ch.1.out
This runs the command line
tmp/stanford-chinese-segmenter-2008-05-21/segment.sh ctb tmp/ch.1 UTF-8 0
and redirects standard output to
tmp/ch.1.out
You may notice the {#} symbol, which separates the command line, the environment, the standard input redirection, and the standard output redirection. The order is:
! COMMAND-LINE {#} ENVIRONMENT {#} STDIN {#} STDOUT
If you do not want to set or redirect something, just put a space in that field; you cannot omit the field itself.
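Since every field has to be present even when it is blank, it may be convenient to assemble "!" lines with a small helper instead of typing the {#} separators by hand. The sketch below is an assumption about how one might do that; make_bang_line is a hypothetical name, and because the text does not describe what the ENVIRONMENT field should contain, the helper simply passes through whatever string it is given.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Hadoop Daemon): assemble a "!" line in the
# order COMMAND-LINE {#} ENVIRONMENT {#} STDIN {#} STDOUT.
#   $1 = command line, $2 = environment, $3 = STDIN file, $4 = STDOUT file
# Empty fields are kept as blanks so that all three {#} separators appear.
make_bang_line() {
    local line="! $1"
    local field
    for field in "${2-}" "${3-}" "${4-}"; do
        if [ -n "$field" ]; then
            line+=" {#} $field"
        else
            line+=" {#}"
        fi
    done
    # A blank last field still needs its single space after the separator.
    [ -n "${4-}" ] || line+=" "
    printf '%s\n' "$line"
}

# Possible usage, reproducing the segmenter line from the sample:
#   make_bang_line 'tmp/stanford-chinese-segmenter-2008-05-21/segment.sh ctb tmp/ch.1 UTF-8 0' '' '' 'tmp/ch.1.out'
```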
After the command is executed, we may need to copy the result back to HDFS, which is specified by the ">" symbol. For example,
> tmp/ch.1.out /user/qing/Chinese_Segementer_Hadoop_Daemon_Test/ch.1.out
will copy
tmp/ch.1.out
to
/user/qing/Chinese_Segementer_Hadoop_Daemon_Test/ch.1.out
You have to put COMMIT on the last line of the file. Without it, the commands cannot be sent. This was designed to allow other commands to be added later, such as shutting down or cleaning temporary directories, but for now COMMIT is the only one, and it has to be at the end of the file.
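Putting the pieces together, the Condor-style workflow (split the input, one job per piece) could be scripted roughly as follows. This is a sketch under assumptions: make_command_file and ends_with_commit are hypothetical helpers, the HDFS paths in the usage comment are placeholders, and the program is assumed to take the staged file as its last argument and write its result to standard output.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of Hadoop Daemon).

# Print a command file that stages one HDFS file into tmp/, runs a program on
# it with STDOUT redirected to tmp/<name>.out, and copies that file back.
#   $1 = HDFS input, $2 = HDFS output, remaining args = program and its arguments
make_command_file() {
    local hdfs_in=$1 hdfs_out=$2
    shift 2
    local name
    name=$(basename "$hdfs_in")
    printf '< %s tmp/%s\n' "$hdfs_in" "$name"                            # preparation
    printf '! %s tmp/%s {#} {#} {#} tmp/%s.out\n' "$*" "$name" "$name"   # command
    printf '> tmp/%s.out %s\n' "$name" "$hdfs_out"                       # output
    printf 'COMMIT\n'                                                    # must be last
}

# Return 0 only if the last line of the given file is exactly COMMIT.
ends_with_commit() {
    [ "$(tail -n 1 "$1")" = "COMMIT" ]
}

# Possible usage, one command file per input piece (paths are placeholders):
#   for i in 1 2 3; do
#       make_command_file "/user/me/in/ch.$i" "/user/me/out/ch.$i.out" wc -l > "job.$i"
#       ends_with_commit "job.$i" && ./run-client.sh localhost "$CONTROLPORT" "job.$i"
#   done
```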