We can add a datanode to an existing cluster and use it for data storage and data processing without restarting the cluster.
Here is how you can do this.
- I assume that we have already added a user with passwordless SSH authentication.
- Install Hadoop on this machine, using the same version of Hadoop that is running on the other datanodes and the namenode.
- To participate in the cluster, the new node needs to know the addresses of the namenode and the jobtracker. For this, we can copy the settings from another datanode or from the namenode (see the configuration sketch below).
- You can manually copy the settings from the core-site.xml, hdfs-site.xml, mapred-site.xml, masters, and slaves files, and don't forget to export JAVA_HOME in hadoop-env.sh. Or, better:
- You can use the rsync utility to copy the conf directory.
rsync -a namenodeLocation:$HADOOP_HOME/conf $HADOOP_HOME
Here $HADOOP_HOME is the directory where you have installed Hadoop on your nodes; the command copies the entire conf directory from the namenode.
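However you copy the configuration, the important thing is that the new node ends up pointing at the namenode and the jobtracker. A minimal sketch of the relevant properties follows; the hostname namenode-host and the ports are illustrative, so use whatever your cluster actually runs on.
core-site.xml (tells the node where the namenode is):
  <configuration>
    <property>
      <name>fs.default.name</name>
      <value>hdfs://namenode-host:9000</value>
    </property>
  </configuration>
mapred-site.xml (tells the node where the jobtracker is):
  <configuration>
    <property>
      <name>mapred.job.tracker</name>
      <value>namenode-host:9001</value>
    </property>
  </configuration>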
Once the configuration is in place, start the datanode process by issuing the following command:
$HADOOP_HOME/bin/hadoop-daemon.sh start datanode
Now our new datanode is ready to store data.
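Optionally, you can confirm that the new datanode has registered with the namenode; the dfsadmin report lists the live datanodes and their capacity:
$HADOOP_HOME/bin/hadoop dfsadmin -report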
To make it participate in MapReduce jobs, we also need to start the tasktracker process on this node. Issue the following command:
$HADOOP_HOME/bin/hadoop-daemon.sh start tasktracker
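As a quick sanity check, you can list the running Java processes on the new node with jps (it ships with the JDK); the output should now include DataNode and TaskTracker entries:
jps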
Rebalancing the cluster
When we add new datanodes to a Hadoop cluster, HDFS will not rebalance automatically, so we need to use the balancer tool provided by Hadoop. This tool balances the data blocks across the cluster.
It is very helpful if we are running out of space on the existing nodes. Optionally, we can specify a threshold percentage with the command; by default it is 10%. The threshold is how far, in percentage points of disk capacity, a node's utilization may differ from the cluster average.
Issue the following command to re-balance the cluster:
$HADOOP_HOME/bin/start-balancer.sh
To use a different threshold, pass it explicitly, for example:
$HADOOP_HOME/bin/start-balancer.sh -threshold 5