Category Archives: Uncategorized

Hadoop Services Starting and stopping Sequence (Mostly in HDP)

All installed services should be started in a specific order. The suggested order is:

1) Knox
2) ZooKeeper
3) HDFS
4) YARN
5) HBase
6) Hive Metastore
7) HiveServer2
8) WebHCat
9) Oozie
10) Storm
11) Kafka

Running services in the cluster should be stopped in a specific order. The suggested order is (see the Ambari REST API sketch after the list):

1) Knox
2) Oozie
3) WebHCat
4) HiveServer2
5) Hive Metastore
6) HBase
7) YARN
8) HDFS
9) ZooKeeper
10) Storm
11) Kafka
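If you manage the cluster with Ambari, these sequences can also be scripted against the Ambari REST API, one service at a time in the order listed above. This is only a sketch: the host, port, credentials and cluster name are assumptions, and in the Ambari API a desired state of STARTED starts a service while INSTALLED stops it.

# Assumed values: Ambari at ambari-host:8080, login admin/admin, cluster "MyCluster"
AMBARI="http://ambari-host:8080/api/v1/clusters/MyCluster"

# Stop a service (set desired state to INSTALLED); repeat per service in the stop order above
curl -u admin:admin -H "X-Requested-By: ambari" -X PUT \
     -d '{"RequestInfo":{"context":"Stop KNOX"},"Body":{"ServiceInfo":{"state":"INSTALLED"}}}' \
     "$AMBARI/services/KNOX"

# Start a service (set desired state to STARTED); repeat per service in the start order above
curl -u admin:admin -H "X-Requested-By: ambari" -X PUT \
     -d '{"RequestInfo":{"context":"Start ZOOKEEPER"},"Body":{"ServiceInfo":{"state":"STARTED"}}}' \
     "$AMBARI/services/ZOOKEEPER"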


Resetting Ambari Admin password

I was trying to set up Ambari to authenticate against Active Directory, and in the meantime I tried logging in to the Ambari portal. I tried the admin user ID and its password, but it did not work. Thinking I might have forgotten the credentials, I tried resetting them. There is no direct way to reset (that I know of), but you can follow the steps below to set it back to the default ID/password: admin/admin.

  1. Stop the Ambari server ('ambari-server stop')
  2. Log on to the Ambari server host (the node on which the Ambari server is installed)
  3. Run 'psql -U ambari'
  4. Enter the password (this password is stored in /etc/ambari-server/conf/password.dat; the default is bigdata)
  5. At the psql prompt (ambari=>), run:
  6. update ambari.users set user_password='538916f8943ec225d97a9a86a2c6ec0818c1cd400e09e03b660fdaaec4af29ddbb6f2b1033b81b00' where user_name='admin';
  7. Quit psql
  8. Restart the Ambari server ('ambari-server restart')

PS: The password hash used in step 6 corresponds to 'admin'. A non-interactive version of these steps is sketched below.
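If you prefer to do it in one shot, the following sketch runs the same update non-interactively. It assumes the default embedded PostgreSQL setup (database and user both named 'ambari', password 'bigdata') and the stock hash for 'admin' shown in step 6; adjust if your installation differs.

# stop Ambari, update the stored hash for the admin user, then restart
ambari-server stop
PGPASSWORD=bigdata psql -U ambari -d ambari -c "update ambari.users set user_password='538916f8943ec225d97a9a86a2c6ec0818c1cd400e09e03b660fdaaec4af29ddbb6f2b1033b81b00' where user_name='admin';"
ambari-server restart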

Fixing: Ambari Agent Disk Usage Alert Critical

Hi,

I recently faced this issue. Even though I have 3 TB of disk attached to the machines, I used to frequently see the Ambari Agent Disk Usage red alerts. It was a production cluster that was supposed to be handed over to the customer, and this error started appearing on the same day. I tried googling and searching the Ambari forums, but it was merely a waste of time. It was a bit of a scary situation and I was under a lot of pressure. :)

Sharing the solution here so that others can benefit and fix it easily, without the pressure 😉

Capacity Used: [59.05%, 18.0 GB], Capacity Total: [30.5 GB], path=/usr/hdp
Capacity Used: [56.50%, 17.3 GB], Capacity Total: [30.5 GB], path=/usr/hdp
Capacity Used: [60.61%, 18.5 GB], Capacity Total: [30.5 GB], path=/usr/hdp

Along with this, you may also see this kind of error:

1/1 local-dirs are bad: /hadoop/yarn/local; 1/1 log-dirs are bad: /hadoop/yarn/log

Generally this happens because YARN applications generate a lot of temporary data while executing jobs, and, as shown above, Ambari is checking only the /usr/hdp path by default.

Fixing it is relatively easy:

  1. Create the log directories and the temporary directories for intermediate data, and set the owner of these directories to the yarn user.

mkdir -p /mnt/datadrive01/hadoop/yarn/local
mkdir -p /mnt/datadrive01/hadoop/yarn/log

(Repeat for each data mount; a loop sketch for this follows below.)

chown yarn:hadoop /mnt/datadrive01/hadoop/yarn/log
chown yarn:hadoop /mnt/datadrive01/hadoop/yarn/local
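If you have several data drives, a small loop saves some typing. This is just a sketch; the mount names are the same hypothetical /mnt/datadriveNN paths used above, so substitute your own.

for mount in /mnt/datadrive01 /mnt/datadrive02 /mnt/datadrive03 /mnt/datadrive04; do
    # create the YARN local and log directories on each data mount
    mkdir -p "$mount/hadoop/yarn/local" "$mount/hadoop/yarn/log"
    # hand them over to the yarn user (group hadoop), since YARN writes here
    chown yarn:hadoop "$mount/hadoop/yarn/local" "$mount/hadoop/yarn/log"
done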

  2. Change the following two properties in the NodeManager configuration:

yarn.nodemanager.local-dirs=/mnt/datadrive01/hadoop/yarn/local,/mnt/datadrive02/hadoop/yarn/local,/mnt/datadrive03/hadoop/yarn/local,/mnt/datadrive04/hadoop/yarn/local

yarn.nodemanager.log-dirs=/mnt/datadrive01/hadoop/yarn/log,/mnt/datadrive02/hadoop/yarn/log,/mnt/datadrive03/hadoop/yarn/log,/mnt/datadrive04/hadoop/yarn/log

Restart the affected components and you are done.

Good luck.

-A

Status_Code 403 while setting up Spark Component on the HDP 2.3 cluster

Hi,

Recently I was trying to set up the Spark components on a Hortonworks 2.3 cluster. During the installation I got something like this: http://<serverName>:50070/webhdfs/v1/user/spark?op=MKDIRS&user.name=hdfs returned status_code=403

The complete stack trace from the Ambari server looks like this:

Traceback (most recent call last):
File "/var/lib/ambari-agent/cache/common-services/SPARK/1.2.0.2.2/package/scripts/job_history_server.py", line 90, in <module>
JobHistoryServer().execute()
File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py", line 218, in execute
method(env)
File "/var/lib/ambari-agent/cache/common-services/SPARK/1.2.0.2.2/package/scripts/job_history_server.py", line 54, in start
self.configure(env)
File "/var/lib/ambari-agent/cache/common-services/SPARK/1.2.0.2.2/package/scripts/job_history_server.py", line 48, in configure
setup_spark(env, 'server', action = 'config')
File "/var/lib/ambari-agent/cache/common-services/SPARK/1.2.0.2.2/package/scripts/setup_spark.py", line 44, in setup_spark
mode=0775
File "/usr/lib/python2.6/site-packages/resource_management/core/base.py", line 157, in __init__
self.env.run()
File "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", line 152, in run
self.run_action(resource, action)
File "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", line 118, in run_action
provider_action()
File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 390, in action_create_on_execute
self.action_delayed("create")
File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 387, in action_delayed
self.get_hdfs_resource_executor().action_delayed(action_name, self)
File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 246, in action_delayed
self._create_resource()
File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 256, in _create_resource
self._create_directory(self.main_resource.resource.target)
File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 280, in _create_directory
self.util.run_command(target, 'MKDIRS', method='PUT')
File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 201, in run_command
raise Fail(err_msg)
resource_management.core.exceptions.Fail: Execution of 'curl -sS -L -w '%{http_code}' -X PUT 'http://serverName:50070/webhdfs/v1/user/spark?op=MKDIRS&user.name=hdfs'' returned status_code=403.

Well this issue is relatively simple.

Check your NameNode status; most of the time this issue occurs because the NameNode is in safe mode.

You can either wait for the NameNode to leave safe mode or force it out. Generally, when the NameNode is configured in High Availability mode, it takes time to come out of safe mode after a cluster restart because of the number of existing blocks, and block reports need to be sent to both the active and standby NameNodes.
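To check the safe mode status and, if you cannot wait, force the NameNode out of it, the standard dfsadmin commands can be used (run them as the hdfs user; forcing safe mode off on an unhealthy cluster is at your own risk):

sudo -u hdfs hdfs dfsadmin -safemode get     # prints "Safe mode is ON" or "Safe mode is OFF"
sudo -u hdfs hdfs dfsadmin -safemode leave   # force the NameNode out of safe mode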

Once the NameNode is out of safe mode, try restarting the install process for the Spark component; it should go smoothly.

Cheers.

Configuring a single-node Storm cluster

The following steps are for Ubuntu 12.04 LTS or 14.04 LTS. Storm has a lot of moving parts at the moment, and the easiest configuration is on Ubuntu. I tried configuring it on CentOS and found it quite challenging; before trying a Storm configuration on CentOS, I suggest you first try it on Ubuntu.

——————————- Prerequisites —————————-

  1. Make sure your Ubuntu installation is updated. You can update it using: $ sudo apt-get update
  2. Install your favorite JDK. For example: $ sudo apt-get install openjdk-6-jdk

————————– Other required tools —————————–

$ sudo apt-get install git -y

$ sudo apt-get install libtool -y

$ sudo apt-get install automake -y

$ sudo apt-get install uuid-dev

$ sudo apt-get install g++ -y

$ sudo apt-get install gcc-multilib -y

————————- Zookeeper————————-

ZooKeeper provides a service for maintaining centralized information in a distributed environment using a small set of primitives and group services. Storm uses ZooKeeper primarily to coordinate state information such as task assignments, worker status, and topology metrics between the nimbus and the supervisors in a cluster.

1) Get ZooKeeper. Download the ZooKeeper setup (the latest at the time of writing is 3.4.6). You can download it from a browser or with wget:
$ wget http://www.eng.lsu.edu/mirrors/apache/zookeeper/stable/zookeeper-3.4.6.tar.gz
2) Extract the tarball:
$ tar -xvf zookeeper-3.4.6.tar.gz
3) Rename the extracted ZooKeeper directory:
$ mv zookeeper-3.4.6 zookeeper
4) Optionally:
a) Add ZOOKEEPER_HOME in .bashrc
b) Add ZOOKEEPER_HOME/bin to the PATH variable
5) Create a data directory in your favorite place:
$ mkdir zookeeper-data/
6) Create a configuration file, say zoo.cfg, under the ZOOKEEPER_HOME/conf/ directory.
7) Add the tickTime, dataDir and clientPort properties in the zoo.cfg file.
8) Verify that you are able to start the ZooKeeper server:
$ zkServer.sh start

———————- ZeroMQ ———————————-

Storm internally uses ZeroMQ. In the current version it has to be installed explicitly; in future releases they are planning to include this dependency as part of the Storm distribution.

1) Get ZeroMQ:
$ wget http://download.zeromq.org/zeromq-2.1.7.tar.gz
2) Untar the tarball:
$ tar -xvf zeromq-2.1.7.tar.gz

————- Configuring ZeroMQ ——————-

1) $ cd zeromq-2.1.7
2) $ ./configure
3) $ make
4) $ sudo make install

————————– Java bindings for ZeroMQ —————————–

1) Get the Java binding for ZeroMQ:
$ git clone https://github.com/nathanmarz/jzmq.git
This will create a folder with the name jzmq.

Configuring jzmq:
1) $ cd jzmq
2) $ sed -i 's/classdist_noinst.stamp/classnoinst.stamp/g' src/Makefile.am
3) $ ./autogen.sh
4) $ ./configure
5) $ make
6) $ sudo make install

————————————– Configuring Storm ————————————

1) Download the Storm binaries:
$ wget http://mirror.tcpdiag.net/apache/incubator/storm/apache-storm-0.9.1-incubating/apache-storm-0.9.1-incubating.tar.gz
2) Untar the tarball:
$ tar -xvf apache-storm-0.9.1-incubating.tar.gz
3) Rename the extracted directory to storm:
$ mv apache-storm-0.9.1-incubating storm
4) Optionally: add STORM_HOME in the .bashrc file and add STORM_HOME/bin to the PATH.
5) Add a data directory for Storm to store its temporary data and topology jars. I am creating it under $STORM_HOME:
$ cd $STORM_HOME
$ mkdir data
6) Edit the $STORM_HOME/conf/storm.yaml file and set the values of the various parameters. This is very important, and it is the major part of the Storm configuration; a sample single-node configuration sketch follows below.
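As referenced in step 6, here is what a minimal single-node storm.yaml might look like. Treat it as a sketch: the hostnames, the local directory and the port values are assumptions, so point storm.local.dir at the data directory you created in step 5 and adjust the ports to whatever is free on your machine (this also assumes STORM_HOME is set in your shell).

# append a minimal single-node configuration to storm.yaml (values are assumptions)
cat >> $STORM_HOME/conf/storm.yaml <<'EOF'
storm.zookeeper.servers:
  - "localhost"
nimbus.host: "localhost"
storm.local.dir: "/home/ubuntu/storm/data"
ui.port: 8080
supervisor.slots.ports:
  - 6700
  - 6701
  - 6702
  - 6703
EOF

The supervisor.slots.ports list controls how many worker slots (worker JVMs) this single node can run.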

 

————- Start the daemons and verify that the installation is successful ———————

Note: if you have not added STORM_HOME/bin to the PATH, then you will need to go to the STORM_HOME/bin directory and issue the commands on the terminal.

1) Start Nimbus:
$ storm nimbus

2) Start the Supervisor (do not close the previous terminal; open another terminal window and run):
$ storm supervisor

3) Start the UI (open a new terminal, change the directory to storm and start the UI; don't close the previous terminals):
$ storm ui

4) Check the UI (hit the URL using the machine's IP address and the UI port defined in the storm.yaml file).

 

Troubleshooting Notes:

The two most commonly faced issues are:

1) Exception in thread “main” java.lang.RuntimeException: org.apache.thrift7.transport.TransportException: 

java.net.ConnectException: Connection refused 

2) Nimbus starts, but stops after a few seconds.

The problem could be one of the following:

  1. Nimbus is not started correctly, or Nimbus stopped some time after starting. Check the Nimbus logs for errors.
  2. Nimbus or ZooKeeper is not correctly configured. Please check the storm.yaml file.
  3. The dataDir defined for ZooKeeper (in zoo.cfg) does not exist or has permission issues.
  4. The storm.local.dir defined in storm.yaml does not exist or has permission issues.
  5. There could be connectivity issues between the machines. Please check your network settings.

Happy STORMing!

Recently added data node is not coming up

  • The namespaceID of the master (NameNode) and all slaves (DataNodes) should match.
  • If any of the machines has a mismatch, you will get an 'Incompatible namespaceID' error while starting the cluster.
  • Due to the mismatch, the DataNode process will not start.
  • Whenever a DataNode/NameNode does not start, check for this issue (it generally happens when you add an additional node to the cluster).
  • Every time the NameNode is formatted, a new namespace ID is allocated, so re-formatting is not an option!

Solution :

Open the VERSION file at <hadoop-temp-dir>/dfs/name/current/VERSION on the NameNode and check the namespaceID property. Copy this ID and check whether it is the same as the one in <hadoop-temp-dir>/dfs/data/current/VERSION on the DataNode. If not, replace the one in the DataNode's VERSION file with the NameNode's value.
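A quick way to compare and fix the IDs from the shell (a sketch; replace /hadoop/tmp with your actual dfs.name.dir / dfs.data.dir locations, and <namespaceID-from-NameNode> with the value you copied):

grep namespaceID /hadoop/tmp/dfs/name/current/VERSION    # run on the NameNode
grep namespaceID /hadoop/tmp/dfs/data/current/VERSION    # run on the DataNode

# if they differ, copy the NameNode's value into the DataNode's VERSION file
sed -i 's/^namespaceID=.*/namespaceID=<namespaceID-from-NameNode>/' /hadoop/tmp/dfs/data/current/VERSION

Then restart the DataNode process on that machine.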

Before beginning with MapReduce: Part 1

Before starting with the MapReduce framework, it's very important to know a few concepts. I've mentioned a few here and will be adding more soon. This can also serve as a quick revision for those who already know them. 🙂

  • The MapReduce framework works by breaking the computation into two parts or phases: the map phase and the reduce phase. For both phases, key-value pairs are the input and key-value pairs are the output. During the map phase, the programmer-defined map function runs on the input key-value pairs and generates intermediate key-value pairs; the intermediate key-value pairs are consumed by the user-defined reduce function, which again generates key-value pairs as output.

      We can say: (K1, V1) -> Map -> (K2, V2) (map output) -> Reduce -> (K3, V3) (final output).

  • The intermediate output (the K2-V2 pairs from the map functions) is processed by the Hadoop MapReduce framework before it is sent to the reducer: the framework sorts and groups the pairs by key.
  • The map function is a good place to drop non-matching or bad records. (This can also be done by writing a custom RecordReader.) A small command-line sketch of this flow follows below.
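To make the (K1,V1) -> (K2,V2) -> (K3,V3) flow concrete, here is a hedged sketch using Hadoop Streaming (a command-line analogue of the Java API) with two tiny shell scripts as the mapper and reducer. The streaming jar path and the HDFS paths are assumptions and will differ per distribution and version.

# mapper.sh (roughly): V1 is a line of text; emit one "word<TAB>1" pair (K2,V2) per word
#   tr -s ' ' '\n' | awk '{print $0 "\t1"}'
# reducer.sh (roughly): lines arrive sorted and grouped by key; sum the counts per word (K3,V3)
#   awk -F'\t' '{count[$1] += $2} END {for (w in count) print w "\t" count[w]}'

hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar \
    -input  /user/hdfs/input \
    -output /user/hdfs/wordcount-out \
    -mapper  mapper.sh \
    -reducer reducer.sh \
    -file mapper.sh -file reducer.sh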

 

What is a Map Reduce Job?

A MapReduce job is a set of instructions, computation, or processing that the client wants to perform on the input data.

A MapReduce job consists of three things:

  1. The MapReduce program (i.e. the Mapper, Reducer, and Job classes)
  2. Input data
  3. Configuration information

Hadoop runs the given job by splitting it into various "tasks", which are executed in parallel on the cluster.

‘Nodes’ of Map-Reduce:

There are two types of daemons which are required to run any MR job. 

  1. Job Tracker (Master process)
  2. Task Tracker (Slave process)

The JobTracker manages job execution on the cluster by scheduling tasks to run on the TaskTrackers. TaskTrackers are the actual workhorses of the system, on which the computation happens.

Any TaskTracker can run two kinds of tasks:

  1. Map tasks
  2. Reduce tasks

All the TaskTrackers report to the JobTracker periodically by sending their status (progress) messages, so that the JobTracker can keep track of the overall progress of the given job.

If any TaskTracker goes down, the JobTracker reschedules the 'failed' tasks on another TaskTracker.

What is a split?

The input data processed by a single map task is known as an input split. Hadoop runs one map task per split and runs the programmer-defined map function for each input record of the split.

By default the HDFS block size is 64 MB. Typically the input split size for the job is also kept at 64 MB.

Generally they are matched to guarantee that the data required by the map function is processed locally.

For example, if the HDFS block size = 32 MB and the input split size = 128 MB, we need four blocks of data for a single map task. Since Hadoop doesn't write all the blocks on the same node, the map task might have to copy three blocks from other nodes, which leads to network traffic.

If split size = block size, say both are 64 MB, then it is guaranteed that the data for the map task will be found locally.

Hadoop tries its best to run a map task locally (on the node where its input data is stored); this helps reduce cluster bandwidth wastage. Reduce functions cannot take advantage of data locality, as they are fed from every mapper.

Map outputs (the K2-V2 key-value pairs) are written to the local hard disk of the machine and not to HDFS, because they are intermediate output and are discarded once the reduce phase is over. If we wrote the intermediate output to HDFS with replication, we would waste a lot of disk space and network bandwidth.

Reduce outputs, unlike map outputs, are written to HDFS.

Generally, if we have 'm' maps and 'r' reducers, then the output of each map will be fed to each reducer; hence m*r distinct copy operations will happen.

What if my map function finished processing but the reducer could not copy its output?

In this scenario, the JobTracker will re-run the map task on some other TaskTracker, and the map function will re-create the output.

Number of map tasks = number of input splits = number of blocks (if split size = HDFS block size).

Number of reducers = ? It is not decided by the splits; it can be specified on the command prompt, in the configuration XML, or through the API in the driver class (more on this later; a quick sketch follows below).
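For example, assuming the driver class uses ToolRunner/GenericOptionsParser, the reducer count can be passed on the command line. The jar and class names here are hypothetical, and on old Hadoop 1.x the property is mapred.reduce.tasks instead of mapreduce.job.reduces.

# run the job with 4 reducers; the equivalent API call in the driver is job.setNumReduceTasks(4)
hadoop jar my-job.jar com.example.MyDriver \
    -D mapreduce.job.reduces=4 \
    /user/hdfs/input /user/hdfs/output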

What is a combiner?

A combiner can be thought of as a 'map-side reducer' or a 'local reducer'.

To optimize network bandwidth, Hadoop provides a facility to specify a combiner. The combiner runs on each map's output, and the output of the combiner becomes the input of the reducer. The signature and the interface to be implemented for a combiner are the same as those of the Reducer, so many times we can use the Reducer class as-is for the combiner.

It is important to check whether we can use a combiner for the MR job; this can drastically improve performance, as there will be less sorting and grouping work to be done on the reducer side.

Hadoop doesn't guarantee whether our combiner will run, or how many times it will run if it does, so we need to make sure that our job doesn't rely on the combiner.
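In the streaming sketch from earlier, enabling a combiner is just one extra option; the reducer script is reused as the combiner here, which is only safe because summing counts is associative and commutative. Same assumptions about the jar path and scripts as before.

hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar \
    -input  /user/hdfs/input \
    -output /user/hdfs/wordcount-combined-out \
    -mapper  mapper.sh \
    -combiner reducer.sh \
    -reducer reducer.sh \
    -file mapper.sh -file reducer.sh
# in a Java job the equivalent is calling job.setCombinerClass(MyReducer.class) in the driver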

 

Partitioner:

When we have more than one reducer, it's important to distribute the load equally and, more importantly, to make sure that all the values for any given key go to a single reducer.

For example:

Say we have 10 records that all share the same key 'a'. It is useless to send some of the records to one reducer and some to other reducers; all 10 records have to be reduced in a single reduce call.

By default the MR framework uses the HashPartitioner. We can override the partitioner by writing our own (generally the case with secondary sorting and reduce-side joins).
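As a sketch of plugging in a non-default partitioner, Hadoop Streaming ships KeyFieldBasedPartitioner, which partitions on selected key fields; the property names below are taken from the Hadoop 2.x streaming documentation and may differ on other versions. In a Java job you would instead call job.setPartitionerClass(...) with your own Partitioner implementation.

hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar \
    -D stream.num.map.output.key.fields=2 \
    -D mapreduce.partition.keypartitioner.options=-k1,1 \
    -input  /user/hdfs/input \
    -output /user/hdfs/partitioned-out \
    -mapper  mapper.sh \
    -reducer reducer.sh \
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
    -file mapper.sh -file reducer.sh
# records whose keys share the first field now land on the same reducer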

Map only jobs:

Some jobs can be executed in parallel with their output generated by the map functions alone; these are known as map-only jobs. We can make a job map-only by specifying the number of reducers = 0 (on the command prompt while running the job, or through the API when writing the driver class).

For map-only jobs, the output of the map functions is written and replicated on HDFS.
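A map-only job in the same streaming style might look like this (again a sketch with assumed paths): with zero reducers, the filtered map output goes straight to HDFS.

hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar \
    -D mapreduce.job.reduces=0 \
    -input  /user/hdfs/input \
    -output /user/hdfs/maponly-out \
    -mapper 'grep ERROR'
# no reducer runs; each mapper's output is written (and replicated) directly to HDFS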