Input path does not exist: file:……………………………./pigsample_1406502801_1378470046724

Hi guys,

Here is one more issue that is very specific to Cygwin + PIG.

You may see "Input path does not exist: <some path>/pigsample_somenumber" on Cygwin when a script uses an ORDER BY clause. It took me some time to figure out that it was caused by the ORDER BY.

Commonly you will see a stack trace like this:

2013-09-06 17:50:52,110 [Thread-118] WARN org.apache.hadoop.mapred.LocalJobRunner – job_local_0008
java.lang.RuntimeException: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: file:/E:/<directory from grunt started>/pigsample_1406502801_1378470046724

at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.setConf(WeightedRangePartitioner.java:157)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:62)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
at org.apache.hadoop.mapred.MapTask$NewOutputCollector.<init>(MapTask.java:677)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:756)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:214)
Caused by: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: file:/E:/<directory from grunt started>/pigsample_1406502801_1378470046724

Solution:

Instead of a top-level ORDER BY, you can use a nested ORDER inside a FOREACH:

A2 = foreach A1 {

    A3 = ORDER A0 by fieldName;

    GENERATE $0, $1, …;

}

Hive and PIG/Grunt shell hangs on Cygwin

Again, this is a typical issue related to Cygwin.

Scenario:

I am running Hadoop in local mode on my Windows 7 machine (32/64-bit).

I've installed HIVE, PIG, Hadoop, and Java 6, all on the C: drive.

I am using Cygwin version 2.819 (the current latest).
I've mounted C: in Cygwin.
I am able to run hadoop commands from the Cygwin terminal, for example fs -ls, etc.
I am also able to start the grunt and hive shells.

But the real problem is:

For any command I enter at the grunt shell (for example fs -ls or records = LOAD.....), I do not see any output; it just hangs. Similarly, at the hive prompt, if I run show tables; I see no output, just a blinking cursor. Keyboard input produces NOTHING, and the system appears to be doing NOTHING.

Everything looks fine to me: all the environment variables are set correctly. I am not sure what is going wrong here!

Wow!!! I spent hours fixing this!

The issue is with the desktop icon/shortcut that Cygwin creates.

If you right-click the icon -> Properties, you will see something like this in the Target field:

<cygwin_home>\bin\mintty.exe -i /Cygwin-Terminal.ico -

Just point it to:

<cygwin_home>\Cygwin.bat -i /Cygwin-Terminal.ico -

Alternatively, you can go to <cygwin_home> and start Cygwin.bat from a command prompt.

Cheers!

NoSuchMethodError while using joda-time-2.2.jar in PIG

Hi Guys,

I spent many hours on this before I found the solution.

Our UDF uses some Joda-Time APIs: the code compiles fine in Eclipse, but the job fails while running. You might face this issue in Eclipse as well, because the pig-<version>.jar also bundles the joda package. Just put joda-time-<version>.jar first on the classpath (before pig.jar) and your issue will be fixed.

I was running this under Cygwin on a Windows machine, but the same issue can be seen on a Linux box as well.

A common stack trace you might see:

2013-08-11 13:01:06,911 [Thread-9] WARN org.apache.hadoop.mapred.LocalJobRunner – job_local_0001

java.lang.NoSuchMethodError: org.joda.time.DateTime.now(Lorg/joda/time/DateTimeZone;)Lorg/joda/time/DateTime;
at com.myproject.pig.udf.ExtractDataByDates.exec(ExtractDataByDates.java:178)
at com.myproject.pig.udf.ExtractDataByDates.exec(ExtractDataByDates.java:12)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:337)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:381)

I tried many options:

1) Registering the Joda jar in the PIG script using a REGISTER call. (Did not work)

2) Using -Dpig.additional.jars=/path/to/joda-time/jar. (Did not work)

3) Setting the jar in $HADOOP_CLASSPATH. (Did not work)

4) Setting the jar in $CLASSPATH. (Did not work)

5) Setting the jar in $PIG_CLASSPATH. (This works!)

export PIG_CLASSPATH=$PIG_CLASSPATH:/path/to/joda-time-2.2.jar

Cannot locate pig.jar. do ‘ant jar’, and try again

Hi folks,

I was trying to set up PIG on my gateway machine, which has Windows 7 installed.

This issue is very specific to Cygwin.

After breaking my head over it for a couple of hours, I found the solution, and it is very simple:

Just rename the jar from “pig-0.10.1-withouthadoop.jar” to “pig-withouthadoop.jar”.

Namenode doesn’t start after upgrading Hadoop version

I have copied all the files correctly and all my jars are in place. I've set all the important properties correctly (namenode address, etc.). I have formatted HDFS. Now I am trying to start the cluster.

If you are running a cluster on version 0.20 or earlier and you upgrade to 1.0.4 or above, then when you restart the cluster with start-all.sh, all the daemons should start on the respective machines (masters and slaves). But my namenode was not starting...

This is due to changes in the HDFS file system layout itself. If you check the namenode log you will see:

File system image contains an old layout version -18. An upgrade to version -32 is required.

Solution:

Very simple:

There is no need to stop the other daemons (you can stop them if you want, but it is not required).

Run start-dfs.sh -upgrade (here the -upgrade flag is mandatory).

Using jps you can see that the namenode is now running as well!

Common problem while copying data from source to HDFS using Flume

Flume java.lang.ClassNotFoundException: org.apache.hadoop.io.SequenceFile$CompressionType

Scenario: I wanted to copy logs from the source to HDFS. The HDFS daemons are up and running on the cluster. I've pointed the sink to HDFS, but when I try to start the agent it does not start. On checking the log files I see a stack trace like this:

Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.io.SequenceFile$CompressionType
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:423)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:356)

It is very clear that Flume is not able to find the expected class on the classpath; hence the solution:
Copy your hadoop-core-xyz.jar to the $FLUME_HOME/lib directory.

Note: If your Hadoop cluster is running a 0.20 version, copying this file will fix the ClassNotFoundException, but you will end up getting authentication errors. Try using a stable 1.0.x version.

Before beginning with MapReduce: Part 1

Before starting with the MapReduce framework, it is very important to know a few concepts. I've mentioned a few here and will be adding more soon. This can also serve as a quick revision for those who already know them. 🙂

  • The MapReduce framework works by breaking the computation into two parts or phases: a map phase and a reduce phase. For both phases, key-value pairs are the input and key-value pairs are the output. During the map phase, the programmer-defined map function runs on the input key-value pairs and generates intermediate key-value pairs; these intermediate pairs are then consumed by the user-defined reduce function, which again generates key-value pairs as the final output. (See the code sketch just after this list.)

      We can say: (K1, V1) -> map -> (K2, V2) intermediate output -> reduce -> (K3, V3) final output.

  • The intermediate output (the (K2, V2) pairs emitted by the map functions) is processed by the Hadoop MapReduce framework before it is sent to the reducer: the framework sorts and groups the pairs by key.
  • The map function is a good place to drop non-matching or bad records. (This can also be done by writing a custom RecordReader.)
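As a rough sketch of these two phases, here is a minimal word-count style mapper and reducer, written against the "new" org.apache.hadoop.mapreduce API from the Hadoop 1.x line. The class and field names are only illustrative:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountExample {

    // Map phase: (K1 = byte offset, V1 = line of text) -> (K2 = word, V2 = 1)
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (token.isEmpty()) {
                    continue; // a good place to drop bad or non-matching records
                }
                word.set(token);
                context.write(word, ONE); // emit intermediate (K2, V2)
            }
        }
    }

    // Reduce phase: (K2 = word, list of V2) -> (K3 = word, V3 = total count)
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum)); // final (K3, V3)
        }
    }
}

Here the map phase emits (word, 1) pairs and the reduce phase sums the counts for each word.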

 

What is a MapReduce Job?

A MapReduce job is the set of instructions, computation, or processing that the client wants to perform on the input data.

A MapReduce job consists of three things:

  1. The MapReduce program (i.e. the Mapper, Reducer, and driver/Job classes)
  2. The input data
  3. The configuration information

Hadoop runs the given job by splitting it into various “tasks”, which are executed in parallel on the cluster.
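Putting the three pieces together, a minimal driver sketch might look like this (it assumes the WordCountExample classes from the sketch above; the input and output paths come from hypothetical command-line arguments):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = new Job();                          // 3. configuration (loaded from *-site.xml on the classpath)
        job.setJarByClass(WordCountDriver.class);
        job.setJobName("word count");

        FileInputFormat.addInputPath(job, new Path(args[0]));   // 2. input data
        FileOutputFormat.setOutputPath(job, new Path(args[1])); //    output directory (must not already exist)

        job.setMapperClass(WordCountExample.TokenMapper.class); // 1. the map-reduce program
        job.setReducerClass(WordCountExample.SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);       // submit the job and wait
    }
}

The job jar, input path, and output path are supplied when the job is submitted with the hadoop jar command.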

‘Nodes’ of Map-Reduce:

There are two types of daemons that are required to run any MR job:

  1. Job Tracker (Master process)
  2. Task Tracker (Slave process)

The JobTracker manages job execution on the cluster by scheduling tasks to run on the task trackers. The task trackers are the actual workhorses of the system, on which the computation happens.

Any task tracker can run two kinds of tasks:

  1. Map task
  2. Reduce task

All the task trackers report to the JobTracker periodically by sending status (progress) messages, so that the JobTracker can keep track of the overall progress of the given job.

If any task tracker goes down, the JobTracker reschedules its 'failed' tasks on another task tracker.

What is a split?

The input data for a single map task is known as an input split. Hadoop runs one map task per split and runs the programmer-defined map function for each input record of the split.

By default the HDFS block size is 64 MB, and typically the input split size for the job is also kept at 64 MB.

Generally they are matched, to guarantee that the data required by the map function is available locally.

For example, if the HDFS block size = 32 MB and the input split size = 128 MB, then we need four blocks of data for a single map task. Since Hadoop doesn't write all the blocks to the same node, the map task might have to copy three of those blocks from other nodes, which leads to network traffic.

If split size = block size, say both are 64 MB, then it is guaranteed that the data for a map task will be found locally.
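If you ever need to nudge the split size yourself, the new-API FileInputFormat exposes minimum and maximum split-size hints. A small sketch (Hadoop 1.x API; the sizes are only for illustration):

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeHints {
    // The framework computes the split size roughly as max(minSize, min(maxSize, blockSize)),
    // so these values are only hints relative to the HDFS block size.
    public static void apply(Job job) {
        FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);   // 64 MB minimum
        FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);  // 128 MB maximum
    }
}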

Hadoop tries its best to run each map task locally (on the node where its input data is stored); this helps reduce wasted cluster bandwidth. The reduce function cannot take advantage of data locality, as it is fed output from every mapper.

The map function's outputs (the (K2, V2) key-value pairs) are written to the local disk of the machine, not to HDFS, since they are intermediate output and are discarded once the reduce phase is over. If we wrote the intermediate output to HDFS with replication, we would waste a lot of disk space and network bandwidth.

Unlike map outputs, the reduce function's outputs are written to HDFS.

Generally, if we have 'm' maps and 'r' reducers, the output of each map is fed to every reducer, hence m*r distinct copy operations take place.

What if my map function finished processing but the reducer could not copy its output?

In this scenario, the JobTracker re-runs the map task on some other task tracker, and the map function re-creates the output.

Number of map tasks = number of input splits = number of blocks (if split size = HDFS block size).

Number of reducers = ? This is not decided by the splits; it can be specified on the command line, in the configuration XML, or through the API in the driver class. (More on this later.)
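For example (a sketch only; the -D form is picked up only if the driver uses ToolRunner/GenericOptionsParser, and mapred.reduce.tasks is the Hadoop 1.x property name):

// 1. Through the API, in the driver class:
job.setNumReduceTasks(4);

// 2. On the command line when submitting the job:
//    hadoop jar wordcount.jar WordCountDriver -D mapred.reduce.tasks=4 <input> <output>

// 3. In the configuration XML (mapred-site.xml) via the mapred.reduce.tasks property.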

What is a combiner?

A combiner can be thought of as a 'map-side reducer' or 'local reducer'.

To save network bandwidth, Hadoop provides a facility to specify a combiner. The combiner runs on each map's output, and the output of the combiner becomes the input of the reducer. The signature and the interface to be implemented for a combiner are the same as for a Reducer, and many times we can use the Reducer class as-is for the combiner.

It is important to check whether we can use a combiner for the MR job; it can drastically improve performance, as there is less sorting and grouping work to be done on the reducer side.

Hadoop does not guarantee whether our combiner will run at all, or how many times it will run, so we need to make sure that our job does not rely on the combiner for correctness.
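When the reduce logic is associative and commutative (like summing counts), the Reducer class can often double as the combiner. Continuing the hypothetical word-count sketch from above, the driver would add:

job.setCombinerClass(WordCountExample.SumReducer.class); // runs on each map's output, zero or more times
job.setReducerClass(WordCountExample.SumReducer.class);  // the real reducer still runs on the shuffled output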

 

Partitioner:

When we have more than one reducer, it is important to distribute the load evenly and, more importantly, to make sure that all the values for any given key go to a single reducer.

For example:

Suppose we have 10 records that share the same key 'a'. It is useless to send some of the records to one reducer and some to other reducers; all 10 records have to be reduced in a single reduce call.

By default the MR framework uses the default HashPartitioner. We can override the partitioner by writing our own (this is generally the case with secondary sorting and reduce-side joins).
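As a sketch, a custom partitioner with the new API could look like this (the "partition by the first character of the key" rule is purely illustrative):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Records with the same key always get the same partition number,
// so all values for that key end up at a single reducer.
public class FirstCharPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String k = key.toString();
        int firstChar = k.isEmpty() ? 0 : k.charAt(0);
        return firstChar % numPartitions;
    }
}

It would be registered in the driver with job.setPartitionerClass(FirstCharPartitioner.class).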

Map-only jobs:

Some jobs can be executed in parallel with the output generated by the map functions alone; these are known as map-only jobs. We can make a job map-only by setting the number of reducers to 0 (on the command line while running the job, or through the API in the driver class).

For map-only jobs, the output of the map functions is written (and replicated) to HDFS.
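The driver-side change for a map-only job is tiny (MyFilterMapper is just a hypothetical mapper class):

job.setMapperClass(MyFilterMapper.class); // hypothetical mapper that does all the work
job.setNumReduceTasks(0);                 // no reduce phase, no shuffle/sort; map output goes straight to HDFS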