By default, Hadoop launches a new JVM for each map or reduce task and runs the tasks in parallel and in isolation. When a task has a long initialization phase but its map/reduce method itself completes in just a few seconds, spawning a new JVM for every task is overkill.
Fortunately, the MapReduce framework provides an option to reuse the JVM. With JVM reuse, the tasks sharing a JVM run serially within it instead of in parallel. The mapred.job.reuse.jvm.num.tasks property sets the maximum number of tasks from a single job that will be executed in one JVM. The default value is 1.
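For example, JVM reuse can be enabled in mapred-site.xml (the value 5 here is illustrative, not a recommendation):

```
<!-- mapred-site.xml: allow up to 5 tasks of a single job to share one JVM -->
<property>
  <name>mapred.job.reuse.jvm.num.tasks</name>
  <value>5</value>
</property>
```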
Note: This property counts tasks, not map tasks or reduce tasks separately, so it applies to both. Tasks from different jobs will always run in separate JVMs; reuse applies only to tasks of a single job.
We can set the value to -1 to indicate that all the tasks of a job may run in the same JVM.
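The same setting can also be applied per job in the driver code. A sketch using the classic org.apache.hadoop.mapred API, which exposes a setter for this property (the jobClass parameter is a placeholder for your own driver class):

```java
import org.apache.hadoop.mapred.JobConf;

public class JvmReuseConfig {
    // Sketch: enable JVM reuse for a single job via the old mapred API.
    public static JobConf configure(Class<?> jobClass) {
        JobConf conf = new JobConf(jobClass);
        conf.setNumTasksToExecutePerJvm(-1); // -1: all of this job's tasks may share one JVM
        // equivalent to: conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);
        return conf;
    }
}
```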
JVM and HotSpot:
When your code runs on the JVM for a long time, HotSpot identifies the frequently executed sections and dynamically compiles their Java bytecode into native machine code. If your process only runs for a minute or so, it cannot benefit from this.
So, when we reuse the same JVM, HotSpot compiles those hot spots (performance-critical sections) into native machine code, which improves performance. This can be really useful for long-running tasks.
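Beyond JIT warm-up, JVM reuse also lets per-JVM initialization (loading a dictionary, a model file, etc.) survive across tasks. A minimal plain-Java sketch of the idea — the Mapper-like class and the "tasks" loop are illustrative stand-ins, not Hadoop API:

```java
public class ReuseSketch {
    static int heavyInitCount = 0;

    // Stand-in for a Mapper whose class-level setup is expensive.
    static class HeavyMapper {
        // Static initializer: runs once per JVM, no matter how many
        // tasks (instances) the JVM later executes.
        static { heavyInitCount++; }
        void map(String record) { /* fast per-record work */ }
    }

    // Simulate n tasks running serially in one reused JVM;
    // returns how many times the heavy setup actually ran.
    public static int runTasks(int n) {
        for (int t = 0; t < n; t++) {
            new HeavyMapper().map("record-" + t);
        }
        return heavyInitCount;
    }

    public static void main(String[] args) {
        // With reuse, the heavy setup runs once even though 5 tasks execute.
        System.out.println("heavy initializations for 5 tasks: " + runTasks(5));
    }
}
```

Without JVM reuse, each of those five tasks would pay the static-initialization cost in its own fresh JVM.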
My personal experience on a 3-node cluster:
- With JVM Reuse : (set to 5) ~ 14 minutes
- Without JVM Reuse : ~ 20 minutes
You are welcome to share your experience.