Shashwat Dubey: Establishing performance benchmarks for your Hadoop cluster

Hadoop clusters exist to make big data manageable, but a cluster also has to be tuned well enough to deliver optimized results and to keep the cost of managing and maintaining it down. With that in mind, I went looking for tests or benchmarks that could be used to certify a Hadoop cluster environment.

Production-ready Hadoop distributions ship with a set of benchmarks that help you tune the cluster environment for the best performance.

TestDFSIO: This helps you measure the IO rate of a Hadoop cluster through three operations: -write, -read and -clean. You configure the number of files and the size of each file in megabytes, and the test measures cluster performance while writing them to HDFS. It generates the data automatically from these inputs. After the write, the -read operation reads the same generated data back and calculates the IO rate and throughput of reading from HDFS. You can run the test repeatedly to collect a set of results and evaluate how performance varies with different cluster configuration parameters while the hardware configuration stays the same.

With TestDFSIO, you can establish a baseline for your Hadoop cluster and tune the cluster against it to get the maximum performance out of it.

You need to run three simple commands to get the benchmark results:

First, run the write operation using the test jar that ships with the Hadoop distribution. It takes two inputs: the number of files and the size of each file (in MB) to generate for testing write capability. The following example generates 100 files of 1024 MB each and reports the throughput, average IO rate, IO rate standard deviation and test execution time of your Hadoop cluster for those inputs.

$ hadoop jar hadoop-*test*.jar TestDFSIO -write -nrFiles 100 -fileSize 1024
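
Before launching a write test it helps to work out how much data it will put on the cluster. The sketch below is plain arithmetic; `dfsio_volume` is a hypothetical helper name, and the replication factor of 3 is an assumption (the usual HDFS default), so substitute your cluster's `dfs.replication` value.

```shell
# dfsio_volume NR_FILES FILE_SIZE_MB [REPLICATION]
# Prints the logical MB written by the test and the raw MB stored once
# HDFS has replicated it.
# REPLICATION defaults to 3 (assumed; check dfs.replication on your cluster).
dfsio_volume() {
  nr_files=$1
  file_size_mb=$2
  replication=${3:-3}
  logical_mb=$((nr_files * file_size_mb))
  raw_mb=$((logical_mb * replication))
  echo "logical_mb=$logical_mb raw_mb=$raw_mb"
}

dfsio_volume 100 1024   # prints: logical_mb=102400 raw_mb=307200
```

So 100 files of 1024 MB amount to 100 GB of data, or roughly 300 GB of raw disk with three replicas. Make sure the cluster has that much free space before running the test.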

Next, run the read operation to test the IO rate while reading the same generated files. It takes the same inputs. The following example shows its use:

$ hadoop jar hadoop-*test*.jar TestDFSIO -read -nrFiles 100 -fileSize 1024
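
When comparing several runs, it is handy to pull the numbers out of the results file instead of reading them by eye. A minimal sketch, assuming the results are appended to a local `TestDFSIO_results.log` and that the summary lines have the form `Throughput mb/sec: <value>` and `Average IO rate mb/sec: <value>`. `dfsio_metric` is a hypothetical helper, and the file name and labels may differ in your Hadoop version, so check them against your own log first.

```shell
# dfsio_metric LABEL [LOGFILE]
# Prints the value of every summary line whose label matches LABEL exactly,
# one value per run, in the order the runs appear in the log.
# LOGFILE defaults to TestDFSIO_results.log in the current directory (assumed).
dfsio_metric() {
  label=$1
  logfile=${2:-TestDFSIO_results.log}
  awk -F': *' -v want="$label" '
    { key = $1; sub(/^[ \t]+/, "", key) }   # strip leading padding from the label
    key == want { print $2 }
  ' "$logfile"
}

# Usage (the label text is an assumption about the log format):
#   dfsio_metric "Throughput mb/sec"
#   dfsio_metric "Average IO rate mb/sec" /path/to/TestDFSIO_results.log
```

Because every run appends to the same file, the output has one line per run, which makes it easy to line up results from different configuration settings.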

Finally, after the write and read operations, clean up the test data on your cluster with the following command:

$ hadoop jar hadoop-*test*.jar TestDFSIO -clean
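
The three steps can also be chained so that a full cycle always runs with the same inputs. This is only a sketch: `dfsio_cycle` and the `DRY_RUN` switch are hypothetical names introduced here, the jar path is a placeholder, and the last step uses the `-clean` operation named in the TestDFSIO description above.

```shell
# dfsio_cycle JAR NR_FILES FILE_SIZE_MB
# Runs the write, read and clean phases in order and stops at the first failure.
# Set DRY_RUN=1 to print the commands instead of executing them.
dfsio_cycle() {
  jar=$1
  nr_files=$2
  file_size_mb=$3
  for phase in write read; do
    ${DRY_RUN:+echo} hadoop jar "$jar" TestDFSIO "-$phase" \
      -nrFiles "$nr_files" -fileSize "$file_size_mb" || return 1
  done
  ${DRY_RUN:+echo} hadoop jar "$jar" TestDFSIO -clean
}

# Preview the commands without touching the cluster
# (the jar path is a placeholder; use the test jar from your distribution):
DRY_RUN=1 dfsio_cycle /path/to/hadoop-test.jar 100 1024
```

Note that if the write or read phase fails, the function returns before cleaning up, so the generated files stay on HDFS for inspection and have to be removed by hand afterwards.
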

nnBench: The performance of the master is as important as its role in the Hadoop cluster. nnBench gives a detailed report on the performance of the NameNode (the master of the Hadoop cluster). This benchmark tests whether your NameNode, on a given set of hardware, can handle a Hadoop cluster with multiple DataNodes. It generates many simultaneous requests that perform file operations on HDFS, and it lets you define the number of map and reduce tasks used to check NameNode performance.

You can list the options available for nnbench with the following command:

$ hadoop jar hadoop-*test*.jar nnbench -help
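
For orientation, a create-and-write run might be put together as in the sketch below. The option names are the ones I would expect in the nnbench usage text of Hadoop 1.x releases, but treat them as assumptions and confirm them against the `-help` output of your own distribution. The values and the base directory are arbitrary placeholders, and `nnbench_cmd` is a hypothetical helper that only prints the command.

```shell
# nnbench_cmd JAR MAPS REDUCES NUM_FILES
# Prints (does not run) an nnbench create_write command so it can be
# reviewed first. Fixed values such as the base directory are illustrative.
nnbench_cmd() {
  jar=$1
  maps=$2
  reduces=$3
  num_files=$4
  echo hadoop jar "$jar" nnbench \
    -operation create_write \
    -maps "$maps" -reduces "$reduces" \
    -numberOfFiles "$num_files" \
    -bytesToWrite 0 -blockSize 1 \
    -replicationFactorPerFile 3 \
    -readFileAfterOpen true \
    -baseDir /benchmarks/NNBench
}

# The jar path is a placeholder; use the test jar from your distribution.
nnbench_cmd /path/to/hadoop-test.jar 12 6 1000
```

Writing zero bytes per file keeps the DataNodes out of the picture, so the run mostly exercises the NameNode's metadata handling, which is what this benchmark is meant to measure.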

Hadoop distributions ship with other benchmarks as well, which I will cover in upcoming posts. In the next post, I will share my experience with loadgen and mrbench.

Shashwat Dubey

Sunday, 21 April 2013
