Friday, 13 September 2013

Building Spark with Maven

Prerequisites – 

Install the following and update the environment variables accordingly:
  1. Java
    1. Download Java and install it on your Linux box 
    2. Update environment variable with JAVA_HOME and PATH
  2. Scala
    1. Download Scala using command: wget http://www.scala-lang.org/files/archive/scala-2.10.1.tgz
    2. Extract Scala using command: tar -xzf scala-2.10.1.tgz
    3. Update environment variable for SCALA_HOME and PATH
  3. Maven
    1. Download Maven using command:  wget http://apache.tradebit.com/pub/maven/maven-3/3.2.3/binaries/apache-maven-3.2.3-bin.tar.gz
    2. Extract Maven using command: tar -xzf apache-maven-3.2.3-bin.tar.gz
    3. Update environment variable for MAVEN_HOME and PATH
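The three environment-variable steps above can be combined into one profile snippet. The install locations below are only examples – substitute wherever you actually installed Java and extracted each tarball:

```shell
# Append to ~/.bash_profile (CentOS) -- paths here are hypothetical examples
export JAVA_HOME=/usr/lib/jvm/java          # wherever your JDK is installed
export SCALA_HOME=/opt/scala-2.10.1         # extracted Scala tarball
export MAVEN_HOME=/opt/apache-maven-3.2.3   # extracted Maven tarball
export PATH=$PATH:$JAVA_HOME/bin:$SCALA_HOME/bin:$MAVEN_HOME/bin
```

After adding these lines, start a new shell (or source the file) so the variables take effect.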
Spark builds with Java 6+ and Maven 3.0.4. You need to configure MAVEN_OPTS so Maven can use more memory than usual; otherwise the build will fail with an out-of-memory error (typically a PermGen space error). 



In CentOS – run the command “nano ~/.bash_profile” 
In Ubuntu – run the command “pico /etc/environment”

And add a new line – 

export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"

If you want to read data from HDFS, then you have to build Spark against the specific version of HDFS in your environment. This is done through the “hadoop.version” property. 

If you don’t specify a version, it will build against Hadoop 1.0.4 by default. 

Build with the specific version you need – 
 
  1. Apache Hadoop 1.2.1:
     mvn -Dhadoop.version=1.2.1 -DskipTests clean package
  2. Cloudera CDH 4.2.0 with MapReduce v1:
     mvn -Dhadoop.version=2.0.0-mr1-cdh4.2.0 -DskipTests clean package
  3. Apache Hadoop 0.23.x:
     mvn -Phadoop-0.23 -Dhadoop.version=0.23.7 -DskipTests clean package
  4. Apache Hadoop 2.0.5-alpha (with YARN):
     mvn -Pyarn-alpha -Dhadoop.version=2.0.5-alpha -DskipTests clean package
  5. Cloudera CDH 4.2.0 (with YARN):
     mvn -Pyarn-alpha -Dhadoop.version=2.0.0-cdh4.2.0 -DskipTests clean package
  6. Apache Hadoop 0.23.x (with YARN):
     mvn -Pyarn-alpha -Phadoop-0.23 -Dhadoop.version=0.23.7 -DskipTests clean package
  7. Apache Hadoop 2.2.x (with YARN):
     mvn -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0 -DskipTests clean package
  8. Apache Hadoop 2.4.x:
     mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package
  9. Different versions of HDFS and YARN:
     mvn -Pyarn-alpha -Phadoop-2.3 -Dhadoop.version=2.3.0 -Dyarn.version=0.23.7 -DskipTests clean package
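Whichever command you pick, a successful build produces an assembly JAR inside the Spark source tree. A quick way to confirm the build finished – the path below assumes a Scala 2.10 build and may differ by Spark version:

```shell
# From the Spark source directory, list the fat jar produced by
# 'mvn ... clean package' (directory name assumed for a Scala 2.10 build)
ls assembly/target/scala-2.10/spark-assembly-*.jar
```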

Building with Hive and JDBC support

To enable Hive integration for Spark SQL, along with its JDBC server and CLI, add the -Phive profile to your existing build options.
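For example, combining the Hive profile with one of the Hadoop builds above (the Hadoop version shown here is just an illustration – use whichever matches your cluster):

```shell
# Hadoop 2.4 build with Hive/JDBC support added via the -Phive profile
mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -DskipTests clean package
```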


Spark Tests in Maven

Spark uses the ScalaTest Maven plugin, which runs tests by default. Some tests require Spark to be packaged first.
So always package it first using the command – mvn -DskipTests clean package

Then run the tests using the command – mvn test
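If you only want a single suite rather than the whole test run, ScalaTest’s wildcardSuites property can narrow it down – the suite name below is just an example:

```shell
# Run one ScalaTest suite by its fully qualified class name
# (requires the package step above to have been run first)
mvn -DwildcardSuites=org.apache.spark.repl.ReplSuite test
```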