Friday, 13 September 2013

Building Spark with Maven

Prerequisites – 

Install the following and update the environment variables accordingly:
  1. Java
    1. Download Java and install it on your Linux box 
    2. Update environment variable with JAVA_HOME and PATH
  2. Scala
    1. Download Scala using command: wget http://www.scala-lang.org/files/archive/scala-2.10.1.tgz
    2. Extract Scala using command: tar -xzf scala-2.10.1.tgz
    3. Update environment variable for SCALA_HOME and PATH
  3. Maven
    1. Download Maven using command:  wget http://apache.tradebit.com/pub/maven/maven-3/3.2.3/binaries/apache-maven-3.2.3-bin.tar.gz
    2. Extract Maven using command: tar -xzf apache-maven-3.2.3-bin.tar.gz
    3. Update environment variable for MAVEN_HOME and PATH
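The three environment-variable steps above can be combined into one profile snippet. The install locations below are only examples – substitute wherever you actually installed Java and extracted each tarball:

```shell
# Append to ~/.bash_profile (CentOS) -- paths here are hypothetical examples
export JAVA_HOME=/usr/lib/jvm/java          # wherever your JDK is installed
export SCALA_HOME=/opt/scala-2.10.1         # extracted Scala tarball
export MAVEN_HOME=/opt/apache-maven-3.2.3   # extracted Maven tarball
export PATH=$PATH:$JAVA_HOME/bin:$SCALA_HOME/bin:$MAVEN_HOME/bin
```

After adding these lines, start a new shell (or source the file) so the variables take effect.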
Spark builds with Java 6+ and Maven 3.0.4. You need to configure MAVEN_OPTS so Maven can use more memory than usual; otherwise the build will fail with an out-of-memory error (typically a PermGen space error). 



In CentOS – run the command “nano ~/.bash_profile” 
In Ubuntu – run the command “pico /etc/environment”

And add a new line – 

export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"

If you want to read data from HDFS, then you have to build Spark against the specific version of HDFS in your environment. This is done through the “hadoop.version” property. 

If you don’t specify a version, it will build against Hadoop 1.0.4 by default. 

Build with the specific version you need – 
 
  1. Apache Hadoop 1.2.1:
     mvn -Dhadoop.version=1.2.1 -DskipTests clean package
  2. Cloudera CDH 4.2.0 with MapReduce v1:
     mvn -Dhadoop.version=2.0.0-mr1-cdh4.2.0 -DskipTests clean package
  3. Apache Hadoop 0.23.x:
     mvn -Phadoop-0.23 -Dhadoop.version=0.23.7 -DskipTests clean package
  4. Apache Hadoop 2.0.5-alpha (with YARN):
     mvn -Pyarn-alpha -Dhadoop.version=2.0.5-alpha -DskipTests clean package
  5. Cloudera CDH 4.2.0 (with YARN):
     mvn -Pyarn-alpha -Dhadoop.version=2.0.0-cdh4.2.0 -DskipTests clean package
  6. Apache Hadoop 0.23.x (with YARN):
     mvn -Pyarn-alpha -Phadoop-0.23 -Dhadoop.version=0.23.7 -DskipTests clean package
  7. Apache Hadoop 2.2.x (with YARN):
     mvn -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0 -DskipTests clean package
  8. Apache Hadoop 2.4.x:
     mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package
  9. Different versions of HDFS and YARN:
     mvn -Pyarn-alpha -Phadoop-2.3 -Dhadoop.version=2.3.0 -Dyarn.version=0.23.7 -DskipTests clean package
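Whichever command you pick, a successful build produces an assembly JAR inside the Spark source tree. A quick way to confirm the build finished – the path below assumes a Scala 2.10 build and may differ by Spark version:

```shell
# From the Spark source directory, list the fat jar produced by
# 'mvn ... clean package' (directory name assumed for a Scala 2.10 build)
ls assembly/target/scala-2.10/spark-assembly-*.jar
```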

Building with Hive and JDBC support

To enable Hive integration for Spark SQL, along with its JDBC server and CLI, add the -Phive profile to your existing build options.
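For example, combining the Hive profile with one of the Hadoop builds above (the Hadoop version shown here is just an illustration – use whichever matches your cluster):

```shell
# Hadoop 2.4 build with Hive/JDBC support added via the -Phive profile
mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -DskipTests clean package
```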


Spark Tests in Maven

Spark uses the ScalaTest Maven plugin, which runs tests by default. Some tests require Spark to be packaged first.
So always package it first using the command – mvn -DskipTests clean package

Then run the tests using the command – mvn test
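If you only want a single suite rather than the whole test run, ScalaTest’s wildcardSuites property can narrow it down – the suite name below is just an example:

```shell
# Run one ScalaTest suite by its fully qualified class name
# (requires the package step above to have been run first)
mvn -DwildcardSuites=org.apache.spark.repl.ReplSuite test
```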