Friday, 13 September 2013

Building Spark with Maven

Prerequisites – 

Install the following and update the environment variables 
  1. Java
    1. Download Java and install it on your Linux box 
    2. Update the environment variables JAVA_HOME and PATH
  2. Scala
    1. Download Scala using the command:  wget http://www.scala-lang.org/files/archive/scala-2.10.1.tgz
    2. Extract Scala using the command: tar -xzf scala-2.10.1.tgz
    3. Update the environment variables SCALA_HOME and PATH
  3. Maven
    1. Download Maven using the command:  wget http://apache.tradebit.com/pub/maven/maven-3/3.2.3/binaries/apache-maven-3.2.3-bin.tar.gz
    2. Extract Maven using the command: tar -xzf apache-maven-3.2.3-bin.tar.gz
    3. Update the environment variables MAVEN_HOME and PATH
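The environment-variable updates in the steps above can be sketched as lines in your shell profile (a minimal sketch: the install locations under $HOME and the JAVA_HOME path are assumptions, adjust to wherever you extracted the archives):

```shell
# Assumes the archives above were extracted under $HOME (adjust paths as needed)
export JAVA_HOME="${JAVA_HOME:-/usr/lib/jvm/java}"   # hypothetical JDK path
export SCALA_HOME="$HOME/scala-2.10.1"
export MAVEN_HOME="$HOME/apache-maven-3.2.3"
export PATH="$JAVA_HOME/bin:$SCALA_HOME/bin:$MAVEN_HOME/bin:$PATH"
```

After adding these lines, run source ~/.bash_profile (or open a new shell) so they take effect.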
Spark requires Java 6+ and Maven 3.0.4 to build. You need to configure MAVEN_OPTS so Maven can use more memory than usual; otherwise the build will fail with an out-of-memory error.


In CentOS – run the command “nano ~/.bash_profile” 
In Ubuntu – run the command “nano ~/.bashrc”

And add a new line – 

export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"

If you want to read data from HDFS, you have to build Spark against the specific version of HDFS in your environment, which can be done through the “hadoop.version” property. 

If you don’t specify one, it will build against Hadoop 1.0.4 by default. 

Building against a specific version depending on your need – 

  1. Apache Hadoop 1.2.1
     mvn -Dhadoop.version=1.2.1 -DskipTests clean package
  2. Cloudera CDH 4.2.0 with MapReduce v1
     mvn -Dhadoop.version=2.0.0-mr1-cdh4.2.0 -DskipTests clean package
  3. Apache Hadoop 0.23.x
     mvn -Phadoop-0.23 -Dhadoop.version=0.23.7 -DskipTests clean package
  4. Apache Hadoop 2.0.5-alpha (with YARN)
     mvn -Pyarn-alpha -Dhadoop.version=2.0.5-alpha -DskipTests clean package
  5. Cloudera CDH 4.2.0 (with YARN)
     mvn -Pyarn-alpha -Dhadoop.version=2.0.0-cdh4.2.0 -DskipTests clean package
  6. Apache Hadoop 0.23.x (with YARN)
     mvn -Pyarn-alpha -Phadoop-0.23 -Dhadoop.version=0.23.7 -DskipTests clean package
  7. Apache Hadoop 2.2.x (with YARN)
     mvn -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0 -DskipTests clean package
  8. Apache Hadoop 2.4.x
     mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package
  9. Different versions of HDFS and YARN
     mvn -Pyarn-alpha -Phadoop-2.3 -Dhadoop.version=2.3.0 -Dyarn.version=0.23.7 -DskipTests clean package

Building with Hive and JDBC support

To enable Hive integration for Spark SQL, along with its JDBC server and CLI, add the -Phive profile to your existing build options.
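To take one combination from the table above, a Hadoop 2.4 build with Hive support could look like this (the Hadoop flags are just an example; pair -Phive with whichever build options match your environment):

```shell
mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -DskipTests clean package
```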


Spark Tests in Maven

Tests are run via the ScalaTest Maven plugin, which runs by default. Some tests require Spark to be packaged first,
so always package first using the command – mvn -DskipTests clean package

And then run the tests using the command – mvn test
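To run a single suite rather than the whole test set, the ScalaTest Maven plugin accepts a suite filter; a hedged sketch (the suite name here is only an illustration, and the exact property depends on your plugin version):

```shell
mvn -DwildcardSuites=org.apache.spark.repl.ReplSuite test
```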

4 comments:

  1. Kafka ---

    1. Start the ZooKeeper server
    2. Then start the Kafka server
    Process --

    How to start
    1. Understand the ZooKeeper configuration --- for ex - ZK nodes and ZK server
    2. Identify the ZooKeeper location (ls, cd)
    3. Move to ZooKeeper (cd zookeeper) -- cd conf
    4. Open zoo.cfg (command - nano)
    5. If nano is not available then install it…
    6. The install command is: yum install nano (go to root) or use sudo yum install nano (on Ubuntu - apt-get install nano)
    7. nano zoo.cfg ---
    tickTime=2000
    dataDir=/var/zoo --[IMP] - ZooKeeper data/logging directory
    clientPort=2181 -- port for ZooKeeper clients
    8. nano zoo_sample.cfg ---- sample config file -- this is the configuration file which has all the relevant information
    a. Understanding the config file --- tickTime: polling time; a cluster (many computers) consists of servers (machines, e.g. EC2) and nodes; the ZK server holds metadata/information, and ZK nodes act as managers for the data nodes

    dataDir=/tmp/zookeeper
    # the port at which the clients will connect
    clientPort=2181
    ----------------------------------------------------------------------------------------------------------
    Kafka: https://kafka.apache.org/quickstart

    1. Broker - node (machine)
    2. Topic - messages (specific to one stream)
    3. Producer - sends messages to a topic

    /kafka_2.10-0.8.2.1/bin (Executable files)
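
    The "start ZooKeeper, then start Kafka" steps above correspond to the standard quickstart commands, run from the Kafka install directory (e.g. /kafka_2.10-0.8.2.1) and assuming the bundled sample configs:

    ```shell
    # Start ZooKeeper first, using the packaged sample config
    bin/zookeeper-server-start.sh config/zookeeper.properties

    # Then, in a second terminal, start the Kafka broker
    bin/kafka-server-start.sh config/server.properties
    ```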

  2. Commands
    ls -ltr : output sorted by timestamp
    cat log.txt : print the file to the console; cat log.txt | tail (last part)
    cat log.txt | head (first part)

    tail -200f log.e1 (real-time logging: follow the file, starting from the last 200 lines)
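
    A quick self-contained illustration of the head/tail usage above (the file name is just an example):

    ```shell
    # Create a sample 100-line log file
    seq 1 100 > /tmp/log.txt

    # First part of the file
    head -n 3 /tmp/log.txt    # prints 1 2 3 (one per line)

    # Last part of the file
    tail -n 3 /tmp/log.txt    # prints 98 99 100 (one per line)
    ```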

  3. benchmark-commands.txt
    Producer

    Setup
    bin/kafka-topics.sh --zookeeper esv4-hcl197.grid.linkedin.com:2181 --create --topic test-rep-one --partitions 6 --replication-factor 1
    bin/kafka-topics.sh --zookeeper esv4-hcl197.grid.linkedin.com:2181 --create --topic test --partitions 6 --replication-factor 3

    Single thread, no replication

    bin/kafka-run-class.sh org.apache.kafka.clients.tools.ProducerPerformance test7 50000000 100 -1 acks=1 bootstrap.servers=esv4-hcl198.grid.linkedin.com:9092 buffer.memory=67108864 batch.size=8196

    Single-thread, async 3x replication

    bin/kafka-topics.sh --zookeeper esv4-hcl197.grid.linkedin.com:2181 --create --topic test --partitions 6 --replication-factor 3
    bin/kafka-run-class.sh org.apache.kafka.clients.tools.ProducerPerformance test6 50000000 100 -1 acks=1 bootstrap.servers=esv4-hcl198.grid.linkedin.com:9092 buffer.memory=67108864 batch.size=8196

    Single-thread, sync 3x replication

    bin/kafka-run-class.sh org.apache.kafka.clients.tools.ProducerPerformance test 50000000 100 -1 acks=-1 bootstrap.servers=esv4-hcl198.grid.linkedin.com:9092 buffer.memory=67108864 batch.size=64000

    Three Producers, 3x async replication
    bin/kafka-run-class.sh org.apache.kafka.clients.tools.ProducerPerformance test 50000000 100 -1 acks=1 bootstrap.servers=esv4-hcl198.grid.linkedin.com:9092 buffer.memory=67108864 batch.size=8196

    Throughput Versus Stored Data

    bin/kafka-run-class.sh org.apache.kafka.clients.tools.ProducerPerformance test 50000000000 100 -1 acks=1 bootstrap.servers=esv4-hcl198.grid.linkedin.com:9092 buffer.memory=67108864 batch.size=8196

    Effect of message size

    for i in 10 100 1000 10000 100000;
    do
    echo ""
    echo $i
    bin/kafka-run-class.sh org.apache.kafka.clients.tools.ProducerPerformance test $((1000*1024*1024/$i)) $i -1 acks=1 bootstrap.servers=esv4-hcl198.grid.linkedin.com:9092 buffer.memory=67108864 batch.size=128000
    done;

    Consumer
    Consumer throughput

    bin/kafka-consumer-perf-test.sh --zookeeper esv4-hcl197.grid.linkedin.com:2181 --messages 50000000 --topic test --threads 1

    3 Consumers

    On three servers, run:
    bin/kafka-consumer-perf-test.sh --zookeeper esv4-hcl197.grid.linkedin.com:2181 --messages 50000000 --topic test --threads 1

    End-to-end Latency

    bin/kafka-run-class.sh kafka.tools.TestEndToEndLatency esv4-hcl198.grid.linkedin.com:9092 esv4-hcl197.grid.linkedin.com:2181 test 5000

    Producer and consumer

    bin/kafka-run-class.sh org.apache.kafka.clients.tools.ProducerPerformance test 50000000 100 -1 acks=1 bootstrap.servers=esv4-hcl198.grid.linkedin.com:9092 buffer.memory=67108864 batch.size=8196

    bin/kafka-consumer-perf-test.sh --zookeeper esv4-hcl197.grid.linkedin.com:2181 --messages 50000000 --topic test --threads 1
