Friday, 13 September 2013

Building Spark with Maven

Prerequisites – 

Install the following and update the environment variables 
  1. Java
    1. Download Java and install it on your Linux box 
    2. Update the environment variables JAVA_HOME and PATH
  2. Scala
    1. Download Scala using the command:  wget http://www.scala-lang.org/files/archive/scala-2.10.1.tgz
    2. Extract Scala using the command: tar -xzf scala-2.10.1.tgz
    3. Update the environment variables SCALA_HOME and PATH
  3. Maven
    1. Download Maven using the command:  wget http://apache.tradebit.com/pub/maven/maven-3/3.2.3/binaries/apache-maven-3.2.3-bin.tar.gz
    2. Extract Maven using the command: tar -xzf apache-maven-3.2.3-bin.tar.gz
    3. Update the environment variables MAVEN_HOME and PATH
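The environment-variable updates in the steps above can be sketched as lines in your shell profile (a minimal sketch: the install locations under $HOME and the JAVA_HOME path are assumptions, adjust to wherever you extracted the archives):

```shell
# Assumes the archives above were extracted under $HOME (adjust paths as needed)
export JAVA_HOME="${JAVA_HOME:-/usr/lib/jvm/java}"   # hypothetical JDK path
export SCALA_HOME="$HOME/scala-2.10.1"
export MAVEN_HOME="$HOME/apache-maven-3.2.3"
export PATH="$JAVA_HOME/bin:$SCALA_HOME/bin:$MAVEN_HOME/bin:$PATH"
```

After adding these lines, run source ~/.bash_profile (or open a new shell) so they take effect.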
Spark requires Java 6+ and Maven 3.0.4 to build. You need to configure MAVEN_OPTS so Maven can use more memory than usual; otherwise the build will fail with an out-of-memory error.


In CentOS – run the command “nano ~/.bash_profile” 
In Ubuntu – run the command “nano ~/.bashrc”

And add a new line – 

export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"

If you want to read data from HDFS, you have to build Spark against the specific version of HDFS in your environment, which can be done through the “hadoop.version” property. 

If you don’t specify one, it will build against Hadoop 1.0.4 by default. 

Building against a specific version depending on your need – 

  1. Apache Hadoop 1.2.1
     mvn -Dhadoop.version=1.2.1 -DskipTests clean package
  2. Cloudera CDH 4.2.0 with MapReduce v1
     mvn -Dhadoop.version=2.0.0-mr1-cdh4.2.0 -DskipTests clean package
  3. Apache Hadoop 0.23.x
     mvn -Phadoop-0.23 -Dhadoop.version=0.23.7 -DskipTests clean package
  4. Apache Hadoop 2.0.5-alpha (with YARN)
     mvn -Pyarn-alpha -Dhadoop.version=2.0.5-alpha -DskipTests clean package
  5. Cloudera CDH 4.2.0 (with YARN)
     mvn -Pyarn-alpha -Dhadoop.version=2.0.0-cdh4.2.0 -DskipTests clean package
  6. Apache Hadoop 0.23.x (with YARN)
     mvn -Pyarn-alpha -Phadoop-0.23 -Dhadoop.version=0.23.7 -DskipTests clean package
  7. Apache Hadoop 2.2.x (with YARN)
     mvn -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0 -DskipTests clean package
  8. Apache Hadoop 2.4.x
     mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package
  9. Different versions of HDFS and YARN
     mvn -Pyarn-alpha -Phadoop-2.3 -Dhadoop.version=2.3.0 -Dyarn.version=0.23.7 -DskipTests clean package

Building with Hive and JDBC support

To enable Hive integration for Spark SQL, along with its JDBC server and CLI, add the -Phive profile to your existing build options.
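To take one combination from the table above, a Hadoop 2.4 build with Hive support could look like this (the Hadoop flags are just an example; pair -Phive with whichever build options match your environment):

```shell
mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -DskipTests clean package
```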


Spark Tests in Maven

Tests are run via the ScalaTest Maven plugin, which runs by default. Some tests require Spark to be packaged first,
so always package first using the command – mvn -DskipTests clean package

And then run the tests using the command – mvn test
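To run a single suite rather than the whole test set, the ScalaTest Maven plugin accepts a suite filter; a hedged sketch (the suite name here is only an illustration, and the exact property depends on your plugin version):

```shell
mvn -DwildcardSuites=org.apache.spark.repl.ReplSuite test
```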

4 comments:

  1. Kafka ---

    1. Start the ZooKeeper server
    2. Then start the Kafka server
    Process --

    How to start
    1. Understand the ZooKeeper configuration --- for ex - ZK nodes and ZK server
    2. Identify the ZooKeeper location (ls, cd)
    3. Move to ZooKeeper (cd zookeeper) -- cd conf
    4. Open zoo.cfg (command - nano)
    5. If nano is not available then install it…
    6. The install command is: yum install nano (go to root) or use sudo yum install nano (on Ubuntu - apt-get install nano)
    7. nano zoo.cfg ---
    tickTime=2000
    dataDir=/var/zoo --[IMP] - ZooKeeper data/logging directory
    clientPort=2181 -- port for ZooKeeper clients
    8. nano zoo_sample.cfg ---- sample config file -- this is the configuration file which has all the relevant information
    a. Understanding the config file --- tickTime: polling time; a cluster (many computers) consists of servers (machines, e.g. EC2) and nodes; the ZK server holds metadata/information, and ZK nodes act as managers for the data nodes

    dataDir=/tmp/zookeeper
    # the port at which the clients will connect
    clientPort=2181
    ----------------------------------------------------------------------------------------------------------
    Kafka: https://kafka.apache.org/quickstart

    1. Broker - node (machine)
    2. Topic - messages (specific to one stream)
    3. Producer - sends messages to a topic

    /kafka_2.10-0.8.2.1/bin (Executable files)
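
    The "start ZooKeeper, then start Kafka" steps above correspond to the standard quickstart commands, run from the Kafka install directory (e.g. /kafka_2.10-0.8.2.1) and assuming the bundled sample configs:

    ```shell
    # Start ZooKeeper first, using the packaged sample config
    bin/zookeeper-server-start.sh config/zookeeper.properties

    # Then, in a second terminal, start the Kafka broker
    bin/kafka-server-start.sh config/server.properties
    ```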

  2. Commands
    ls -ltr : output sorted by timestamp
    cat log.txt : print the file to the console; cat log.txt | tail (last part)
    cat log.txt | head (first part)

    tail -200f log.e1 (real-time logging: follow the file, starting from the last 200 lines)
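
    A quick self-contained illustration of the head/tail usage above (the file name is just an example):

    ```shell
    # Create a sample 100-line log file
    seq 1 100 > /tmp/log.txt

    # First part of the file
    head -n 3 /tmp/log.txt    # prints 1 2 3 (one per line)

    # Last part of the file
    tail -n 3 /tmp/log.txt    # prints 98 99 100 (one per line)
    ```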

  3. benchmark-commands.txt
    Producer

    Setup
    bin/kafka-topics.sh --zookeeper esv4-hcl197.grid.linkedin.com:2181 --create --topic test-rep-one --partitions 6 --replication-factor 1
    bin/kafka-topics.sh --zookeeper esv4-hcl197.grid.linkedin.com:2181 --create --topic test --partitions 6 --replication-factor 3

    Single thread, no replication

    bin/kafka-run-class.sh org.apache.kafka.clients.tools.ProducerPerformance test7 50000000 100 -1 acks=1 bootstrap.servers=esv4-hcl198.grid.linkedin.com:9092 buffer.memory=67108864 batch.size=8196

    Single-thread, async 3x replication

    bin/kafka-topics.sh --zookeeper esv4-hcl197.grid.linkedin.com:2181 --create --topic test --partitions 6 --replication-factor 3
    bin/kafka-run-class.sh org.apache.kafka.clients.tools.ProducerPerformance test6 50000000 100 -1 acks=1 bootstrap.servers=esv4-hcl198.grid.linkedin.com:9092 buffer.memory=67108864 batch.size=8196

    Single-thread, sync 3x replication

    bin/kafka-run-class.sh org.apache.kafka.clients.tools.ProducerPerformance test 50000000 100 -1 acks=-1 bootstrap.servers=esv4-hcl198.grid.linkedin.com:9092 buffer.memory=67108864 batch.size=64000

    Three Producers, 3x async replication
    bin/kafka-run-class.sh org.apache.kafka.clients.tools.ProducerPerformance test 50000000 100 -1 acks=1 bootstrap.servers=esv4-hcl198.grid.linkedin.com:9092 buffer.memory=67108864 batch.size=8196

    Throughput Versus Stored Data

    bin/kafka-run-class.sh org.apache.kafka.clients.tools.ProducerPerformance test 50000000000 100 -1 acks=1 bootstrap.servers=esv4-hcl198.grid.linkedin.com:9092 buffer.memory=67108864 batch.size=8196

    Effect of message size

    for i in 10 100 1000 10000 100000;
    do
    echo ""
    echo $i
    bin/kafka-run-class.sh org.apache.kafka.clients.tools.ProducerPerformance test $((1000*1024*1024/$i)) $i -1 acks=1 bootstrap.servers=esv4-hcl198.grid.linkedin.com:9092 buffer.memory=67108864 batch.size=128000
    done;

    Consumer
    Consumer throughput

    bin/kafka-consumer-perf-test.sh --zookeeper esv4-hcl197.grid.linkedin.com:2181 --messages 50000000 --topic test --threads 1

    3 Consumers

    On three servers, run:
    bin/kafka-consumer-perf-test.sh --zookeeper esv4-hcl197.grid.linkedin.com:2181 --messages 50000000 --topic test --threads 1

    End-to-end Latency

    bin/kafka-run-class.sh kafka.tools.TestEndToEndLatency esv4-hcl198.grid.linkedin.com:9092 esv4-hcl197.grid.linkedin.com:2181 test 5000

    Producer and consumer

    bin/kafka-run-class.sh org.apache.kafka.clients.tools.ProducerPerformance test 50000000 100 -1 acks=1 bootstrap.servers=esv4-hcl198.grid.linkedin.com:9092 buffer.memory=67108864 batch.size=8196

    bin/kafka-consumer-perf-test.sh --zookeeper esv4-hcl197.grid.linkedin.com:2181 --messages 50000000 --topic test --threads 1
