Prerequisites –
Install the following and update your environment variables:
- Java
- Download Java and install it on your Linux box
- Update the environment variables JAVA_HOME and PATH
- Scala
- Download Scala using the command: wget http://www.scala-lang.org/files/archive/scala-2.10.1.tgz
- Extract Scala using the command: tar -xzf scala-2.10.1.tgz
- Update the environment variables SCALA_HOME and PATH
- Maven
- Download Maven using the command: wget http://apache.tradebit.com/pub/maven/maven-3/3.2.3/binaries/apache-maven-3.2.3-bin.tar.gz
- Extract Maven using the command: tar -xzf apache-maven-3.2.3-bin.tar.gz
- Update the environment variables MAVEN_HOME and PATH
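The environment-variable updates above can be sketched as lines appended to ~/.bash_profile. The install locations below are examples only – adjust them to wherever you actually installed Java and extracted the Scala and Maven archives:

```shell
# Example locations -- adjust to your actual install/extract paths
export JAVA_HOME=/usr/lib/jvm/java
export SCALA_HOME=$HOME/scala-2.10.1
export MAVEN_HOME=$HOME/apache-maven-3.2.3
# Put each tool's bin directory on the PATH
export PATH=$JAVA_HOME/bin:$SCALA_HOME/bin:$MAVEN_HOME/bin:$PATH
```

After editing, run `source ~/.bash_profile` (or open a new shell) and verify with `java -version`, `scala -version`, and `mvn -version`.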
Spark requires Java 6+ and Maven 3.0.4+ to build. You need to configure MAVEN_OPTS so Maven can use more memory than usual; otherwise the build will fail with an out-of-memory error.
In CentOS – run the command "nano ~/.bash_profile"
In Ubuntu – run the command "sudo nano /etc/environment"
And add a new line –
export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
If you want to read data from HDFS, you have to build Spark against the specific version of HDFS in your environment, which can be done through the "hadoop.version" property.
If you don't specify a version, it will build against Hadoop 1.0.4 by default.
Building against a specific version, depending on your needs –
| S. no | Distribution | Command |
|-------|--------------|---------|
| 1 | Apache Hadoop 1.2.1 | mvn -Dhadoop.version=1.2.1 -DskipTests clean package |
| 2 | Cloudera CDH 4.2.0 with MapReduce v1 | mvn -Dhadoop.version=2.0.0-mr1-cdh4.2.0 -DskipTests clean package |
| 3 | Apache Hadoop 0.23.x | mvn -Phadoop-0.23 -Dhadoop.version=0.23.7 -DskipTests clean package |
| 4 | Apache Hadoop 2.0.5-alpha (with YARN) | mvn -Pyarn-alpha -Dhadoop.version=2.0.5-alpha -DskipTests clean package |
| 5 | Cloudera CDH 4.2.0 (with YARN) | mvn -Pyarn-alpha -Dhadoop.version=2.0.0-cdh4.2.0 -DskipTests clean package |
| 6 | Apache Hadoop 0.23.x (with YARN) | mvn -Pyarn-alpha -Phadoop-0.23 -Dhadoop.version=0.23.7 -DskipTests clean package |
| 7 | Apache Hadoop 2.2.x (with YARN) | mvn -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0 -DskipTests clean package |
| 8 | Apache Hadoop 2.4.x | mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package |
| 9 | Different versions of HDFS and YARN | mvn -Pyarn-alpha -Phadoop-2.3 -Dhadoop.version=2.3.0 -Dyarn.version=0.23.7 -DskipTests clean package |
Building with Hive and JDBC support
To enable Hive integration for Spark SQL, along with its JDBC server and CLI, add the -Phive profile to your existing build options.
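For example, combining the Hadoop 2.4 build options from the table above with the Hive profile (run from the top of the Spark source tree):

```shell
# -Phive adds Spark SQL's Hive integration, JDBC server, and CLI to the build
mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -DskipTests clean package
```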
Spark Tests in Maven
Tests are run via the ScalaTest Maven plugin by default. Some
tests require Spark to be packaged first.
So always package first using the command – mvn
-DskipTests clean package
And then run the tests using the command – mvn test
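If you only want to run a single test suite, the ScalaTest Maven plugin also accepts a wildcard suite selector; the suite name below is just an example:

```shell
# Run only the named suite (example suite name) after packaging
mvn -DwildcardSuites=org.apache.spark.repl.ReplSuite test
```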


Kafka ---
1. Start the ZooKeeper server
2. Then start the Kafka server
Process --
How to Start
1. Understand the ZooKeeper configuration -- for example, ZK nodes and the ZK server
2. Identify the ZooKeeper location (ls, cd)
3. Move to the ZooKeeper directory (cd zookeeper), then cd conf
4. Open zoo.cfg (command: nano)
5. If nano is not available, install it
6. To install it, run: yum install nano (as root), or sudo yum install nano; on Ubuntu: apt-get install nano
7. nano zoo.cfg ---
tickTime=2000
dataDir=/var/zoo --[IMP] ZooKeeper data/logging directory
clientPort=2181 --port ZooKeeper listens on
8. nano zoo_sample.cfg ---- the sample config file, which has all the relevant information
a. Understanding the config file --- tickTime is the polling/heartbeat interval; a cluster is many machines (servers/nodes, e.g. EC2 instances); the ZK server holds the metadata, and ZK nodes act as managers for the data nodes
dataDir=/tmp/zookeeper
# the port at which the clients will connect
clientPort=2181
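Putting the settings above together, a minimal zoo.cfg can be written and checked like this (the /tmp path is just an example; assuming a standard ZooKeeper layout, the server is then started with bin/zkServer.sh start):

```shell
# Write a minimal ZooKeeper config (example path)
cat > /tmp/zoo.cfg <<'EOF'
tickTime=2000
dataDir=/var/zoo
clientPort=2181
EOF
# Confirm the client port setting took effect
grep clientPort /tmp/zoo.cfg
```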
----------------------------------------------------------------------------------------------------------
Kafka: https://kafka.apache.org/quickstart
1. Broker – a node (machine) in the cluster
2. Topic – a named message stream (messages specific to one stream)
3. Producer – publishes messages to a topic
/kafka_2.10-0.8.2.1/bin (executable files)
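Following the quickstart linked above, the broker/topic/producer pieces fit together roughly like this (run from the Kafka install directory; the topic name and ports are examples, and ZooKeeper must already be running):

```shell
# Start the broker (ZooKeeper assumed to be listening on :2181)
bin/kafka-server-start.sh config/server.properties &
# Create a topic, then publish to and consume from it
bin/kafka-topics.sh --zookeeper localhost:2181 --create --topic demo --partitions 1 --replication-factor 1
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic demo
bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic demo --from-beginning
```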
Commands
ls -ltr : output sorted by timestamp
cat log.txt : print the file on the console; cat log.txt | tail (last part)
cat log.txt | head (first part)
tail -200f log.e1 : real-time logging (follow the last 200 lines)
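The inspection commands above can be tried out on a generated file; /tmp/log.txt is just a scratch example:

```shell
# Make a 300-line scratch log file
seq 1 300 > /tmp/log.txt
head /tmp/log.txt   # first 10 lines
tail /tmp/log.txt   # last 10 lines
# tail -200f /tmp/log.txt would follow the file in real time (Ctrl-C to stop)
```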
benchmark-commands.txt
Producer
Setup
bin/kafka-topics.sh --zookeeper esv4-hcl197.grid.linkedin.com:2181 --create --topic test-rep-one --partitions 6 --replication-factor 1
bin/kafka-topics.sh --zookeeper esv4-hcl197.grid.linkedin.com:2181 --create --topic test --partitions 6 --replication-factor 3
Single thread, no replication
bin/kafka-run-class.sh org.apache.kafka.clients.tools.ProducerPerformance test7 50000000 100 -1 acks=1 bootstrap.servers=esv4-hcl198.grid.linkedin.com:9092 buffer.memory=67108864 batch.size=8196
Single-thread, async 3x replication
bin/kafka-topics.sh --zookeeper esv4-hcl197.grid.linkedin.com:2181 --create --topic test --partitions 6 --replication-factor 3
bin/kafka-run-class.sh org.apache.kafka.clients.tools.ProducerPerformance test6 50000000 100 -1 acks=1 bootstrap.servers=esv4-hcl198.grid.linkedin.com:9092 buffer.memory=67108864 batch.size=8196
Single-thread, sync 3x replication
bin/kafka-run-class.sh org.apache.kafka.clients.tools.ProducerPerformance test 50000000 100 -1 acks=-1 bootstrap.servers=esv4-hcl198.grid.linkedin.com:9092 buffer.memory=67108864 batch.size=64000
Three Producers, 3x async replication
bin/kafka-run-class.sh org.apache.kafka.clients.tools.ProducerPerformance test 50000000 100 -1 acks=1 bootstrap.servers=esv4-hcl198.grid.linkedin.com:9092 buffer.memory=67108864 batch.size=8196
Throughput Versus Stored Data
bin/kafka-run-class.sh org.apache.kafka.clients.tools.ProducerPerformance test 50000000000 100 -1 acks=1 bootstrap.servers=esv4-hcl198.grid.linkedin.com:9092 buffer.memory=67108864 batch.size=8196
Effect of message size
for i in 10 100 1000 10000 100000;
do
echo ""
echo $i
bin/kafka-run-class.sh org.apache.kafka.clients.tools.ProducerPerformance test $((1000*1024*1024/$i)) $i -1 acks=1 bootstrap.servers=esv4-hcl198.grid.linkedin.com:9092 buffer.memory=67108864 batch.size=128000
done;
Consumer
Consumer throughput
bin/kafka-consumer-perf-test.sh --zookeeper esv4-hcl197.grid.linkedin.com:2181 --messages 50000000 --topic test --threads 1
3 Consumers
On three servers, run:
bin/kafka-consumer-perf-test.sh --zookeeper esv4-hcl197.grid.linkedin.com:2181 --messages 50000000 --topic test --threads 1
End-to-end Latency
bin/kafka-run-class.sh kafka.tools.TestEndToEndLatency esv4-hcl198.grid.linkedin.com:9092 esv4-hcl197.grid.linkedin.com:2181 test 5000
Producer and consumer
bin/kafka-run-class.sh org.apache.kafka.clients.tools.ProducerPerformance test 50000000 100 -1 acks=1 bootstrap.servers=esv4-hcl198.grid.linkedin.com:9092 buffer.memory=67108864 batch.size=8196
bin/kafka-consumer-perf-test.sh --zookeeper esv4-hcl197.grid.linkedin.com:2181 --messages 50000000 --topic test --threads 1