Friday, 13 September 2013

Building Spark with Maven

Prerequisites –

Install the following and update your environment variables:
  1. Java
    1. Download Java and install it on your Linux box
    2. Update your environment with JAVA_HOME and PATH (see the sample exports after this list)
  2. Scala
    1. Download Scala using the command: wget http://www.scala-lang.org/files/archive/scala-2.10.1.tgz
    2. Extract Scala using the command: tar -xzf scala-2.10.1.tgz
    3. Update your environment with SCALA_HOME and PATH
  3. Maven
    1. Download Maven using the command: wget http://apache.tradebit.com/pub/maven/maven-3/3.2.3/binaries/apache-maven-3.2.3-bin.tar.gz
    2. Extract Maven using the command: tar -xzf apache-maven-3.2.3-bin.tar.gz
    3. Update your environment with MAVEN_HOME and PATH
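For reference, here is a minimal sketch of the exports you could append to your profile; the paths are assumptions, so point them at wherever you actually installed or extracted each package:

export JAVA_HOME=/usr/java/default          # assumed Java install location
export SCALA_HOME=/opt/scala-2.10.1         # assumed Scala extract location
export MAVEN_HOME=/opt/apache-maven-3.2.3   # assumed Maven extract location
export PATH=$PATH:$JAVA_HOME/bin:$SCALA_HOME/bin:$MAVEN_HOME/bin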
Spark requires Java 6+ and Maven 3.0.4 or newer to build. You need to configure MAVEN_OPTS so that Maven can use more memory than usual; otherwise the build will fail with an out-of-memory error (typically something like “java.lang.OutOfMemoryError: PermGen space”).
In CentOS – run the command “nano ~/.bash_profile”
In Ubuntu – run the command “sudo pico /etc/environment”

And add a new line – 

export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"

If you want to read data from HDFS, you have to build Spark against the specific version of HDFS in your environment, which can be done through the “hadoop.version” property.

If you don’t specify a version, it will build against Hadoop 1.0.4 by default.

Building for a specific distribution, depending on your need –

  1. Apache Hadoop 1.2.1:
     mvn -Dhadoop.version=1.2.1 -DskipTests clean package
  2. Cloudera CDH 4.2.0 with MapReduce v1:
     mvn -Dhadoop.version=2.0.0-mr1-cdh4.2.0 -DskipTests clean package
  3. Apache Hadoop 0.23.x:
     mvn -Phadoop-0.23 -Dhadoop.version=0.23.7 -DskipTests clean package
  4. Apache Hadoop 2.0.5-alpha (with YARN):
     mvn -Pyarn-alpha -Dhadoop.version=2.0.5-alpha -DskipTests clean package
  5. Cloudera CDH 4.2.0 (with YARN):
     mvn -Pyarn-alpha -Dhadoop.version=2.0.0-cdh4.2.0 -DskipTests clean package
  6. Apache Hadoop 0.23.x (with YARN):
     mvn -Pyarn-alpha -Phadoop-0.23 -Dhadoop.version=0.23.7 -DskipTests clean package
  7. Apache Hadoop 2.2.x (with YARN):
     mvn -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0 -DskipTests clean package
  8. Apache Hadoop 2.4.x (with YARN):
     mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package
  9. Different versions of HDFS and YARN:
     mvn -Pyarn-alpha -Phadoop-2.3 -Dhadoop.version=2.3.0 -Dyarn.version=0.23.7 -DskipTests clean package
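If the build succeeds, you should end up with an assembly jar. A quick way to confirm (the path below assumes a Spark 1.x source tree built with Scala 2.10, so adjust it to your checkout):

ls assembly/target/scala-2.10/spark-assembly-*.jar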

Building with Hive and JDBC support

To enable Hive integration for Spark SQL, along with its JDBC server and CLI, add the -Phive profile to your existing build options.
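For example, a Hadoop 2.4 build with YARN and Hive support could look like this (adjust the profiles and version to match the table above):

mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -DskipTests clean package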


Spark Tests in Maven

The ScalaTest Maven plugin is used, and it runs the tests by default. Some tests require Spark to be packaged first.
So always package it first using the command – mvn -DskipTests clean package

And then run the tests using the command – mvn test
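If you only want to run a single suite, the ScalaTest plugin can select it by class name. A hedged example (the suite name is just an illustration, and older plugin versions use -Dsuites instead of -DwildcardSuites):

mvn -DwildcardSuites=org.apache.spark.repl.ReplSuite test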

Saturday, 10 August 2013

Agile QA Best Practices


Step 1: As a story reaches you
·          Open the story, change the status to ‘Testing/In Testing’ and assign it to yourself.

·          Investigate the story & read the acceptance criteria thoroughly.

·          If the story doesn't have a checklist, you should write one (before creating it, ask other members of your team whether it has already been created, as per your project process).

·          Before closing the story, the checklist should be reviewed by another member of the team and then attached to the story.

·          You can create various labels such as ‘AutomationToDo’, ‘ChecklistToDo’, ‘ChecklistToReview’, ‘NeedToTestOnUAT’ etc. This will help team members keep track of the stories and makes them easy to search.


Step 2: What next

·          Waiting for information: The story is assigned back to you once it has been updated with the additional information. If the answer is sufficient, remove the flag and continue testing. If new questions appear, perform the flow described above again.

·         Blocked story/issue: Once the blocking defect is fixed (marked as Done or Ready for UAT), remove the flag and continue testing.

·         “Re-opened” stories: Once related issues are resolved, the story will be assigned back to you to continue testing.


Step 3: Taking the story to conclusion
 The story/defect is waiting for information from another person (BA or DEV):
·         The story is assigned to that person
·         The proper comment is added
·         The story is flagged
·         The story status remains unchanged.

The story/defect testing is blocked by a defect (blocker) that isn’t related to your story:
·         The story assignee remains unchanged (you should still be there)
·         The blocking defect is linked to the story (“blocks” or “is blocked by”)
·         The proper comment is added
·         The story is flagged
·         The story status remains “In Testing”

 If a defect related to the user story is found (one or all acceptance criteria are not satisfied):
·         Enter an intra-sprint bug and assign it to the developer who worked on it. If it's not obvious who worked on the task, assign it to the lead developer or team lead to triage and assign to the proper resource
·         Related issues are linked (“relates”)
·         The proper comment is added
·         The status of the story is ‘Reopened’.

 If a defect is NOT related to the user story and all acceptance criteria are satisfied:
·         Enter a bug and assign it to a developer
·         Link the bug to the appropriate epic
·         We do not reopen the story for such bugs
·         The story should be moved to the next step, Ready for Deployment to UAT/Done
·         Add a label such as ‘Automation-ToDo’.

 If all acceptance criteria of the user story are satisfied:
·         The checklist should be revised and attached
·         All related bugs are fixed
·         The proper comment is added.
·         The status of the story is ‘Done’ or ‘Ready for Deployment to UAT’, as per your project needs.
·         Log the time spent working on the user story.

Defect Logging
Step 4: Filing the ticket
·         The defect should be logged against your project.

·         The summary format should highlight: “Business Flow/Flavor – Page – Sub-page: issue description”.

·         The summary should give all the information about the issue.

·         For browser-specific issues the summary should contain the browser, e.g. Chrome.

·         For regression issues the summary should contain the word “Regression”.

·         If you’re not able to describe the issue fully in the summary, please add “see description” to the summary.

·         If you find any blocker, please investigate the behavior and make sure there is no workaround. If it is possible to avoid the issue by some specific steps, please mention the workaround in the summary.
Step 5: Fields to capture:
·         Clear steps to reproduce – if you’re not sure about the steps, please investigate the issue or ask teammates for help.

·         Expected and actual result (expected should be written first).

·         Attachments, where appropriate. In case of a stack trace error, please attach a .txt or .doc file with the error text, not a screenshot.

·         The information related to the environment on which you encountered the issue, mentioning the build number.

·         Notes, if there is anything additional you want to let people know.

·         Associate the defect with the current sprint, or if it is an intra-sprint defect, log it accordingly.

·         Write detailed steps if a workaround is present.

·         Defects should be associated/related to the corresponding user story. If there is no story (e.g. the issue is a regression), associate it with the regression-related story if one exists; otherwise skip it.

·         Defects should be linked to the epics corresponding to the project. If you’re not sure where to link, please ask your QA mates for help. If there is no epic for your issue, check with your teammates.

Wednesday, 10 July 2013

How to full-screen VirtualBox


After installing Linux in a virtual machine, you may at some point face an issue with the screen size. On the desktop version of any Linux box you need to install Guest Additions in your VM. The VirtualBox Guest Additions for Linux are a set of device drivers and system applications which may be installed in the guest operating system.

To install them, go to
Devices >> Install Guest Additions

It will prompt you with an option to auto-run the setup. Click Run and then enter your password to authenticate the installation in your Linux box.

It will start the installation... but in case it fails with an error like the one highlighted below...



Verifying archive integrity... All good. 
Uncompressing VirtualBox 4.1.12 Guest Additions for Linux......... 
VirtualBox Guest Additions installer 
Removing installed version 4.1.12 of VirtualBox Guest Additions... 
Removing existing VirtualBox DKMS kernel modules ...done. 
Removing existing VirtualBox non-DKMS kernel modules ...done. 
Building the VirtualBox Guest Additions kernel modules 
The headers for the current running kernel were not found. If the following 
Module compilation fails then this could be the reason.

Building the main Guest Additions module ...fail! 
(Look at /var/log/vboxadd-install.log to find out what went wrong) 
Doing non-kernel setup of the Guest Additions ...done. 
Installing the Window System drivers 
Warning: unknown version of the X Window System installed.  Not installing 
X Window System drivers. 
Installing modules ...done. 
Installing graphics libraries and desktop services components ...done.

Then open your terminal and run the following command.

sudo apt-get install virtualbox-guest-x11
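If the installer keeps failing because the kernel headers are missing (as the log above hints), installing the headers first usually helps. A sketch assuming an Ubuntu/Debian guest:

sudo apt-get install build-essential linux-headers-$(uname -r)

Then re-run the Guest Additions installer from the Devices menu.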



Now just reboot your virtual machine and it will start adjusting the screen size to whatever you need.

Thursday, 20 June 2013

Robot framework installation and environment setup

Installation and Environment Setup

This post will help you set up Robot Framework and RIDE on Windows.

Note: The installation of Robot Framework is done using Python, and all other software installed should be of the same version as that of Python

  1. Install Python: Python must be installed on the system. The Python 2.7.1 installer can be downloaded from the following link –

    https://www.python.org/download/releases/2.7.1/


    Once installed, the next step is to install setup tools for easy_install.
  2. Install easy_install: Download setuptools from the following link –

    https://pypi.python.org/pypi/setuptools/0.6c11

    Note that the setuptools file is specific to the Python version installed. For example, if the installed Python is version 2.7, then setuptools should also be for Python 2.7.
  3. Creation and Setting of Environment Variables: Create two environment variables
    1. PYTHONHOME = <Directory>:\Python27
    2. PATH = %PATH%;%PYTHONHOME%;%PYTHONHOME%\Scripts
  4. Verify the Python installation by executing the following command at the command prompt.
    >> Run command: python --version
    It will display the version of Python installed on the machine if the installation has been successful.
  5. Open a command prompt and run the following commands:
    1. easy_install pip
      Use ez_setup.py or get-pip if you are not able to install pip using easy_install
    2. pip install robotframework
      for installing Robot Framework.
    3. pip install robotframework-ride
      for installing RIDE.

      Note: easy_install can also be used in place of pip
  6. Install wxPython (Unicode): Download and install the wxPython2.8-win32-unicode-2.8.12.1-py27 file from the following link –

    http://www.wxpython.org/download.php
  7. Creation and Setting of Environment Variables:

    PYTHONPATH = <Directory>:\Python27\Lib; <Directory>:\Python27\Scripts; <Directory>:\Python27\Lib\site-packages; <Directory>:\Python27\Lib\site-packages\robotide\;<Directory>:\Python27\Lib\site-packages\robot\;<Directory>:\Python27\Lib\site-packages\wx-2.8-msw-unicode;
    Note – replace <Directory> with your installation directory path
  8. Verify the Robot Framework installation by executing the following commands at the command prompt.
    1. pybot --version
      It will display the version of Robot Framework installed on the machine if the installation has been successful.
    2. ride.py
      Executing this command verifies the installation; it will start RIDE.

Third Party Integration with Robot Framework

There are many different libraries that can be used with Robot Framework. Some of these libraries are installed implicitly with the Robot Framework installation, like the BuiltIn library, while others need to be installed explicitly.

Integration with Selenium2Library

Selenium2Library is a web testing library for Robot Framework. It uses the Selenium 2 (WebDriver) libraries internally.
Installing Selenium2Library with Robot Framework requires the following steps.

Pre-requisite: Python, Robot Framework and RIDE must be installed

Open a command prompt and run the following command:
  1. pip install robotframework-selenium2library
  2. Now edit the PYTHONPATH environment variable created above by adding the following path to it: <Directory>:\Python27\Lib\site-packages\Selenium2Library;
Open ride.py and verify the Selenium installation by adding the library to the project; that is, add “Selenium2Library” as a library at any place in the project hierarchy. If the installation was unsuccessful, RIDE will mark the library name in red, like: Selenium2Library
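Once the library shows up correctly, a quick end-to-end sanity check is to run one minimal test case. Below is a sketch (the URL, browser choice and expected title are illustrative, and it assumes Firefox is installed); save it as sanity.robot and run it with: pybot sanity.robot

*** Settings ***
Library    Selenium2Library

*** Test Cases ***
Open And Close Browser
    Open Browser    http://www.google.com    firefox
    Title Should Be    Google
    Close Browser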

Integration with AutoIt

AutoItLibrary is a Python keyword library that extends Robot Framework with keywords based on the COM interface to AutoIt, a freeware tool for automating the Windows GUI.

Installing AutoIt with Robot Framework requires the following steps.
Pre-requisite: Python, Robot Framework and RIDE must be installed 

  1. Download and install the AutoIt setup from the following link –

    https://code.google.com/p/robotframework-autoitlibrary/ 
  2. Copy the AutoItLibrary-1.1 package for Robot Framework from your download location and place it in “<Directory>:\Python27\Lib\site-packages”.
  3. Download the pywin32 module from the link below for the installed Python version and run the exe as Administrator.

    https://pypi.python.org/pypi/pywin32  
  4. Open a command prompt as administrator.
  5. At the command prompt, navigate to “<Directory>:\Python27\Lib\site-packages\AutoItLibrary-1.1” and run the following command

    python setup.py install
  6. Open ride.py and verify the AutoIt installation by adding the library to the project, the same as for Selenium2Library.

Saturday, 25 May 2013

Microsoft HDInsight and Testing


Microsoft has come up with a focus on offering Apache Hadoop on Windows Server and Windows Azure. HDInsight is a big data solution built on the Hortonworks Data Platform (HDP). It represents a significant benefit for a wide range of Windows users and makes it easy to build big data solutions on their existing Windows platform. With its exceptional features, HDInsight is available in two variants –
  1. The HDInsight service on Windows Azure (an open and flexible Windows cloud platform), which can be included as a component in any existing Azure account
  2. HDInsight Server for Windows Server, which is available for download at – http://www.microsoft.com/en-us/sqlserver/solutions-technologies/business-intelligence/big-data.aspx

The HDInsight service is a cloud-based solution for handling and implementing big data solutions. It targets the fundamental needs of distributed computing with efficiency. The service works as a wrapper to manage and store data, along with monitoring features. Monitoring helps to collect and analyze application as well as cluster performance details. HDInsight consists of two components: HDFS for computation of data and Azure Blob (binary large object) storage for data storage. Azure Blob is the default file system for storing data. It is a service for storing large datasets which can be accessed worldwide via HTTP or HTTPS. Blob storage provides high scalability & availability at low cost, and a long-term sharable option with Azure. Common uses of Blob storage include:
  • Images or documents that can be accessed directly from a browser
  • Storing files for distributed access
  • Streaming video and audio
  • Secure backup and disaster recovery
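For instance, a blob is addressable over HTTP(S) with a URL of roughly the following shape (the account, container and blob names are placeholders):

  https://<storage-account>.blob.core.windows.net/<container>/<blob-name>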

The HDInsight service helps to manage and analyze large data sets, and it leverages the parallel processing capabilities of the MapReduce programming model. Other Apache Hadoop technologies are also available with HDInsight to give users more options and wider scope to fulfill their big data needs. It provides implementations of Hive and Pig to bring data processing and warehousing capabilities together. It also integrates with other tools, such as SQL Server Analysis Services and Excel. The tool integration feature is helpful for test data collection or for loading generated/golden data. In the case of user-generated data, integration with tools helps to perform testing using different datasets. Test data is a key factor in the testing cycles of big data applications. Test data can be collected from different data management tools or any other BI tool.


HDInsight supports other Hadoop ecosystem components like Hive, HBase, ZooKeeper etc., and it provides automated Hadoop cluster creation along with the different ecosystem components.

For processing, data is transferred to HDFS, as it is highly optimized for computation, but maintaining an HDFS cluster after processing completes is expensive. HDInsight gives users a collective benefit by storing data in Azure Blob storage and processing it on HDFS. The HDInsight infrastructure is located on the computing nodes and the data resides in blob storage. For computation, it transfers data from storage to the computing nodes and ensures that the transfer is fast.
  To meet the need for fast data transfer, Microsoft has deployed Azure Flat Network Storage, also known as Quantum 10 or Q10 networking. It is a mesh grid network that allows very high bandwidth connectivity. HDInsight streams data from the storage nodes (Azure Blob) to the compute nodes (HDFS nodes).

HDInsight
HDInsight provides features like monitoring and automated deployment of a Hadoop cluster with its ecosystem components. Such features are very helpful for testing applications developed and deployed on HDInsight. Testers can leverage the benefits of monitoring for benchmarking & other non-functional aspects. In big data testing, test infrastructure plays a very important role. Test infrastructure should be scalable enough to validate the functional and non-functional aspects of an application and certify production-ready releases. HDInsight is a service on Microsoft Azure (cloud) and allows test engineers to add or remove any number of instances from an existing cluster.
The test environment should be efficient enough, in terms of configuration and memory, to process large amounts of data. HDInsight provides automated cluster deployment, so there is no need to worry about manual configuration of Hadoop and its ecosystems across different numbers of nodes.
Monitoring is a primary need for a cluster during data processing, to ensure efficient utilization of cluster resources. HDInsight provides monitoring features which help with performance monitoring and provide real-time details about cluster performance.
A testing user can design different use cases based on application needs (in terms of Hadoop ecosystems) and deploy a cluster in an automated manner. This serves the purpose of a test environment, creating a platform for big data testing on Azure.
Certain other benefits are –
  • Provides Open Database Connectivity (ODBC) drivers to integrate Business Intelligence (BI) tools
  • Full set of components in the Hadoop ecosystem, like Pig, Sqoop or Hive
  • Provides a Sqoop connector
  • Simplified configuration and post-processing of Hadoop jobs
  • Provides JavaScript and Hive interactive consoles to make it more usable

To conclude, HDInsight is well suited for development as well as testing of Hadoop-based applications. Looking at the 3 V’s of big data and the testing challenges associated with them, HDInsight provides monitoring and performance tuning features together to test the scalability of applications. A test engineer can deploy a test environment on Azure based on the test scenarios, scale it to any number of nodes, and finally terminate the instances as needed after the test cycles are executed.



Sunday, 21 April 2013

Establishing performance benchmarks for your Hadoop cluster

Hadoop clusters are there to manage your big data easily, but what about the quality needed to get optimized results and to save the cost of managing and maintaining the cluster? For this purpose, I was searching for some kind of testing or benchmark we could establish to help certify a Hadoop cluster environment.

Production-ready Hadoop distributions come with a set of benchmarks, which help you get a maximally optimized Hadoop cluster environment.

  1. TestDFSIO – This helps to find the IO rate of a Hadoop cluster with three operations: -write, -read and -clean. You can configure the number of files and the megabytes per file to be written to HDFS to measure the performance of the cluster during the write operation. It automatically generates data as per user input to check the performance of the cluster. After the write operation, the -read operation will read the same generated data and calculate IO and throughput while reading the files from HDFS. You can perform multiple tests to collect a set of result parameters and evaluate performance variation across different cluster configuration parameters while keeping the hardware configuration static.
    With TestDFSIO, you can set a benchmark for your Hadoop cluster and tune the cluster accordingly to get the maximum performance out of it.

    You need to run three simple commands to get results for the performance benchmarks –

    First, you need to do the write operation with the test jar that comes with the Hadoop distribution. This needs two user inputs: the number of files and the file size (in MB) you want to generate to test write capabilities. In the following example we generate 100 files of 1024 MB each, and the test reports the IO rate, throughput, IO rate std. deviation and test execution time of your Hadoop cluster based on the input values.
     $ hadoop jar hadoop-*test*.jar TestDFSIO -write -nrFiles 100 -fileSize 1024

    Next, you need to perform the read operation to test the IO rate while reading the same generated files. It needs the same user inputs. The following example shows its use –
    $ hadoop jar hadoop-*test*.jar TestDFSIO -read -nrFiles 100 -fileSize 1024

    Finally, after the write and read operations, you should clean up your cluster with the following command –

    $ hadoop jar hadoop-*test*.jar TestDFSIO -cleanup

  2. nnBench – The performance of the master in Hadoop is as important as its role in the cluster. nnBench gives a detailed report of the performance of the NameNode (the master of the Hadoop cluster). This benchmark tests the capability of your NameNode, on a given set of hardware, to handle a Hadoop cluster with multiple DataNodes. It generates multiple simultaneous requests to perform operations on files in HDFS. It gives you the facility to define the number of maps and reduces you want to use while checking the performance of the NameNode.

    You will find the full set of options that come with nnbench using the following command –

    $ hadoop jar hadoop-*test*.jar nnbench -help
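    As an illustration, a typical create_write run might look like this (a sketch only; tune the map/reduce counts, file count and base directory to your cluster):

    $ hadoop jar hadoop-*test*.jar nnbench -operation create_write -maps 12 -reduces 6 -numberOfFiles 1000 -replicationFactorPerFile 3 -baseDir /benchmarks/NNBench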

    Hadoop distributions come with other benchmarks too, which I will cover in my coming blogs. In the next blog, I will share my experience with loadgen and mrbench.

Wednesday, 27 March 2013

Role of QA in Requirement, Functional, Technical Phase of SDLC


QA plays a pivotal role right from the start of requirements generation, so as to understand the real business needs of the customer. Participation of QA in the course of feature preparation, arrangement and development makes a huge impact on the business, because if QA understands the product and the domain, it can lead to early identification of gaps and to fruitful ideas that enlarge and enhance the quality of the deliverables in depth.
QA works very closely with the business analysts to understand the users’ needs. Engagement of QA with the developers when they discuss architecture also improves product quality: at this point in the SDLC, QA sets up the user expectations, technical limitations can be worked out based on the planned functional solutions, and any needed change in the technical architecture can be addressed very early. The knowledge gathered by QA will not only help to focus the testing efforts on the most critical areas of the program, but will also help QA determine the areas of risk to the business. Some other benefits are:
  • Understand the functionality well in advance.
  • Propose new enhancements or feasibility early in the product life cycle.
  • Create test cases based on the actual business need.
  • Organize usability testing.
  • Perform exploratory testing on early builds.