Saturday, 25 May 2013

Microsoft HDInsight and Testing


Microsoft is focusing on offering Apache Hadoop on Windows Server and Windows Azure. HDInsight is a big data solution built on the Hortonworks Data Platform (HDP). It is a significant benefit for a wide range of Windows users, because it makes it easy to build big data solutions on their existing Windows platform. HDInsight is available in two variants:
  1. HDInsight Service on Windows Azure (an open and flexible cloud platform), which can be added as a component to any existing Azure account
  2. HDInsight Server for Windows Server, which is available for download at http://www.microsoft.com/en-us/sqlserver/solutions-technologies/business-intelligence/big-data.aspx

HDInsight Service is a cloud-based offering for building and running big data solutions. It targets the fundamental needs of distributed computing efficiently. The service works as a wrapper that manages and stores data and adds monitoring features. Monitoring helps to collect and analyze application and cluster performance details. HDInsight consists of two components: HDFS for computation on data and Azure Blob (Binary Large Object) storage for data storage. Azure Blob storage is the default file system for storing data. It is a service for storing large datasets that can be accessed from anywhere in the world via HTTP or HTTPS. Blob storage provides high scalability and availability at low cost, and serves as a long-term, shareable storage option within Azure. Common uses of Blob storage include:
  • Images or documents can be accessed directly from browser
  • Storing files for distributed access
  • Streaming video and audio
  • Secure backup and disaster recovery 
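Because blobs are reachable over HTTP or HTTPS, each one has a plain URL. The sketch below shows how such a URL is typically composed for the public Azure endpoint (`<account>.blob.core.windows.net`). The account, container, and blob names are made-up placeholders, and the helper `blob_url` is purely illustrative, not part of any Azure SDK.

```python
from urllib.parse import quote


def blob_url(account: str, container: str, blob_name: str, use_https: bool = True) -> str:
    """Compose the HTTP(S) address of a blob (illustrative helper, not an SDK call)."""
    scheme = "https" if use_https else "http"
    # Percent-encode the blob name; "/" is left as-is so virtual folders stay readable.
    return f"{scheme}://{account}.blob.core.windows.net/{container}/{quote(blob_name)}"


# Hypothetical account, container, and blob names.
example_url = blob_url("mystorageaccount", "testdata", "input/sample log.txt")
print(example_url)
```

Whether the URL is readable without credentials depends on the access level configured for the container.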

HDInsight Service helps to manage and analyze large data sets, and it leverages the parallel processing capabilities of the MapReduce programming model. Other Apache Hadoop technologies are also available with HDInsight, giving users more options and a wider scope for meeting their big data needs. It provides implementations of Hive and Pig, combining data processing and data warehousing capabilities. It also integrates with other tools, such as SQL Server Analysis Services and Excel. This tool integration is helpful for collecting test data or loading generated or golden data. For user-generated data, integration with these tools makes it possible to test with different datasets. Test data is a key factor in the testing cycles of big data applications, and it can be collected from different data management tools or any other BI tool.
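To make the MapReduce model and the role of golden data concrete, here is a minimal, single-process sketch of a word count. It only imitates the map, shuffle, and reduce phases in memory; it is not how a job is submitted to HDInsight, and the function names and the tiny golden dataset are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


# Hypothetical golden dataset: the expected output for a known input.
golden = {"big": 2, "data": 2, "testing": 1}
result = word_count(["big data", "Big data testing"])
print(result == golden)
```

On a real cluster the same comparison applies: run the job on a known input and compare its output with the stored golden result.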


HDInsight supports other Hadoop ecosystem components such as Hive, HBase, and ZooKeeper, and it provides automated creation of Hadoop clusters together with these components.

For processing, data is transferred to HDFS because HDFS is highly optimized for computation, but keeping an HDFS cluster running after processing completes is expensive. HDInsight gives users the combined benefit of storing data in Azure Blob storage and processing it on HDFS. The HDInsight infrastructure runs on the compute nodes while the data resides in blob storage. For computation, data is transferred from storage to the compute nodes, so that transfer needs to be fast.
To make this transfer fast, Microsoft has deployed Azure Flat Network Storage, also known as Quantum 10 or the Q10 network. It is a mesh grid network that allows very high-bandwidth connectivity. HDInsight streams data from the storage nodes (Azure Blob storage) to the compute nodes (HDFS nodes).

HDInsight
HDInsight provides features such as monitoring and automated deployment of Hadoop clusters with their ecosystem components. These features are very helpful for testing applications developed and deployed on HDInsight. Testers can use monitoring for benchmarking and other non-functional testing. In big data testing, the test infrastructure plays a very important role. It should be scalable enough to validate the functional and non-functional aspects of an application and to certify production-ready releases. HDInsight is a service on Microsoft Azure (cloud) that allows test engineers to add instances to, or remove them from, an existing cluster.
The test environment should have enough configuration and memory capacity to process large amounts of data. HDInsight provides automated cluster deployment, so there is no need to manually configure Hadoop and its ecosystem components across a varying number of nodes.
Monitoring is a primary need for a cluster during data processing, to ensure efficient utilization of cluster resources. HDInsight provides monitoring features that help with performance monitoring and give real-time details about cluster performance.
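
As a rough illustration of how monitoring data can feed a performance test, the sketch below summarizes a handful of utilization samples and flags overloaded nodes. The `Sample` record, the node names, and the 90% CPU threshold are all hypothetical; they do not reflect the format of HDInsight's actual monitoring output.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class Sample:
    """One hypothetical utilization reading from a cluster node."""
    node: str
    cpu_percent: float
    memory_percent: float


def summarize(samples: List[Sample], cpu_limit: float = 90.0) -> Dict[str, object]:
    """Aggregate readings and list nodes that exceeded the CPU limit."""
    cpu = [s.cpu_percent for s in samples]
    return {
        "mean_cpu": mean(cpu),
        "peak_cpu": max(cpu),
        "mean_memory": mean(s.memory_percent for s in samples),
        "hot_nodes": sorted({s.node for s in samples if s.cpu_percent > cpu_limit}),
    }


# Made-up readings for illustration only.
samples = [
    Sample("node-1", 40.0, 55.0),
    Sample("node-2", 95.0, 70.0),
    Sample("node-1", 60.0, 58.0),
]
report = summarize(samples)
print(report)
```

A non-functional test could then assert on such a report, for example that no node stays above the limit during a benchmark run.
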
A tester can design different use cases based on the application's needs (in terms of Hadoop ecosystem components) and deploy a cluster in an automated manner. This serves the purpose of a test environment by providing a platform for big data testing on Azure.
Other benefits include:
  • Provides Open Database Connectivity (ODBC) drivers to integrate Business Intelligence (BI) tools
  • A full set of Hadoop ecosystem components such as Pig, Sqoop, and Hive
  • Provides a Sqoop connector
  • Simplified configuration and post-processing of Hadoop jobs
  • Provides JavaScript and Hive interactive consoles to make it more usable

In conclusion, HDInsight is well suited to both development and testing of Hadoop-based applications. Given the three Vs of big data (volume, velocity, and variety) and the testing challenges associated with them, HDInsight provides monitoring and performance tuning features together to test the scalability of applications. A test engineer can deploy a test environment on Azure based on the test scenarios, scale it to the required number of nodes, and finally terminate the instances as needed or after the test cycles have run.
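
The deploy, scale, and terminate cycle described above maps naturally onto a setup/teardown pattern in test code. The following is a minimal sketch under stated assumptions: `FakeCluster` is an in-memory stand-in invented for this example, not an HDInsight or Azure API, and real provisioning would go through Microsoft's own tooling.

```python
from contextlib import contextmanager
from typing import Iterator


class FakeCluster:
    """In-memory stand-in for a cluster; NOT a real HDInsight or Azure API."""

    def __init__(self, name: str, nodes: int) -> None:
        self.name = name
        self.nodes = nodes
        self.running = True

    def scale(self, nodes: int) -> None:
        """Change the node count, as a test might before a load run."""
        if nodes < 1:
            raise ValueError("a cluster needs at least one node")
        self.nodes = nodes

    def terminate(self) -> None:
        self.running = False


@contextmanager
def provisioned_cluster(name: str, nodes: int) -> Iterator[FakeCluster]:
    """Create a cluster for a test cycle and always tear it down afterwards."""
    cluster = FakeCluster(name, nodes)
    try:
        yield cluster
    finally:
        # Terminate even if the test fails, so no instances keep running.
        cluster.terminate()


with provisioned_cluster("perf-test", nodes=4) as cluster:
    cluster.scale(16)  # scale up for the scalability scenario
    nodes_during_run = cluster.nodes

print(nodes_during_run, cluster.running)
```

The `finally` clause is the important design choice: the environment is released after the test cycle whether or not the tests passed.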
