Monday, March 9, 2015

An overview of Hadoop & Big Data

What is Hadoop...?

It's a framework of tools (an open-source software framework written in Java) used for processing large data sets on distributed storage, in a distributed manner. It is not just one piece of software that can be downloaded onto a computer and used to analyse big data.

Objective


Its objective is to support the running of applications on Big Data. The Hadoop framework allows for the distributed processing of big data sets across different computers & computer clusters. It does not require all the data to be on the local machine, and it is designed to scale up from a single server to thousands of machines. All of its libraries and modules are designed to detect and handle failures at the application layer, which makes it a highly available service on top of a cluster of computers. The fundamental assumption is that hardware failures are commonplace, and hence need to be handled by the framework automatically, in software.
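As a small illustration of the point that the data does not need to live on the local machine, here is a minimal sketch (assuming Hadoop's Java client libraries are on the classpath) of an application writing a file to Hadoop's distributed file system, HDFS, described in the module list below. The NameNode address and file path are placeholders for illustration only; replication is what lets the cluster keep serving the data when individual machines fail.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // HDFS stores several copies of each block; if a machine holding one
        // copy fails, the data is still served from the remaining replicas.
        conf.set("dfs.replication", "3");

        // Placeholder NameNode URI; a real cluster would have its own address.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode.example.com:9000"), conf);

        // The file ends up spread across the cluster, not on this machine's disk.
        try (FSDataOutputStream out = fs.create(new Path("/data/example.txt"))) {
            out.writeUTF("Hello from a Hadoop client");
        }
        fs.close();
    }
}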

The following are the main modules in Hadoop:
  • Hadoop MapReduce - A system for the parallel processing of large data sets, which relies on a framework for job scheduling and cluster resource management (a small word-count sketch follows this list).
  • Hadoop Distributed File System (HDFS) - A distributed file system that provides high-throughput access to application data.
  • Hadoop YARN - A framework for job scheduling and cluster resource management, on top of which Hadoop MapReduce runs.
  • Hadoop Common - The common utilities that support the other Hadoop modules.
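To make the MapReduce module more concrete, below is a minimal word-count sketch against Hadoop's Java MapReduce API, essentially the standard introductory example. The class names and the command line mentioned in the comments are illustrative; a real job would be packaged into a jar and submitted to the cluster with the hadoop command.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: split each input line into words and emit (word, 1).
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts emitted for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Input and output paths are passed on the command line, e.g.
        // hadoop jar wordcount.jar WordCount /input /output
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The map tasks run in parallel on the blocks of input stored in HDFS, YARN schedules those tasks across the cluster, and the reduce tasks then aggregate the per-word counts, which is how the modules above work together.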

It is open source and released under the Apache License, which guarantees that no single company controls its use; the project is governed by the Apache Software Foundation.

Hadoop often refers to a collection of additional software packages that can be installed on top of or alongside Hadoop's base modules.

Big Data:
With the explosion of web data, data started growing at a very high pace each day, in volume, variety and velocity. The data arrived in many different formats, so handling it with traditional systems became increasingly difficult and eventually went far beyond their ability. People therefore had to think of different ways to handle & process a variety of data in large chunks at a much faster speed. This led to the creation of Hadoop.

Big Data challenges (the 3 Vs):
  1. Velocity – data arrives at very high speed
  2. Volume – the amount of data is continuously increasing
  3. Variety – data comes in many different formats
Hadoop has the ability to handle all three of these challenges in today's IT world.

Why Hadoop:
One of the main reasons Hadoop became important is its ability to handle huge amounts of data, of any kind, quickly, at a time when data volumes and varieties are growing at a great pace each day, especially from sources like social media.

There are several other reasons for organizations to consider Hadoop. Some of them are as follows:

  1. Low cost
  2. Computing power
  3. Scalability
  4. Storage flexibility
  5. Self-healing capabilities (detecting & handling hardware failures)
  6. Data protection

Hadoop is used by a wide variety of companies and organizations for production and research. Prominent companies like Facebook and Yahoo are also users of the Hadoop framework. It can be deployed in on-site data centers or on the cloud (for example Microsoft Azure, Amazon EC2, Amazon S3, IBM Bluemix, or Google App Engine).

Fun Fact :)
Hadoop was the name of a toy elephant owned by the son of one of its inventors (Doug Cutting)...