Tuesday, October 7, 2014

Why Hadoop is Required

When you have massive loads of data in an RDBMS, you usually aggregate the raw data and, most of the time, move the raw data into different storage or delete it. The problem is that when, at some point, you want to access that raw data again, one of two things happens: it is either very time consuming or simply not possible.
Some companies keep their raw data in files and use ETL (Extract, Transform, Load) tools to aggregate it and load it into an RDBMS. This process puts huge stress on the network layer, since the data from the file stores has to move across the network into the ETL tool for processing. Once the data is aggregated, it is moved to an archive location, and retrieving archived data is a very slow process if someone wants to read the raw data again.

So there are three problems:

  1. Raw, high-fidelity data cannot be accessed.
  2. Computation cannot scale, since all the data has to be pulled into the computation engine over the network.
  3. To avoid stressing the computation engine, old data needs to be archived, and once it is archived it is expensive to access again.


How an RDBMS processes data


Hadoop does not move the data to the computation engine over the network; instead, it keeps the computation right next to the data.

Each server independently processes its own local data.
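To make that concrete, here is a minimal sketch of the classic word-count job written against Hadoop's Java MapReduce API. The input and output HDFS paths are hypothetical command-line arguments; the point is that Hadoop schedules each map task on (or near) the node that already stores the HDFS block it reads, so the large raw input never crosses the network, and only the small intermediate counts are shuffled to the reducers.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Each map task runs on a node holding the HDFS block it reads,
    // emitting (word, 1) pairs from its local slice of the data.
    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (token.isEmpty()) continue;
                word.set(token);
                context.write(word, ONE);
            }
        }
    }

    // Only the small intermediate (word, count) pairs travel over the
    // network to the reducers, which sum the counts per word.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) sum += c.get();
            context.write(word, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // raw data already on HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // results written back to HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Compare this with the ETL picture above: the raw data stays where it is, in full fidelity, and the program is shipped to the data rather than the other way around.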


