Since its inception in 2005, Hadoop has arguably been the most important analysis technology for big data. In over a decade of use, it has helped big data companies handle their structured and unstructured data in ways that even its creators, Mike Cafarella and Doug Cutting, might not have foreseen.
Fast forward to 2017 and Hadoop continues to grow. In spite of media reports that some companies are moving from Hadoop to other technologies such as Spark, the latest market forecasts indicate that Hadoop is set to grow rapidly over the next few years.
Researchers at Zion Market Research predict that the Hadoop market, currently valued at roughly $7.69 billion, will reach around $87.14 billion by 2022. That works out to a compound annual growth rate (CAGR) of about 50% over the forecast period.
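As a rough check on that growth figure, the short sketch below computes the CAGR implied by the two market values. The helper name `cagr` is ours, and the six-year period (a 2016 base year through 2022) is an assumption; the forecast as quoted does not state the base year.

```python
def cagr(start_value: float, end_value: float, periods: int) -> float:
    """Compound annual growth rate as a fraction (0.5 means 50%)."""
    return (end_value / start_value) ** (1 / periods) - 1


# Market values quoted above, in billions of USD.
start, end = 7.69, 87.14

# Assumption: six compounding years (2016 base year through 2022).
rate = cagr(start, end, periods=6)
print(f"Implied CAGR: {rate:.1%}")
```

Under that assumption the result lands very close to the 50% quoted. With only five compounding years the implied rate would be noticeably higher, so the base year matters.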
Forward-looking tech companies were quick to embrace Hadoop, a framework that stores and processes big data in a distributed environment across one or more clusters of computers. This open source framework uses simple programming models to store and process structured and unstructured data. It is designed to scale from a single server to many machines, each offering local storage and computation.
How Big Tech Companies Have Invested in Hadoop
To get an idea of how heavily big tech companies have invested in Hadoop, take a look at their clusters. The bigger a company's clusters, the greater its investment in the framework. According to a 2013 article by big data expert Jimmy Wong, Yahoo had the largest Hadoop clusters, with more than 42,000 nodes.
These are the numbers as they were in 2011. Facebook had 2,000 nodes at the time. Mr. Wong estimated Quantcast to have around 750 nodes and LinkedIn around 4,100 nodes. Other notable companies on the list were NetSeer with 1,050 nodes and eBay with 532 nodes.
Given that these numbers date from before Cloudera and Intel joined forces to accelerate enterprise adoption of Hadoop, more recent numbers may be considerably greater. That raises the question: why do the biggest tech companies favor Hadoop? Here are a number of possible explanations:
It is an excellent framework for handling massive data sets
While Hadoop may not be the framework of choice for small enterprises dealing with megabytes or gigabytes of data, it is well suited to data analysis at big companies handling terabytes and petabytes. Few technologies offer the flexibility, scalability and cost efficiency that Hadoop makes available to big data companies.
It is highly cost effective
With the traditional approach to data analysis, big data companies had to spend a fortune to keep up with their exploding data sets. For that reason, many analyzed only a fraction of their data and used it as a sample to make assumptions about the rest. What's more, it was uneconomical to store the raw data they collected every day, so they would delete it after analyzing a sample.
The introduction of Hadoop changed all this. Big data companies could now afford to store their data for longer and to analyze all of their raw data. Instead of the millions of dollars they would have spent analyzing their data with the old methods, tech companies can do this at a fraction of the cost, with the dollar amount in the hundreds for each terabyte of data.
It is highly scalable
Hadoop's ability to both store and distribute large data sets across many servers makes scaling easy for companies using the framework. This approach also lets companies use inexpensive servers and operate them in parallel, so every server added to the system brings additional processing power.
Hadoop uses the MapReduce programming model, which makes it possible for enterprises to run applications across many nodes that together process terabytes or petabytes of data. A traditional RDBMS (relational database management system) would struggle to scale up to process data of this magnitude.
It is extremely flexible
The Hadoop framework gives businesses the ability to tap new sources of data with ease and generate value from them. That value could take the form of business insights from email conversations, social media and similar sources.
Moreover, the data they can access is not limited to the structured type. Unstructured data is included too, giving businesses insights they could only have dreamed of with traditional data analysis frameworks. Hadoop can also be used for data warehousing, fraud detection, marketing campaign analysis, recommendation systems and log processing.
It is notably fast
One big selling point of Hadoop is its storage layer, the Hadoop Distributed File System (HDFS). The file system keeps a map of where each piece of data is stored, making the data easy to locate wherever it sits in a cluster.
The tools that process the data usually run on the same servers where that data is stored, which speeds up processing because less data has to travel over the network. On a large enough cluster, Hadoop can work through terabytes of data in minutes, making it possible for big data companies to process large volumes of unstructured data quickly and inexpensively.
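The idea of running the computation where the data already lives can be illustrated with a toy scheduler. This is only a sketch in plain Python, not Hadoop's actual scheduler: the block map, the node names and the `assign_tasks` helper are all hypothetical, and for simplicity a node can take any number of tasks.

```python
# Hypothetical block map: block id -> nodes that hold a copy of that block.
block_locations = {
    "blk_1": ["node-a", "node-b"],
    "blk_2": ["node-b", "node-c"],
    "blk_3": ["node-a", "node-c"],
}


def assign_tasks(block_locations, free_nodes):
    """Give each block's task to a free node holding the block when possible.

    Falls back to any free node (a "remote" read) when no holder is free.
    """
    assignments = {}
    for block, holders in block_locations.items():
        local = [node for node in holders if node in free_nodes]
        if local:
            assignments[block] = (local[0], "local")
        else:
            assignments[block] = (sorted(free_nodes)[0], "remote")
    return assignments


plan = assign_tasks(block_locations, free_nodes={"node-a", "node-c"})
for block, (node, kind) in plan.items():
    print(block, "->", node, f"({kind})")
```

With `node-a` and `node-c` free, every task in this example can run next to a copy of its block, so nothing has to be shipped across the network. If only `node-c` were free, the task for `blk_1` would fall back to a remote read.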
It is based on a simple programming model
Because Hadoop is built on a simple programming model, programmers can write MapReduce applications that handle very large jobs efficiently without having to manage the distribution of work themselves.
These programs are typically written in Java, a widely used programming language, which makes it easy for many programmers to write programs that meet the data processing needs of the tech companies they work for.
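To show how small the programming model is, here is the classic word count expressed as map, shuffle and reduce steps. It is a single-process sketch in plain Python rather than a real Hadoop job (which would typically be written in Java against the Hadoop API), and the function names are our own.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: collapse the values for one key into a single total."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


lines = ["big data needs big clusters", "hadoop stores big data"]
counts = word_count(lines)
print(counts)
```

The programmer supplies only the mapper and the reducer. In a real cluster the framework takes care of the shuffle, runs many mappers and reducers in parallel on different nodes, and moves the intermediate pairs between them.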
It is resilient to failure
Hadoop is designed to tolerate failure. When a node receives data, that data is also replicated to other nodes in the cluster, so if a particular node fails, copies of the data are still available for use. The framework also detects failed nodes and tasks and automatically reruns the affected work elsewhere, so jobs recover quickly without manual intervention.
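The effect of replication can be sketched with a small simulation. The round-robin placement below is a deliberate simplification and not HDFS's real placement policy; the node and block names and both helper functions are hypothetical. The replication factor of 3 matches the HDFS default.

```python
REPLICATION = 3  # HDFS keeps three copies of each block by default


def place_blocks(blocks, nodes, replication=REPLICATION):
    """Simplified round-robin placement on `replication` distinct nodes.

    Assumes replication <= len(nodes) so the chosen nodes are distinct.
    """
    placement = {}
    for i, block in enumerate(blocks):
        placement[block] = {
            nodes[(i + r) % len(nodes)] for r in range(replication)
        }
    return placement


def readable_blocks(placement, failed_nodes):
    """Blocks that still have at least one copy on a healthy node."""
    return {
        block
        for block, holders in placement.items()
        if holders - failed_nodes
    }


nodes = ["n1", "n2", "n3", "n4"]
blocks = ["blk_1", "blk_2", "blk_3", "blk_4"]
placement = place_blocks(blocks, nodes)

survivors = readable_blocks(placement, failed_nodes={"n1", "n2"})
print(sorted(survivors))
```

In this four-node example every block remains readable even with two nodes down, because each block has a third copy somewhere else. Only when all three holders of a block fail at once does that block become unavailable.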