Every day, scientific experiments, commercial online data collection, and physical transactions made through digital devices produce massive volumes of data. Analyzing the collected data depends on a range of computational algorithms, which help produce meaningful results.
Large-scale data-intensive computing is a recent development in scientific computing, and its growth has helped people explore new possibilities for prediction and data analysis.
Because large volumes of data are collected daily, businesses have adopted parallel computing solutions such as the pleasingly parallel cloud model, which is easy to use and scales well. The challenge with this model is that it has not been universally accepted.
MapReduce is a data-centric model in cloud computing that provides the computing power to implement and deploy applications within a diverse environment. Google introduced the MapReduce computing model to make it easier to analyze the data collected by Google services.
The open source community, led by Yahoo and IBM, later introduced Hadoop MapReduce. Since its introduction, Hadoop has been widely used by academic researchers and as a big data analytics tool by companies such as Facebook.
Research on the MapReduce framework and its extensions
The MapReduce programming model rests on a simple concept: a computation takes a set of input key-value pairs associated with the input data and produces a set of output key-value pairs.
In the map stage, the input data is divided into input segments, which are then assigned to map tasks running on processing nodes in the cluster.
Each map task applies the user-defined computation to every input key-value pair in its assigned segment and generates a set of intermediate results for each key.
This is followed by the shuffle and sort stage, which sorts the intermediate data generated by each map task together with the data from other nodes. The data is then divided into regions and distributed to the reduce tasks for processing.
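The three stages described above can be sketched in a few lines of Python. This is only a single-process illustration using word counting as the example job, not Google's or Hadoop's implementation. The names run_mapreduce, map_words, sum_counts, and partition_for, as well as the CRC-based partitioning, are assumptions made for this sketch.

```python
import zlib
from collections import defaultdict


def map_words(_key, line):
    """User-defined map function: emit (word, 1) for every word in a line."""
    for word in line.lower().split():
        yield word, 1


def sum_counts(_word, counts):
    """User-defined reduce function: add up all counts seen for one key."""
    return sum(counts)


def partition_for(key, num_reducers):
    """Pick a region (partition) for a key; a stable hash is assumed here."""
    return zlib.crc32(repr(key).encode("utf-8")) % num_reducers


def run_mapreduce(records, map_fn, reduce_fn, num_reducers=2):
    # Map stage: apply the user function to every input key-value pair.
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))

    # Shuffle and sort stage: split intermediate pairs into regions,
    # grouping the values that share a key.
    regions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in intermediate:
        regions[partition_for(key, num_reducers)][key].append(value)

    # Reduce stage: each reduce task handles the sorted keys of one region.
    output = {}
    for region in regions:
        for key in sorted(region):
            output[key] = reduce_fn(key, region[key])
    return output


lines = [(0, "the quick fox"), (1, "the lazy dog the end")]
word_counts = run_mapreduce(lines, map_words, sum_counts)
```

In a real cluster the map and reduce tasks would run on different nodes and the regions would travel over the network. Here everything stays in memory so that the data flow is easy to follow.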
MapReduce provides a simple application interface through which users can run a single program, including across a distributed environment.
Once the program has been placed in HDFS, tasks can be created based on the location of the data. The job is then submitted to the job tracker on the master node, which performs the remaining functions. This makes task scheduling an important research topic.
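To make the idea of creating tasks based on data location more concrete, here is a minimal sketch of a locality-aware assignment. It is not Hadoop's actual scheduler. The function assign_map_tasks, the node names, and the slot counts are all hypothetical, and the greedy strategy is just one simple choice.

```python
def assign_map_tasks(block_locations, free_slots):
    """Assign each input segment to a node, preferring nodes that hold its data.

    block_locations: dict mapping segment id -> list of nodes storing that segment
    free_slots: dict mapping node -> number of free map slots
    Returns a dict mapping segment id -> chosen node.
    """
    slots = dict(free_slots)  # work on a copy so the caller's dict is untouched
    assignment = {}
    remote = []

    # First pass: give every segment a node that already stores it, if possible.
    for segment, nodes in block_locations.items():
        local = next((n for n in nodes if slots.get(n, 0) > 0), None)
        if local is None:
            remote.append(segment)
        else:
            assignment[segment] = local
            slots[local] -= 1

    # Second pass: remaining segments go to the node with the most free slots,
    # which means their data has to be moved to the computation.
    for segment in remote:
        if not slots:
            break
        node = max(slots, key=slots.get)
        if slots[node] <= 0:
            break  # no capacity left; the segment stays unassigned for now
        assignment[segment] = node
        slots[node] -= 1
    return assignment


locations = {"s0": ["n1", "n2"], "s1": ["n1"], "s2": ["n3"]}
capacity = {"n1": 1, "n2": 1, "n3": 0, "n4": 1}
plan = assign_map_tasks(locations, capacity)
```

In this example s0 runs where its data lives, while s1 and s2 fall back to remote nodes because their local nodes have no free slot. A smarter scheduler could place s0 on n2 so that s1 stays local on n1, which hints at why scheduling attracts so much research.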
MapReduce is, however, limited by the shuffle stage. Reducers begin reducing only after all map results have been locally sorted and passed on, to avoid interference in the overall process. As a result, when a job has long-running final map tasks, the remaining computing resources cannot perform any work for it.
Some applications generate large volumes of intermediate data in the shuffle stage. For these applications, the shift-reduce extension introduces a local reduction step: the results of tasks running on the same computing node are reduced locally before being transferred to the reducer.
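The effect of local reduction can be illustrated with a small sketch. It assumes a word-count job and an associative reduce operation (addition). The function names and the node layout are hypothetical and are not taken from any specific framework.

```python
from collections import Counter


def map_task(lines):
    """One map task: emit (word, 1) for every word in its lines."""
    return [(word, 1) for line in lines for word in line.split()]


def reduce_pairs(pairs):
    """Sum the values per key; used both for local and for final reduction."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return list(totals.items())


def run_with_local_reduction(node_tasks):
    """Return (final counts, pairs shuffled without and with local reduction)."""
    sent_without = 0
    shuffled = []
    for tasks in node_tasks.values():
        # Collect the output of all map tasks that ran on this node.
        node_pairs = [pair for lines in tasks for pair in map_task(lines)]
        sent_without += len(node_pairs)
        # Local reduction: combine pairs on the node before the transfer.
        shuffled.extend(reduce_pairs(node_pairs))
    return dict(reduce_pairs(shuffled)), sent_without, len(shuffled)


# Two nodes, each running two map tasks over its own lines.
node_tasks = {
    "node-a": [["a b a"], ["a c"]],
    "node-b": [["b b"], ["c a"]],
}
counts, pairs_without, pairs_with = run_with_local_reduction(node_tasks)
```

Here nine intermediate pairs shrink to six before they cross the network, while the final counts stay the same. The saving grows with the number of repeated keys per node, and the approach only works when the reduce operation can safely be applied in stages.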
For very large datasets, a single commodity server may not provide the required computational efficiency, whether or not it uses MapReduce. This shortage of computational resources is addressed by map-reduce-global-reduce.
This system gathers user-trusted clusters across the Internet and runs a global MapReduce framework over them. The program inputs are distributed to each cluster, and the global controller then assigns the MapReduce tasks.
A global reducer collects all results at the end of each job and generates the final output. However, this approach does not take the locality of the input into account: it always moves the data to the computation.
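The flow described above can be sketched as follows. This is a simplified, single-process illustration under stated assumptions: the cluster names are made up, inputs are distributed round-robin, and the reduce function is assumed to be associative so that it can be applied once per cluster and again globally. It does not reproduce any particular system's design.

```python
from collections import defaultdict


def local_mapreduce(records, map_fn, reduce_fn):
    """Plain MapReduce inside one cluster: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return {key: reduce_fn(values) for key, values in groups.items()}


def map_reduce_global_reduce(records, clusters, map_fn, reduce_fn):
    # Global controller: distribute the inputs to the clusters.
    # The data is moved to the computation; input locality is ignored.
    shares = {name: [] for name in clusters}
    for index, record in enumerate(records):
        shares[clusters[index % len(clusters)]].append(record)

    # Each cluster runs its own MapReduce job on its share of the input.
    partials = [local_mapreduce(shares[name], map_fn, reduce_fn)
                for name in clusters]

    # Global reducer: collect all per-cluster results and reduce them again.
    merged = defaultdict(list)
    for partial in partials:
        for key, value in partial.items():
            merged[key].append(value)
    return {key: reduce_fn(values) for key, values in merged.items()}


def split_words(line):
    return [(word, 1) for word in line.split()]


documents = ["a b", "b c", "a a", "c"]
totals = map_reduce_global_reduce(documents, ["cluster-1", "cluster-2"],
                                  split_words, sum)
```

The round-robin distribution step is where the locality problem shows up: every record is shipped to a cluster regardless of where it was originally stored.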
Big data has boosted the growth of data-intensive computing, which in turn has led to the emergence of big data solution tools. The MapReduce framework, developed by Google, provides a model that helps solve big data problems.