Integrating computers into our daily lives has led to a generation of data at staggering rates. Estimates from IDC indicated that there were over 486 exabytes in the digital universe.
The computing industry is facing a lot of challenges in handling the data they produce. The industries are being challenged to develop ways to achieve a cost-effective processing of large scale data.
The introduction of MapReduce computing has led to the discovery of various methods which can be used to deal with data intensive computing.
MapReduce has some open source software, and Hadoop is one of them. It is used to manage storage resources in different intensive computing clusters using a distributed user-level filesystem.
The file system, Hadoop Distributed File System (HDFS) is designed for easy portability and installation across various hardware and software systems. Java is used to write the storage system.
The performance of the underlying storage system is the one which determines the efficiency of Hadoop cluster. HDFS offers universal access to files in the group.
Input data from HDFS can be read by map tasks, and output data can be saved in HDFS using reduce tasks. However, the intermediary data between map and reduce tasks is stored in each local node in temporary storage, not in HDFS.
HDFS divides files into large blocks which are then saved as separate files in the local file system. Implementation of HDFS happens through two services, the DataNode and NameNode. The NameNode operates on a single node, but it is a centralized service in the structure.
It is the one that manages the HDFS directory tree. Clients can only perform common file system operations like opening, closing and deleting by contacting the NameNode. The NameNode maintains the mapping between HDFS file names, a list of blocks in the file and the DataNodes in which these blocks are stored.
HDFS blocks are stored by the DataNodes on behalf of the remote or local clients. The node’s local file system then stores each block as a separate file. It is not a must for all nodes to use the same local file system. This is because the DataNode extracts away each detail in the local storage arrangement.
The NameNode can order the DataNodes to either create or destroy the blocks. Typically, clients use the NameNode to validate or process their requests. The namespace is managed by the NameNode, but the clients can also communicate with the DataNodes directly to read or write data at the HDFS block level.
Hadoop MapReduce applications utilize storage very differently compared to general-purpose computing. First, the data files accessed are normally very large, between tens to hundreds of GB in size.
Streaming access patterns are then used to manipulate these files. These patterns are typically used to process a batch of workloads at a time. When reading the files, large data segments are retrieved per every operation.
Client applications see the HDFS file as a standard output stream for them to write data to HDFS. However, great deals of complexities are hidden by this concept in the Hadoop framework. When it comes to writing HDFS data to disk, there are three threads which perform different tasks.
The client-facing thread is the first one. First, it fragments the data stream into HDFS-sized blocks, typically 64MB then it reduces that to smaller packets of 64KB.
FIFO which can hold up to 5MB then queues each packet, thus removing the client thread from the storage system dormancy during normal operations.
The second thread then removes the queued packed from the FIFO. The thread then assigns HDFS block IDs and destinations in coordination with the NameNode. It then transmits the blocks to either local or remote DataNodes for storage.
Lastly, the third thread manages the responses from the DataNodes that all data has been committed to a disk. Now and then, the DataNodes generate a report of all the blocks stored in the NameNode. The reports verify the replication of each file and in case that fails, instruct the DataNodes to make extra copies.