Data intensive computing is a class of parallel computing that exploits data parallelism to process large volumes of data, typically terabytes or petabytes in size. Data at this scale is generated every day and is commonly referred to as Big Data.
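The core idea of data parallelism can be illustrated with a minimal sketch: the same operation is applied independently to chunks of a large dataset, and the partial results are then combined. The chunking scheme, worker count, and `count_words` function below are illustrative choices, not part of any particular framework.

```python
from multiprocessing import Pool

def count_words(chunk):
    # Per-chunk work: runs independently on each slice of the data.
    return sum(len(line.split()) for line in chunk)

def parallel_word_count(lines, workers=4):
    # Split the input into roughly equal chunks, one per worker.
    size = max(1, len(lines) // workers)
    chunks = [lines[i:i + size] for i in range(0, len(lines), size)]
    with Pool(workers) as pool:
        partial = pool.map(count_words, chunks)  # "map" step, in parallel
    return sum(partial)                          # "reduce" step

if __name__ == "__main__":
    data = ["the quick brown fox", "jumps over", "the lazy dog"] * 1000
    print(parallel_word_count(data))  # 9 words per triple x 1000 = 9000
```

At real Big Data scale the chunks would live on many machines rather than in one process's memory, but the map-then-combine structure is the same.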
In 2007, an IDC white paper sponsored by EMC Corporation estimated the amount of information then stored in digital form at 281 exabytes. One can only imagine how much larger that figure is today.
The figure reported by IDC shows that data is being generated faster than our capacity to analyze it. The methods generally used to solve traditional problems in computational science cannot be applied in this case.
To tackle this problem, companies are developing dedicated tools and frameworks for processing data at this scale.
Data intensive computing has several characteristics that distinguish it from other forms of computing:
- To achieve high performance, data intensive computing minimizes the movement of data: executing algorithms on the node where the data resides reduces system overhead and increases performance.
- Data intensive computing systems use a machine-independent approach in which the runtime system controls scheduling, execution, load balancing, communication, and the movement of programs.
- Data intensive computing places a strong emphasis on the reliability and availability of data. Traditional large-scale systems may be susceptible to hardware failures, communication errors, and software bugs; data intensive computing is designed to tolerate these failures.
- Data intensive computing is designed for scalability, so it can accommodate any volume of data and meet time-critical requirements. Scalability of both the hardware and the software architecture is one of its biggest advantages.
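Two ideas from the list above, minimizing data movement and tolerating node failures, can be sketched together in a toy scheduler. This is a minimal illustration under assumed names (`replicas`, `alive`, `run` are all hypothetical), not the API of any real system: the scheduler prefers a node that already holds a replica of the data block, and falls back to another replica holder if that node has failed.

```python
def schedule(block, replicas, alive, run):
    """Run `run(node, block)` on a live node that stores a replica of `block`."""
    for node in replicas[block]:   # prefer nodes that already hold the data
        if node in alive:          # skip failed nodes instead of aborting
            return run(node, block)
    raise RuntimeError(f"no live replica for block {block!r}")

# Usage: three-way replication; node 'a' is down, so the task is
# reassigned to the next live node that holds the block.
replicas = {"block-1": ["a", "b", "c"]}
alive = {"b", "c"}
print(schedule("block-1", replicas, alive, lambda node, blk: f"ran on {node}"))
```

Running the computation where a replica already lives avoids shipping the block over the network, and replication means a single node failure costs a reassignment rather than the whole job.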