Database systems with MapReduce

Advancement in the database industry remains extremely vivid. In just the few years, new data scaling models have totally changed the efficiency that database systems can provide. These include the use of very huge, multi-terabyte memories; which can considerably increase the amount of data cached in the main memory; and the movement of database into Meta storage to speed functions such as data read, encryption, compression, and much more. One such prominent model is known as MapReduce.

MapReduce is a model for generating and processing large data. MapReduce is motivated by the map and reduce features regularly used in Functional programming.

Users specify a mapping function that will processes key/value pairs in order to create an intermediate key/value pairs.
Users specify a reduction function that merges all intermediate key/value pairs associated with the same intermediate key.

Programs written in this functional style will be parallelized automatically and executed over a large cluster of machines. The run-time systems will take care of details of dividing the input data, scheduling the execution across the cluster of machines, managing inter-system communication and finally managing machine failures. This allows developers without any prior experience with distributed and parallel systems to easily utilize the resources of a large distributed system.

Origins of MapReduce

The idea behind MapReduce originated in popular Internet Search engine companies, like Google, Yahoo, Ask etc... Search engines companies utilize massive farms of servers that crawl the web sites and retrieve useful web pages into local files. They process this data to build search indexes.
Algorithms are used to determine the quality of a web page and the significance to search terms. The mapping function will identify latent search parameter in a web page. The reduction function will determine the number of times a search parameter is used in a page.

MapReduce Implementation

A popular implementation of MapReduce is Apache Hadoop. Hadoop is a MapReduce based data processing model engineered to execute queries and other read operations against colossal datasets that can be terabyte and even petabyte in size. Here the data is added to the Hadoop Distributed File System (HDFS). Hadoop then scans through the data to produce results. These results are then finally loaded into other files. Using the MapReduce concept, Hadoop splits a problem and sends the sub-problems to different machines, and lets each machine solve its sub-problem in parallel. It then combines all the sub-problem together and writes out the final solution into files for further operations.

Conclusion

Parallel processing is now widespread and it very important to harness its full potential. MapReduce is a promising model for multicore environments, to fully utilize the power of such resources.

Reference:

http://en.wikipedia.org/wiki/MapReduce
http://userpages.uni-koblenz.de/~laemmel/MapReduce/paper.pdf
http://www.willowgarage.com/sites/default/files/ChuCT_etal_2006.pdf
http://infolab.stanford.edu/~ullman/pub/mapred.pdf
http://hadoop.apache.org/
http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//archive/mapreduce-osdi04.pdf
http://www.oracle.com/technetwork/database/hadoop-nosql-oracle-twp-398488.pdf

*If you find something is misleading or not correct then please throw some light on it.