Paper: MapReduce - Simplified Data Processing on Large Clusters
This paper sparked a revolution in large-scale data processing after it was published in 2004.
The basic idea behind map-reduce was that many data processing tasks could be expressed as a sequence of map and reduce operations, and these operations could be scaled easily to process very large data sets.
Engineers would no longer have to think about fault tolerance, partitioning large data sets, optimal usage of network bandwidth, and a variety of other low-level concerns. All they would need to do was express their tasks as map and reduce operations, and the underlying framework would take care of the rest.
A map operation would take an input record and emit intermediate key/value pairs. A reduce operation would then receive all values that share the same intermediate key and process them in one batch. This simple programming model proved to be surprisingly versatile.
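To make this concrete, here is a minimal sketch of the word-count example from the paper, written as plain Python functions. The names map_fn and reduce_fn are my own, not part of any framework's API; a real MapReduce implementation would invoke user functions like these on its behalf.

```python
def map_fn(document: str):
    """Map: take one input record (a document) and emit
    intermediate (key, value) pairs -- here, (word, 1)."""
    for word in document.split():
        yield (word, 1)


def reduce_fn(word: str, counts):
    """Reduce: take all values emitted under one intermediate key
    and combine them -- here, by summing the counts."""
    return (word, sum(counts))
```

Because the map step looks at one record at a time and the reduce step looks at one key at a time, both can be spread across many machines without the programmer having to coordinate anything.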
When Hadoop – an open source implementation of map-reduce, backed heavily by Yahoo – appeared, it rapidly gained popularity. Soon systems like Hive were built on top of Hadoop and made processing huge datasets even easier: now all a programmer had to do was write some SQL, and their query could process hundreds of terabytes of data with no extra effort on their part.
Why does map-reduce work so well? I think it's because map-reduce is simply a formalisation of what we have to do when processing large data sets anyway: Break the data set into small chunks, process each chunk independently, then join the chunks back together.
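The sketch below shows that break/process/join structure end to end on a single machine, reusing the map_fn and reduce_fn defined earlier. It is only an illustration of the data flow under my own simplifying assumptions; a real MapReduce runtime distributes these steps across machines and handles failures, scheduling, and data locality for you.

```python
from collections import defaultdict
from itertools import chain


def run_mapreduce(records, map_fn, reduce_fn, n_chunks=4):
    # 1. Break the data set into small chunks.
    chunks = [records[i::n_chunks] for i in range(n_chunks)]

    # 2. Process each chunk independently (in a real system,
    #    each chunk would go to a different worker).
    mapped = [list(chain.from_iterable(map_fn(r) for r in chunk))
              for chunk in chunks]

    # 3. Shuffle: group all intermediate values by their key.
    groups = defaultdict(list)
    for key, value in chain.from_iterable(mapped):
        groups[key].append(value)

    # 4. Reduce each group and join the results back together.
    return [reduce_fn(key, values) for key, values in groups.items()]


docs = ["the quick brown fox", "the lazy dog", "the fox"]
print(run_mapreduce(docs, map_fn, reduce_fn))
# e.g. [('the', 3), ('quick', 1), ('brown', 1), ('fox', 2), ('lazy', 1), ('dog', 1)]
```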