MapReduce is a general programming model and implementation for processing large data sets. The user expresses the computation as two functions: map and reduce.

Let's look at a concrete example. Imagine you have a large number of text files in a directory, and you want to count the occurrences of each word. A non-parallel implementation might look like:
word_count = {}
for each file f in dir:
    for each word w in f:
        if w in word_count:
            word_count[w] += 1
        else:
            word_count[w] = 1
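For reference, here is a minimal, runnable Go version of the same sequential loop; the ./data directory is a placeholder for wherever your files live:

package main

import (
    "fmt"
    "os"
    "path/filepath"
    "strings"
)

// Sequential word count over every file in a directory.
func main() {
    dir := "./data" // placeholder input directory
    wordCount := map[string]int{}

    entries, err := os.ReadDir(dir)
    if err != nil {
        panic(err)
    }
    for _, e := range entries {
        if e.IsDir() {
            continue
        }
        contents, err := os.ReadFile(filepath.Join(dir, e.Name()))
        if err != nil {
            panic(err)
        }
        // strings.Fields splits on whitespace, giving us the words.
        for _, w := range strings.Fields(string(contents)) {
            wordCount[w]++
        }
    }
    fmt.Println(wordCount)
}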
This is simple enough, but offers no scope for explicit parallelism. In the MapReduce paradigm, we can define the map and reduce functions for the word counting program as:
map(string key, string value):
    // key: file name
    // value: file contents
    for each word w in value:
        EmitIntermediate(w, "1")

reduce(string key, iterator values):
    // key: a word
    // values: a list of counts
    int result = 0
    for each v in values:
        result += ParseInt(v)
    Emit(AsString(result))
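To make the pair concrete in Go (the language of the library linked at the end), here is a hedged sketch; the KeyValue type, the function signatures, and the toy single-process driver are illustrative assumptions, not any particular library's API:

package main

import (
    "fmt"
    "strconv"
    "strings"
)

// KeyValue is an illustrative intermediate pair type.
type KeyValue struct {
    Key, Value string
}

// mapFn emits ("word", "1") for every word in the file contents.
func mapFn(filename, contents string) []KeyValue {
    var out []KeyValue
    for _, w := range strings.Fields(contents) {
        out = append(out, KeyValue{Key: w, Value: "1"})
    }
    return out
}

// reduceFn sums the counts emitted for a single word.
func reduceFn(key string, values []string) string {
    result := 0
    for _, v := range values {
        n, _ := strconv.Atoi(v)
        result += n
    }
    return strconv.Itoa(result)
}

// A toy, single-process driver: map every input, group intermediate
// pairs by key (the "shuffle"), then reduce each group.
func main() {
    inputs := map[string]string{ // hypothetical file name -> contents
        "a.txt": "the quick brown fox",
        "b.txt": "the lazy dog and the fox",
    }
    groups := map[string][]string{}
    for name, contents := range inputs {
        for _, kv := range mapFn(name, contents) {
            groups[kv.Key] = append(groups[kv.Key], kv.Value)
        }
    }
    for word, counts := range groups {
        fmt.Printf("%s %s\n", word, reduceFn(word, counts))
    }
}

The driver makes the implicit "shuffle" step visible: all intermediate values for a given word are grouped together before reduce ever sees them, and that grouping is exactly what a real MapReduce runtime parallelizes across machines.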
Conceptually, the map and reduce functions supplied by the user have the following associated types:
map (k1, v1) --> list(k2, v2)
reduce (k2, list(v2)) --> list(v2)
The input keys and values are drawn from a different domain than the output keys and values. Furthermore, the intermediate keys and values are from the same domain as the output keys and values.
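If it helps, the same type relationship can be written as illustrative Go generic signatures (a sketch, not the linked library's API); note how k1/v1 appear only on the input side, while v2 is shared by the intermediate and output sides:

package mr

// Pair is a generic key/value pair.
type Pair[K, V any] struct {
    Key   K
    Value V
}

// Map takes an input pair (k1, v1) and emits intermediate pairs (k2, v2).
type Map[K1, V1, K2, V2 any] func(key K1, value V1) []Pair[K2, V2]

// Reduce takes an intermediate key k2 with all of its values and emits
// output values drawn from the same domain (v2) as the intermediate values.
type Reduce[K2, V2 any] func(key K2, values []V2) []V2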
A Go code library can be found here: https://github.com/arorashu/mapreduce
Thanks for reading. Let me know what you think by connecting on Twitter: @shubham_arora_0