In the previous post we saw how to write a MapReduce program for finding the top-n items of a dataset.
The code in the mapper emits a key-value pair for every word found, passing the word as the key and 1 as the value. Since the book has roughly 38,000 words, the amount of data transmitted from the mappers to the reducers is proportional to that number. One way to improve the network performance of this program is to rewrite the mapper as follows.
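A minimal sketch of such a mapper is shown below, assuming the usual word-count input types (LongWritable offsets and Text lines). The enclosing class name TopNDriver, the field name wordCountMap and the lowercase tokenization are illustrative details; the complete class from the post may differ.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TopNDriver {   // placeholder for the job's driver class

    public static class TopNMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

        // in-mapper aggregation: every distinct word maps to the number of times
        // this mapper has seen it so far
        private final Map<String, Integer> wordCountMap = new HashMap<String, Integer>();

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokenizer = new StringTokenizer(value.toString());
            while (tokenizer.hasMoreTokens()) {
                String word = tokenizer.nextToken().toLowerCase();
                // instead of emitting (word, 1) for every occurrence,
                // accumulate the count locally
                Integer count = wordCountMap.get(word);
                wordCountMap.put(word, count == null ? 1 : count + 1);
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            // Hadoop calls cleanup() once, after the mapper has processed all of its
            // input: emit each distinct word exactly once, with its aggregated count
            for (Map.Entry<String, Integer> entry : wordCountMap.entrySet()) {
                context.write(new Text(entry.getKey()), new IntWritable(entry.getValue()));
            }
        }
    }
}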
As we can see, we define a HashMap that uses the words as keys and the number of occurrences as values; inside the loop, instead of emitting every word to the reducer, we put it into the map: if the word is already present we increment its value, otherwise we set it to one. We also override the cleanup method, which Hadoop calls when the mapper has finished processing its input; in this method we can now emit the words to the reducers. This way we save a lot of network traffic, because every distinct word is sent to the reducers only once.
The complete code of this class is available on . In the next post we'll see how to use combiners to leverage this approach.

from: http://andreaiacono.blogspot.com/2014/03/enhanced-mapreduce-for-top-n-items.html