Enhanced MapReduce for Top N items
Published: 2019-06-19


In the previous post we saw how to write a MapReduce program for finding the top-n items of a dataset.

The mapper in that code emits a key-value pair for every word it finds, with the word as the key and 1 as the value. Since the book has roughly 38,000 words, the number of pairs sent from mappers to reducers is proportional to that number. One way to reduce this network traffic is to rewrite the mapper as follows:

    public static class TopNMapper extends Mapper<Object, Text, Text, IntWritable> {

        // Per-mapper word counts, accumulated across all map() calls
        private Map<String, Integer> countMap = new HashMap<>();

        @Override
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            // Lowercase the line and replace punctuation with spaces
            String cleanLine = value.toString().toLowerCase().replaceAll("[_|$#<>\\^=\\[\\]\\*/\\\\,;,.\\-:()?!\"']", " ");
            StringTokenizer itr = new StringTokenizer(cleanLine);
            while (itr.hasMoreTokens()) {
                String word = itr.nextToken().trim();
                if (countMap.containsKey(word)) {
                    countMap.put(word, countMap.get(word)+1);
                } else {
                    countMap.put(word, 1);
                }
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            // Emit each distinct word once, with its accumulated count
            for (String key: countMap.keySet()) {
                context.write(new Text(key), new IntWritable(countMap.get(key)));
            }
        }
    }

As we can see, we define a HashMap that uses words as keys and their number of occurrences as values. Inside the loop, instead of emitting every word to the reducers, we put it into the map: if the word is already there, we increment its value; otherwise we set it to one. We also override the cleanup method, which Hadoop calls once the mapper has finished processing its input. In this method we can now emit the words to the reducers. This way we save a lot of network transmissions, because each mapper sends every distinct word to the reducers only once, together with its count.
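To get a feeling for the saving, here is a minimal sketch in plain Java, with no Hadoop dependency, that counts how many pairs each strategy would emit for the same input. The class name `InMapperAggregationSketch`, the two helper methods and the sample lines are made up for illustration and are not part of the original program; punctuation cleaning is left out to keep it short.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

// Hypothetical stand-alone sketch: simulates what a single mapper would emit.
public class InMapperAggregationSketch {

    // Per-token strategy: one (word, 1) pair for every token in the input.
    static List<Map.Entry<String, Integer>> emitPerToken(List<String> lines) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String line : lines) {
            StringTokenizer itr = new StringTokenizer(line.toLowerCase());
            while (itr.hasMoreTokens()) {
                emitted.add(Map.entry(itr.nextToken(), 1));
            }
        }
        return emitted;
    }

    // Aggregating strategy: count locally, then emit one pair per distinct word.
    static Map<String, Integer> emitAggregated(List<String> lines) {
        Map<String, Integer> countMap = new HashMap<>();
        for (String line : lines) {
            StringTokenizer itr = new StringTokenizer(line.toLowerCase());
            while (itr.hasMoreTokens()) {
                // merge() inserts 1 for a new word, or adds 1 to the existing count
                countMap.merge(itr.nextToken(), 1, Integer::sum);
            }
        }
        return countMap;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("the cat and the dog", "the dog and the bird");
        System.out.println("per-token pairs: " + emitPerToken(lines).size());    // 10
        System.out.println("aggregated pairs: " + emitAggregated(lines).size()); // 5
    }
}
```

On this tiny input the aggregated version emits half as many pairs. On real text, where common words repeat many times, the gap is likely to be much larger, at the cost of keeping the map in the mapper's memory until cleanup runs.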

The complete code of this class is available on . 
In the next post we'll see how to use combiners to leverage this approach.

from: http://andreaiacono.blogspot.com/2014/03/enhanced-mapreduce-for-top-n-items.html

Repost address: http://cijax.baihongyu.com/
