Hadoop MapReduce IDEA上应用开发配置

开发Hadoop MapReduce应用，首先使用idea新建一个maven项目，以便编译MapReduce程序并通过命令行或在自己的IDE中以本地(独立，standalone)模式运行它们。

1.Maven POM
编译和测试Map-Reduce 程序时需要的依赖项

<project>
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.hadoopbook</groupId>
    <artifactId>hadoop-book-mr-dev</artifactId>
    <version>4.0</version>
    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <hadoop.version>2.5.1</hadoop.version>
    </properties>
    <dependencies>
        <!--Hadoopmainclientartifact-->
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
        <!--Unittestartifacts-->
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.11</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.mrunit</groupId>
            <artifactId>mrunit</artifactId>
            <version>1.1.0</version>
            <classifier>hadoop2</classifier>
            <scope>test</scope>
        </dependency>
        <!--Hadooptestartifactforrunningminiclusters-->
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-minicluster</artifactId>
            <version>${hadoop.version}</version>
            <scope>test</scope>
        </dependency
    </dependencies>
    <build>
        <finalName>hadoop-examples</finalName>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.1</version>
                <configuration>
                    <source>1.6</source>
                    <target>1.6</target>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-jar-plugin</artifactId>
                <version>2.5</version>
                <configuration>
                    <outputDirectory>${basedir}</outputDirectory>
                </configuration>
            </p1ugin>
        </plugins>
    </build>
</project>

要想构建MapReduce项目，你只需要有hadoop-client依赖关系，它包含了和HDFS及MapReduce 交互所需要的所有Hadoop client-side 类。当运行单元测试时，我们要使用junit类;当写MapReduce测试用例时，我们使用mrunit类。hadoop-minicluster 库中包含了“mini-”集群，这有助于在一个单JVM中运行Hadoop集群进行测试。

2.计数器案例代码

public class WordCount2 {

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable>{

        static enum CountersEnum { INPUT_WORDS }

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        private boolean caseSensitive;
        private Set<String> patternsToSkip = new HashSet<String>();

        private Configuration conf;
        private BufferedReader fis;

        @Override
        public void setup(Context context) throws IOException,
                InterruptedException {
            conf = context.getConfiguration();
            caseSensitive = conf.getBoolean("wordcount.case.sensitive", true);
            if (conf.getBoolean("wordcount.skip.patterns", false)) {
                URI[] patternsURIs = Job.getInstance(conf).getCacheFiles();
                for (URI patternsURI : patternsURIs) {
                    Path patternsPath = new Path(patternsURI.getPath());
                    String patternsFileName = patternsPath.getName().toString();
                    parseSkipFile(patternsFileName);
                }
            }
        }

        private void parseSkipFile(String fileName) {
            try {
                fis = new BufferedReader(new FileReader(fileName));
                String pattern = null;
                while ((pattern = fis.readLine()) != null) {
                    patternsToSkip.add(pattern);
                }
            } catch (IOException ioe) {
                System.err.println("Caught exception while parsing the cached file '"
                        + StringUtils.stringifyException(ioe));
            }
        }

        @Override
        public void map(Object key, Text value, Context context
        ) throws IOException, InterruptedException {
            String line = (caseSensitive) ?
                    value.toString() : value.toString().toLowerCase();
            for (String pattern : patternsToSkip) {
                line = line.replaceAll(pattern, "");
            }
            StringTokenizer itr = new StringTokenizer(line);
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
                Counter counter = context.getCounter(CountersEnum.class.getName(),
                        CountersEnum.INPUT_WORDS.toString());
                counter.increment(1);
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text,IntWritable,Text,IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values,
                           Context context
        ) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        GenericOptionsParser optionParser = new GenericOptionsParser(conf, args);
        String[] remainingArgs = optionParser.getRemainingArgs();
        if ((remainingArgs.length != 2) && (remainingArgs.length != 4)) {
            System.err.println("Usage: wordcount <in> <out> [-skip skipPatternFile]");
            System.exit(2);
        }
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount2.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        List<String> otherArgs = new ArrayList<String>();
        for (int i=0; i < remainingArgs.length; ++i) {
            if ("-skip".equals(remainingArgs[i])) {
                job.addCacheFile(new Path(remainingArgs[++i]).toUri());
                job.getConfiguration().setBoolean("wordcount.skip.patterns", true);
            } else {
                otherArgs.add(remainingArgs[i]);
            }
        }
        FileInputFormat.addInputPath(job, new Path(otherArgs.get(0)));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs.get(1)));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

GenericOptionsParser是Hadoop 自带了的些辅助类。用来解释常用的Hadoop命令行选项，并根据需要为Configuration对象设置相应的取值。通常不直接使用GenericOptionsParser. 更方便的方式是实现Tool接口，通过ToolRunner来运行应用程序。

3.本地测试运行mapreduce

更改运行参数设置，添加输入、输出参数，对应GenericOptionsParser 获取的两个参数

点击 Edit Configurations

然后直接点击运行

然后就可以去输出文件查看内容

4.提交到集群运行mapreduce

打包好jar报后，上传到集群，然后运行下面命令就可以再集群运行

hadoop jar WordCount-1.0-SNAPSHOT.jar cn/busuanzi/bigdata/wordcount/WordCount2 /cn/busuanzi/big-data/wordcount/input/ /cn/busuanzi/big-data/wordcount/output/2

liferay portal 可后台配置的portlet配置页面开发 idea开发spring boot实现修改代码立即热部署 intellij idea配置maven和pom hadoop 集群配置规划与调优 hadoop mapreduce 概念及原理 hadoop hbase phoenix 大数据集群环境安装配置 spark streaming kafka 开发案例与环境配置 hadoop 修改参数后刷新查看配置 intellij idea开发spark程序连接本地集群 hadoop mapreduce和spark对比 hadoop dfs.datanode.data.dir配置多目录及数据均衡 hadoop和hbase修改配置pid文件路径 scala intellij idea 开发环境搭建 scala idea maven开发配置 hadoop 大数据集群环境搭建配置 hadoop mapreduce 教程 hadoop mapreduce 编程模型 hadoop mapreduce idea上应用开发配置 hadoop mapreduce 运行原理和机制 php yaf 应用配置