7.0 MapReduce Programming
Having learned how MapReduce is used, we can already handle statistical and retrieval tasks such as Word Count, but MapReduce is capable of far more.
MapReduce relies on developers to implement functionality through programming: data is processed by implementing the methods related to Map and Reduce.
To demonstrate this process simply, we will write a Word Count program by hand.
Note: MapReduce depends on the Hadoop libraries, but since this tutorial runs Hadoop in a Docker container, it is difficult to set up a full development environment there. Actual development work (including debugging) therefore requires a computer running Hadoop; here we only cover deploying the finished program.
MyWordCount.java File Code
/**
 * Reference declaration
 * This program is adapted from http://hadoop.apache.org/docs/r1.0.4/cn/mapred_tutorial.html
 */
package com.tutorialpro.hadoop;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

/**
 * Methods related to `Map`
 */
class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key,
                    Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter)
            throws IOException {
        // Split the line into tokens and emit a (word, 1) pair for each token
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}

/**
 * Methods related to `Reduce`
 */
class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key,
                       Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter)
            throws IOException {
        // Sum all the counts collected for this word and emit the total
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}

public class MyWordCount {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(MyWordCount.class);
        conf.setJobName("my_word_count");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        // The first argument is the input path
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        // The second argument is the output path
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}
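To see what the Map and Reduce phases actually do to the data, here is a minimal, Hadoop-free sketch of the same data flow that runs as plain Java. The class name `WordCountSketch`, the sample input lines, and the local in-memory grouping are all assumptions made for illustration; a real job shuffles the (word, 1) pairs across the cluster between the two phases instead of collecting them in one list.

```java
import java.util.*;

public class WordCountSketch {

    // Map phase: split each line into tokens and emit a (word, 1) pair per token,
    // mirroring the `Map` class above.
    static List<AbstractMap.SimpleEntry<String, Integer>> mapPhase(List<String> lines) {
        List<AbstractMap.SimpleEntry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                pairs.add(new AbstractMap.SimpleEntry<>(tokenizer.nextToken(), 1));
            }
        }
        return pairs;
    }

    // Reduce phase: group the pairs by key and sum the values per word,
    // mirroring the `Reduce` class above.
    static SortedMap<String, Integer> reducePhase(
            List<AbstractMap.SimpleEntry<String, Integer>> pairs) {
        SortedMap<String, Integer> counts = new TreeMap<>();
        for (AbstractMap.SimpleEntry<String, Integer> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("I love hadoop", "I like hadoop");
        SortedMap<String, Integer> counts = reducePhase(mapPhase(lines));
        // Print each word and its count, tab-separated, sorted by key,
        // the same layout a TextOutputFormat result file uses.
        counts.forEach((word, count) -> System.out.println(word + "\t" + count));
    }
}
```

Running this prints `I 2`, `hadoop 2`, `like 1`, `love 1` (tab-separated), which matches the shape of the real job's output below: keys sorted, one word per line, followed by its total.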
Please save the contents of this Java file to the NameNode container, suggested location:
/home/hadoop/MyWordCount/com/tutorialpro/hadoop/MyWordCount.java
Note: some Docker environments with the JDK installed do not support Chinese characters, so to be safe, remove any Chinese comments from the code above.
Enter the directory:
cd /home/hadoop/MyWordCount
Compile:
javac -classpath ${HADOOP_HOME}/share/hadoop/mapreduce/hadoop-mapreduce-client-core-3.1.4.jar:${HADOOP_HOME}/share/hadoop/client/hadoop-client-api-3.1.4.jar com/tutorialpro/hadoop/MyWordCount.java
Package:
jar -cf my-word-count.jar com
Execute:
hadoop jar my-word-count.jar com.tutorialpro.hadoop.MyWordCount /wordcount/input /wordcount/output2
View the results:
hadoop fs -cat /wordcount/output2/part-00000
Output:
I 4
hadoop 2
like 2
love 2
tutorialpro 2