Hadoop Essentials Tutorial 9: The Inverted Index in Hadoop

Source: Internet
Author: User
Tags: iterable, hadoop, fs

Development environment

Hardware environment: four CentOS 6.5 servers (one master node, three slave nodes)
Software environment: Java 1.7.0_45, hadoop-1.2.1

1. Inverted index

The inverted index is the most common data structure in document retrieval systems and is widely used in full-text search engines. It stores a mapping from a word (or phrase) to the document or set of documents in which it appears, providing a way to look up documents by their content. Because it works in the opposite direction of a forward index (finding documents from keywords rather than listing the keywords a document contains), it is called an inverted index. Typically, an inverted index consists of a word (or phrase) together with a list of related documents (document ID numbers, or the URIs where the documents are located).

For example, Word 1 appears in {Document 1, Document 4, ...}, Word 2 appears in {Document 3, Document 5, ...}, and Word 3 appears in {Document 1, Document 8, ...}. In practice, each document in the list also carries a weight indicating how relevant that document is to the search term.

The most common weight is the word frequency, that is, the number of times the word occurs in the document. Taking English text as an example, the "MapReduce" line in the index file indicates that the word "MapReduce" appears once in text T0, once in T1, and twice in T2. When the search terms are "MapReduce", "is", and "simple", the matching set is {T0, T1, T2} ∩ {T0, T1} ∩ {T0, T1} = {T0, T1}; that is, texts T0 and T1 both contain all of the query words, and only in T0 do they appear consecutively.
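To make the structure concrete, here is a minimal, Hadoop-free Java sketch that builds such an index in memory and answers the three-word query above by intersecting posting lists. The three sample texts are an assumption inferred from this tutorial's input files and final output (steps 5 and 7 below); class and variable names are illustrative only.

import java.util.*;

public class TinyInvertedIndex {
    public static void main(String[] args) {
        // Sample texts; contents are assumed, inferred from the tutorial's
        // input file sizes (step 5) and final output (step 7).
        Map<String, String> docs = new LinkedHashMap<String, String>();
        docs.put("T0", "MapReduce is simple");
        docs.put("T1", "MapReduce is Powerful is simple");
        docs.put("T2", "Hello MapReduce bye MapReduce");

        // The inverted index: word -> (document -> word frequency)
        Map<String, Map<String, Integer>> index =
                new TreeMap<String, Map<String, Integer>>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            for (String word : doc.getValue().split("\\s+")) {
                Map<String, Integer> postings = index.get(word);
                if (postings == null) {
                    postings = new TreeMap<String, Integer>();
                    index.put(word, postings);
                }
                Integer freq = postings.get(doc.getKey());
                postings.put(doc.getKey(), freq == null ? 1 : freq + 1);
            }
        }
        // e.g. MapReduce -> {T0=1, T1=1, T2=2}, is -> {T0=1, T1=2}, ...
        System.out.println(index);

        // A multi-word query intersects the posting lists:
        // {T0,T1,T2} ∩ {T0,T1} ∩ {T0,T1} = {T0,T1}
        Set<String> result = new TreeSet<String>(index.get("MapReduce").keySet());
        result.retainAll(index.get("is").keySet());
        result.retainAll(index.get("simple").keySet());
        System.out.println(result); // prints [T0, T1]
    }
}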

2. Map process

The input files are first processed by the default TextInputFormat class, which produces, for each line of text, its byte offset (the key) and its content (the value). The map function must then extract from each <key, value> pair the three pieces of information an inverted index needs: the word, the document URI, and the word frequency. Two problems arise here. First, a <key, value> pair can hold only two values, so unless a custom Hadoop data type is used, two of the three pieces of information must be packed into a single key or value. Second, word-frequency counting and document-list assembly cannot both be completed in a single reduce pass, so a combine step must be added to handle the word-frequency counting.

Here the word and the URI are combined into the key (for example "MapReduce:1.txt") and the frequency is used as the value. The advantage is that the map-side sorting of the MapReduce framework can then be exploited: the per-occurrence counts of the same word in the same document are grouped into a list and handed to the combine step, which performs a WordCount-like aggregation.
The core map code is shown below; for the full source, see InvertedIndex/src/com/zonesion/hdfs/InvertedIndex.java.

public static class InvertedIndexMapper extends Mapper<Object, Text, Text, Text> {
    private Text keyInfo = new Text();   // holds the word:URI combination
    private Text valueInfo = new Text(); // holds the word frequency
    private FileSplit split;             // holds the input split

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        split = (FileSplit) context.getInputSplit();
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            // The key consists of the word and the document URI
            keyInfo.set(itr.nextToken() + ":" + split.getPath().toString());
            valueInfo.set("1");
            context.write(keyInfo, valueInfo); // output: <"MapReduce:1.txt", 1>
        }
    }
}
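With input files like those suggested in step 5, this mapper emits one pair per token occurrence; for the line "MapReduce is simple" in 0.txt it would write <"MapReduce:hdfs://master:9000/user/hadoop/InvertedIndex/input/0.txt", 1>, and likewise for "is" and "simple" (the URI shown here is the one visible in the final output of step 7).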
3. Combine process

After the map method runs, the combine step accumulates the values that share the same key, which yields the frequency of each word in each document. If the map output were fed directly into the reduce step, a problem would arise during shuffling: all records for the same word (consisting of the word, URI, and frequency) should be handled by the same reducer, but the current "word:URI" key cannot guarantee that. The key and value therefore have to be reshaped: the word alone becomes the key, and the URI together with the frequency becomes the value. The benefit is that the MapReduce framework's default HashPartitioner can then complete the shuffle, sending all records for the same word to the same reducer.

The core combine code is shown below; for the full source, see InvertedIndex/src/com/zonesion/hdfs/InvertedIndex.java.

public static class InvertedIndexCombiner
        extends Reducer<Text, Text, Text, Text> {
    private Text info = new Text();

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // input: <"MapReduce:1.txt", list(1,1,1,1)>
        int sum = 0;
        for (Text value : values) {
            sum += Integer.parseInt(value.toString());
        }
        // split "word:URI" at the first colon; the URI and the sum become the value
        int splitIndex = key.toString().indexOf(":");
        info.set(key.toString().substring(splitIndex + 1) + ":" + sum);
        key.set(key.toString().substring(0, splitIndex));
        context.write(key, info); // output: <"MapReduce", "0.txt:2">
    }
}
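A caveat not mentioned in the original text: Hadoop makes no guarantee about how many times (if at all) a combiner is applied to the map output, so combiners are normally expected to preserve the key/value format of the map output. Because this combiner rewrites the key from "word:URI" to "word", the job implicitly assumes the combine step runs exactly once per group; that holds for a toy dataset like this one, but it is a fragile pattern for production use.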
4. Reduce process

After these two steps, the reduce step only needs to concatenate the values belonging to the same key into the format required by the inverted index file; everything else can be left to the MapReduce framework.

The core reduce code is shown below; for the full source, see InvertedIndex/src/com/zonesion/hdfs/InvertedIndex.java.

public static class InvertedIndexReducer
        extends Reducer<Text, Text, Text, Text> {
    private Text result = new Text();

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // input: <"MapReduce", list("0.txt:1","1.txt:1","2.txt:1")>
        StringBuilder fileList = new StringBuilder();
        for (Text value : values) { // value is e.g. "0.txt:1"
            fileList.append(value.toString()).append(";");
        }
        result.set(fileList.toString());
        context.write(key, result); // output: <"MapReduce", "0.txt:1;1.txt:1;2.txt:1;">
    }
}
5. Driver implementation

The core driver code is shown below; for the full source, see InvertedIndex/src/com/zonesion/hdfs/InvertedIndex.java.

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
        System.err.println("Usage: InvertedIndex <in> <out>");
        System.exit(2);
    }
    Job job = new Job(conf, "InvertedIndex");
    job.setJarByClass(InvertedIndex.class);
    // Set the Mapper, Combiner, and Reducer classes
    job.setMapperClass(InvertedIndexMapper.class);
    job.setCombinerClass(InvertedIndexCombiner.class);
    job.setReducerClass(InvertedIndexReducer.class);
    // Set the key and value output types of the map and reduce phases to Text
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    // Set the job's input and output paths
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    // Run the job and exit when it completes
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}
6. Deployment and run

1) Start the Hadoop cluster
[hadoop@master CompanyJoinAddress]$ start-dfs.sh
[hadoop@master CompanyJoinAddress]$ start-mapred.sh
[hadoop@master CompanyJoinAddress]$ jps
5283 SecondaryNameNode
5445 JobTracker
5578 Jps
5109 NameNode
2) Deploy the source code
# Set up the working directory
[hadoop@master ~]$ mkdir -p /usr/hadoop/workspace/MapReduce
# Deploy the source: copy the InvertedIndex folder to /usr/hadoop/workspace/MapReduce/

... You can download the InvertedIndex source directly

3) Compile the source files
# Switch to the working directory
[hadoop@master ~]$ cd /usr/hadoop/workspace/MapReduce/InvertedIndex
# Compile the source
[hadoop@master InvertedIndex]$ javac -classpath /usr/hadoop/hadoop-core-1.2.1.jar:/usr/hadoop/lib/commons-cli-1.2.jar -d bin src/com/zonesion/hdfs/InvertedIndex.java
[hadoop@master InvertedIndex]$ ls bin/com/zonesion/hdfs/ -la
total 20
drwxrwxr-x 2 hadoop hadoop 4096 Sep 18 17:09 .
drwxrwxr-x 3 hadoop hadoop   17 Sep 18 17:09 ..
-rw-rw-r-- 1 hadoop hadoop 1982 Sep 18 17:09 InvertedIndex.class
-rw-rw-r-- 1 hadoop hadoop 2173 Sep 18 17:09 InvertedIndex$InvertedIndexCombiner.class
-rw-rw-r-- 1 hadoop hadoop 2103 Sep 18 17:09 InvertedIndex$InvertedIndexMapper.class
-rw-rw-r-- 1 hadoop hadoop 1931 Sep 18 17:09 InvertedIndex$InvertedIndexReducer.class
4) Package the jar file
[hadoop@master InvertedIndex]$ jar -cvf InvertedIndex.jar -C bin/ .
added manifest
adding: com/ (in = 0) (out = 0) (stored 0%)
adding: com/zonesion/ (in = 0) (out = 0) (stored 0%)
adding: com/zonesion/hdfs/ (in = 0) (out = 0) (stored 0%)
adding: com/zonesion/hdfs/InvertedIndex$InvertedIndexMapper.class (in = 2103) (out = 921) (deflated 56%)
adding: com/zonesion/hdfs/InvertedIndex$InvertedIndexCombiner.class (in = 2173) (out = 944) (deflated 56%)
adding: com/zonesion/hdfs/InvertedIndex$InvertedIndexReducer.class (in = 1931) (out = 830) (deflated 57%)
adding: com/zonesion/hdfs/InvertedIndex.class (in = 1982) (out = 1002) (deflated 49%)
5) Upload the input files
# Create the InvertedIndex/input/ directory
[hadoop@master InvertedIndex]$ hadoop fs -mkdir InvertedIndex/input/
# Upload the files to InvertedIndex/input/
[hadoop@master InvertedIndex]$ hadoop fs -put input/*.txt /user/hadoop/InvertedIndex/input
# Verify that the upload succeeded
[hadoop@master InvertedIndex]$ hadoop fs -ls /user/hadoop/InvertedIndex/input
Found 3 items
-rw-r--r--   1 hadoop supergroup 20 2014-09-18 17:12 /user/hadoop/InvertedIndex/input/0.txt
-rw-r--r--   1 hadoop supergroup 32 2014-09-18 17:12 /user/hadoop/InvertedIndex/input/1.txt
-rw-r--r--   1 hadoop supergroup 30 2014-09-18 17:12 /user/hadoop/InvertedIndex/input/2.txt
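The tutorial does not show the contents of the input files, but the file sizes above, together with the final output in step 7, are consistent with contents like the following (an inference, not original data):

0.txt: MapReduce is simple
1.txt: MapReduce is Powerful is simple
2.txt: Hello MapReduce bye MapReduce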
6) Run the jar file
[hadoop@master InvertedIndex]$ hadoop jar InvertedIndex.jar com.zonesion.hdfs.InvertedIndex InvertedIndex/input InvertedIndex/output
14/09/18 17:16:40 INFO input.FileInputFormat: Total input paths to process : 3
14/09/18 17:16:40 INFO util.NativeCodeLoader: Loaded the native-hadoop library
14/09/18 17:16:40 WARN snappy.LoadSnappy: Snappy native library not loaded
14/09/18 17:16:41 INFO mapred.JobClient: Running job: job_201409150922_0003
14/09/18 17:16:42 INFO mapred.JobClient:  map 0% reduce 0%
14/09/18 17:16:45 INFO mapred.JobClient:  map 100% reduce 0%
14/09/18 17:16:51 INFO mapred.JobClient:  map 100% reduce 33%
14/09/18 17:16:53 INFO mapred.JobClient:  map 100% reduce 100%
14/09/18 17:16:53 INFO mapred.JobClient: Job complete: job_201409150922_0003
14/09/18 17:16:53 INFO mapred.JobClient: Counters: 29
......
7) View the output results
# List the output directory on HDFS
[hadoop@master InvertedIndex]$ hadoop fs -ls /user/hadoop/InvertedIndex/output
Found 3 items
-rw-r--r--   1 hadoop supergroup   0 2014-07-21 15:31 /user/hadoop/InvertedIndex/output/_SUCCESS
drwxr-xr-x   - hadoop supergroup   0 2014-07-21 15:30 /user/hadoop/InvertedIndex/output/_logs
-rw-r--r--   1 hadoop supergroup 665 2014-07-21 15:31 /user/hadoop/InvertedIndex/output/part-r-00000
# View the contents of the results file
[hadoop@master InvertedIndex]$ hadoop fs -cat /user/hadoop/InvertedIndex/output/part-r-00000
Hello	hdfs://master:9000/user/hadoop/InvertedIndex/input/2.txt:1;
MapReduce	hdfs://master:9000/user/hadoop/InvertedIndex/input/2.txt:2;hdfs://master:9000/user/hadoop/InvertedIndex/input/1.txt:1;hdfs://master:9000/user/hadoop/InvertedIndex/input/0.txt:1;
Powerful	hdfs://master:9000/user/hadoop/InvertedIndex/input/1.txt:1;
bye	hdfs://master:9000/user/hadoop/InvertedIndex/input/2.txt:1;
is	hdfs://master:9000/user/hadoop/InvertedIndex/input/0.txt:1;hdfs://master:9000/user/hadoop/InvertedIndex/input/1.txt:2;
simple	hdfs://master:9000/user/hadoop/InvertedIndex/input/1.txt:1;hdfs://master:9000/user/hadoop/InvertedIndex/input/0.txt:1;
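Each line of part-r-00000 is a word followed by its list of URI:frequency entries, which is exactly the inverted index structure described in section 1.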
