Hadoop: search engines and the inverted index


The use of inverted indexes

The key step in building a search engine is constructing an inverted index. An inverted index is generally expressed as a keyword followed by its frequency (number of occurrences) and its locations (which article or page it appears in, together with metadata such as date and author). It is, in effect, an index over the hundreds of millions of pages on the Internet, much like the table of contents of a book: a reader who wants the chapters on a particular topic looks them up in the table of contents and jumps straight to the relevant pages, instead of scanning the book page by page from first to last. Inverted indexes are the cornerstone of search engines. Once the inverted index is built and a user issues a query, say by typing the keyword "inverted index" into the search box, the search engine does not send a crawler out to fetch every page again and scan each one top to bottom for the keyword. Instead, it looks up its pre-built inverted index file, finds all pages containing the keyword, sorts them by relevance, and finally displays the sorted results to the user.

The concept of inverted indexes

An inverted table is indexed by words (or characters). Each entry in the table is keyed by a word and records all documents in which that word appears: it stores the document IDs and the positions at which the word occurs within each document. Because the number of documents associated with each word changes dynamically, building and maintaining an inverted table is relatively complex; at query time, however, all documents for a query keyword can be obtained in one lookup, so querying is far more efficient than with a forward (document-ordered) table. In full-text search, fast query response is one of the most critical performance requirements; indexing runs in the background, so even though index construction is slower, it does not affect the responsiveness of the search engine as a whole.
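To make this concrete, here is a minimal in-memory sketch in Java; the class and method names are illustrative, not taken from any particular engine. Each word maps to a posting list of (document ID, position) pairs, so answering a query is a single map lookup rather than a scan over documents.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal in-memory inverted index: word -> list of (docId, position) postings.
public class TinyInvertedIndex {
    static class Posting {
        final int docId;
        final int position; // word offset within the document
        Posting(int docId, int position) { this.docId = docId; this.position = position; }
        public String toString() { return "(doc " + docId + ", pos " + position + ")"; }
    }

    private final Map<String, List<Posting>> index = new HashMap<>();

    // Indexing: tokenize a document and record every occurrence of every word.
    public void addDocument(int docId, String text) {
        String[] words = text.toLowerCase().split("\\s+");
        for (int pos = 0; pos < words.length; pos++) {
            index.computeIfAbsent(words[pos], w -> new ArrayList<>())
                 .add(new Posting(docId, pos));
        }
    }

    // Query time: one lookup returns all postings for the word.
    public List<Posting> lookup(String word) {
        return index.getOrDefault(word.toLowerCase(), List.of());
    }

    public static void main(String[] args) {
        TinyInvertedIndex idx = new TinyInvertedIndex();
        idx.addDocument(1, "hello world");
        idx.addDocument(2, "hello hadoop world");
        System.out.println(idx.lookup("world")); // [(doc 1, pos 1), (doc 2, pos 2)]
    }
}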

The composition of an inverted index

Document: While a typical search engine processes Internet web pages, the concept of a document is broader: it denotes any storage object that carries text. Compared with a web page it covers many more forms; files in formats such as Word, PDF, HTML, and XML can all be called documents, as can an e-mail, a text message, or a microblog post. In the rest of this text, "document" is often used to refer to a piece of text information.

Document collection: A set of documents is called a document collection. The huge number of Internet pages, or a large archive of e-mails, are concrete examples of document collections.

Document ID: Inside the search engine, each document in the collection is assigned a unique internal number that serves as its identifier and makes internal processing easier; this internal number is called the document number. Below, DocID is sometimes used as shorthand for the document number.

Word ID: Analogous to the document number, the search engine internally represents each word by a unique number, and this word number serves as the word's unique identifier.

Inverted index: An inverted index is a concrete storage form that realizes the word-document matrix; through it, the list of documents containing a word can be obtained quickly from the word itself. An inverted index consists of two main parts: the word dictionary and the inverted file.

Word dictionary (lexicon): The usual indexing unit of a search engine is the word. The word dictionary is the set of strings formed by all words appearing in the document collection; each index entry in the dictionary records some information about the word itself and a pointer to the word's inverted list.

Posting list (inverted list): The posting list records which documents contain a given word and where the word occurs within each document. Each record in the list is called a posting. From a word's posting list you can tell exactly which documents contain that word.

Inverted file: The posting lists of all words are usually stored sequentially in a file on disk, called the inverted file; the inverted file is the physical file in which the inverted index is stored.
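A hedged sketch of how these parts fit together, with illustrative field names of my own: each dictionary entry stores the word together with the location of its posting list inside the on-disk inverted file.

// Illustrative on-disk layout: the lexicon maps a word to the location of
// its posting list inside the inverted file.
class LexiconEntry {
    String word;        // the indexed term
    int documentFreq;   // number of documents containing the word
    long postingOffset; // byte offset of the posting list in the inverted file
    int postingLength;  // length in bytes of that posting list
}

At query time the engine finds the word in the lexicon, seeks to postingOffset in the inverted file, and reads postingLength bytes to obtain the posting list.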

The word dictionary is a very important part of the inverted index. It maintains information about all words that appear in the document collection and records, for each word, the position of its posting list within the inverted file. When serving a search, the engine looks the user's query words up in the word dictionary, obtains the corresponding posting lists, and uses them as the basis for subsequent ranking.
A large document collection may contain hundreds of thousands or even millions of distinct words, and how quickly a word can be located directly affects search response time, so efficient data structures are needed to build and search the word dictionary. Commonly used structures include the hash table with linked lists and tree-shaped dictionary structures.
4.1 Hash table with linked lists
Figure 1-7 is a schematic diagram of this dictionary structure, which consists of two parts:

The main part is a hash table. Each hash-table slot holds a pointer to a collision list, in which all words sharing the same hash value are chained together. Collision lists are needed because two different words can produce the same hash value; when that happens, it is called a hash collision, and the words with the same hash value are stored in the same linked list for later lookup.

Figure 1-7: Hash-with-linked-list dictionary structure
The dictionary structure is built up as the index is built. For example, when a new document is parsed and a word T appears in it, the hash function is first applied to T to obtain its hash value; the pointer stored in the corresponding hash-table slot then leads to the collision list. If the word is already present in the collision list, it has appeared in a previously parsed document; if it is not found there, it is being encountered for the first time and is appended to the collision list. Once all documents in the collection have been parsed this way, the corresponding dictionary structure is complete.

Answering a user query follows the same procedure as building the dictionary, except that a missing word is not added: if the word is not in the dictionary, it simply stays absent. Taking Figure 1-7 as an example, suppose the user queries for word 3. Hashing word 3 locates slot 2 of the hash table, whose stored pointer leads to the collision list; comparing word 3 against the words in that list finds a match, so the word exists in the dictionary and its posting list can then be read for the subsequent work. If the word were not found, no document in the collection would contain it and the search result would be empty.
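Below is a minimal Java sketch of this build-and-lookup behavior, assuming a fixed-size table and the JDK's string hash (both illustrative choices, not from the original text): insertion walks the collision list before adding, and lookup does the same walk without ever inserting.

// Toy dictionary: fixed-size hash table with separate chaining (collision lists).
class ChainedDictionary {
    static class Node {
        final String word;
        Node next;
        Node(String word, Node next) { this.word = word; this.next = next; }
    }

    private final Node[] slots = new Node[1024];

    private int slot(String word) {
        // Illustrative hash: fold the JDK string hash into the table size.
        return (word.hashCode() & 0x7fffffff) % slots.length;
    }

    // Index-building path: add the word unless it is already chained in its slot.
    void addIfAbsent(String word) {
        int i = slot(word);
        for (Node n = slots[i]; n != null; n = n.next) {
            if (n.word.equals(word)) return; // word seen in an earlier document
        }
        slots[i] = new Node(word, slots[i]); // first occurrence: prepend to the chain
    }

    // Query path: the same walk, but a missing word is NOT inserted.
    boolean contains(String word) {
        for (Node n = slots[slot(word)]; n != null; n = n.next) {
            if (n.word.equals(word)) return true;
        }
        return false; // no document contains this word; the result set is empty
    }
}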

4.2 Tree-shaped structures
The B-tree (or B+ tree) is another efficient lookup structure; Figure 1-8 is a B-tree structure diagram. Unlike hashing, a B-tree requires dictionary items to be ordered by size (numerically or lexicographically), whereas hashing imposes no such requirement on the data.
A B-tree forms a hierarchical search structure: each interior node indicates which subtree stores a given range of dictionary items, so navigation proceeds by comparing dictionary items by order, while the leaf nodes at the bottom store the words' address information, from which the word strings themselves can be retrieved.
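For illustration, the same ordered-dictionary behavior can be sketched with the JDK's TreeMap (a red-black tree rather than a true B-tree, but likewise an ordered tree structure); the offset values stand in for the address information in the leaves and are made up.

import java.util.SortedMap;
import java.util.TreeMap;

// Ordered dictionary: words kept in lexicographic order, so exact lookups
// and range scans (impossible with a hash table) are both supported.
public class TreeDictionaryDemo {
    public static void main(String[] args) {
        TreeMap<String, Long> lexicon = new TreeMap<>(); // word -> posting-list offset
        lexicon.put("hadoop", 0L);
        lexicon.put("hello", 128L);
        lexicon.put("index", 256L);
        lexicon.put("world", 384L);

        // Exact-match lookup, as in the hash dictionary.
        System.out.println(lexicon.get("index")); // 256

        // Prefix/range scan, which relies on the sorted order.
        SortedMap<String, Long> range = lexicon.subMap("ha", "hf");
        System.out.println(range.keySet()); // [hadoop, hello]
    }
}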

Simple Case Implementation

The raw data consists of three files: a.txt, b.txt, and c.txt. Create a new folder in HDFS and upload the three files into it (for example, with hdfs dfs -mkdir and hdfs dfs -put).


Then run the first MapReduce job:



import java.io.IOException;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;


public class IndexOne {

    public static class MapOne extends Mapper<LongWritable, Text, Text, LongWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Goal: turn the line "hello world" in a.txt into
            //   hello-->a.txt 1   and   world-->a.txt 1
            // 1. Read one line of input as a string
            String line = value.toString();
            // 2. Split the line into words on the space delimiter
            String[] words = StringUtils.split(line, " ");
            // 3. Use the context to get the input split this record came from
            FileSplit inputSplit = (FileSplit) context.getInputSplit();
            // 4. Use the split object to get the file name
            String fileName = inputSplit.getPath().getName();
            // 5. Emit the data; the output format is k: hello-->a.txt, v: 1
            for (String word : words) {
                context.write(new Text(word + "-->" + fileName), new LongWritable(1));
            }
        }
    }
    // ==================== reduce ====================
    public static class ReduceOne extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values,
                Context context) throws IOException, InterruptedException {
            // 1. The data received from the map side has the form hello-->a.txt 1,1,1
            // 2. Loop over the values collection and accumulate the counts
            long count = 0;
            for (LongWritable value : values) {
                count += value.get();
            }
            context.write(key, new LongWritable(count));
        }
    }

    // ==================== main method: submit the job ====================
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        // 1. Get the job object
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        // 2. Specify the class for the jar
        job.setJarByClass(IndexOne.class);
        // 3. Specify the mapper class
        job.setMapperClass(MapOne.class);
        // 4. Specify the reducer class
        job.setReducerClass(ReduceOne.class);
        // 5. Specify the output types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        // 6. Specify the input path
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        // 7. Be robust: delete the output path if it already exists
        Path output = new Path(args[1]);
        FileSystem fs = FileSystem.get(conf);
        if (fs.exists(output)) {
            fs.delete(output, true);
        }
        // 8. Specify the output path
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // 9. Submit the job
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The output after the first MapReduce run looks like this:
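Each record is a word-->file key, a tab, and the word's total count in that file. With the three sample files the records have this shape (the counts are illustrative and depend on the actual input):

hadoop-->a.txt	2
hello-->a.txt	3
hello-->b.txt	1
world-->c.txt	2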


The second MapReduce job then turns this output into the final inverted index:

import java.io.IOException;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;


public class IndexTwo {
    // -------------------- map() --------------------
    // Input lines (the output of the previous MapReduce job), e.g.:
    //   google-->c.txt	1
    //   hadoop-->a.txt	2
    // After map(), the output form is k: google, v: c.txt-->1
    public static class MapTwo extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // 1. Convert each input line to a string
            String line = value.toString();
            // 2. First split, on the tab between key and count:
            //    "google-->c.txt	1" becomes ["google-->c.txt", "1"]
            String[] fields = StringUtils.split(line, "\t");
            // 3. Second split, on the "-->" separator characters:
            //    "google-->c.txt" becomes ["google", "c.txt"]
            String[] words = StringUtils.split(fields[0], "-->");
            // 4. Pull out the individual fields
            String word = words[0];
            String fileName = words[1];
            long count = Long.parseLong(fields[1]);
            // 5. Reassemble the fields; the output format is k: google, v: c.txt-->1
            context.write(new Text(word), new Text(fileName + "-->" + count));
        }
    }
    public static class ReducerTwo extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // 1. Accept the data from the map side; link collects the file list
            String link = "";
            for (Text value : values) {
                // 2. Take each value out of the collection and concatenate it
                link += value + " ";
            }
            // 3. Emit the data; the output format is k: google, v: a.txt-->2 b.txt-->1
            context.write(key, new Text(link));
        }
    }
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setJarByClass(IndexTwo.class);
        job.setMapperClass(MapTwo.class);
        job.setReducerClass(ReducerTwo.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        // Be robust: delete the output path if it already exists
        Path output = new Path(args[1]);
        FileSystem fs = FileSystem.get(conf);
        if (fs.exists(output)) {
            fs.delete(output, true);
        }
        // Specify the output path
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Submit the job
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

After the second MapReduce run, the output is the inverted index itself:
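Each line now maps a word to every file that contains it, with per-file counts, which is exactly the inverted index. An illustrative shape (values depend on the actual input):

hadoop	a.txt-->2 c.txt-->1
hello	a.txt-->3 b.txt-->1
world	b.txt-->2 c.txt-->2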

