This study uses the MapReduce distributed programming model to implement a simple inverted index.
First, what is an inverted index?
An inverted index is the most commonly used data structure in document retrieval and is widely used in full-text search engines.
It stores a mapping from each word (or phrase) to its locations in a document or a set of documents, so that documents can be found by their content.
Because it maps content to documents, rather than mapping each document to the content it contains, it is called an inverted index.
The basic principle of an inverted index and the process of building one can be illustrated by the diagram.
Files of various types are parsed into plain text, the text is segmented into words (Chinese word segmentation in this case), and each word is combined with the number of the document it came from,
forming the simplest inverted index file: the inverted list (posting list).
An inverted list is a set of tuples of the form <term, <document ID, term position>>.
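To make the tuple structure concrete, here is a minimal single-machine sketch in Python, not the MapReduce version. It rests on several assumptions that are not from the original text: the function name build_inverted_index is hypothetical, words are split on whitespace as a stand-in for real Chinese word segmentation, and "position" is taken to mean the word's offset within its document.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Build term -> [(doc_id, position), ...] from a dict of doc_id -> text.

    Assumption: whitespace splitting stands in for word segmentation,
    and position is the zero-based word offset inside the document.
    """
    index = defaultdict(list)
    for doc_id in sorted(docs):
        terms = docs[doc_id].lower().split()
        for position, term in enumerate(terms):
            # One <term, <document ID, position>> tuple per occurrence.
            index[term].append((doc_id, position))
    return dict(index)


docs = {
    1: "MapReduce is simple",
    2: "simple index with MapReduce",
}
index = build_inverted_index(docs)
print(index["mapreduce"])  # [(1, 0), (2, 3)]
print(index["simple"])     # [(1, 2), (2, 0)]
```

Looking up a term returns every document that contains it together with where it occurs, which is the lookup direction a search engine needs.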
Using MapReduce to implement inverted indexes