NLP Chinese Information Processing-inverted index

Last Update:2013-12-04 Source: Internet

Author: User

An inverted index, also known as a postings file or inverted file, is an indexing method that stores a mapping from each word to its locations in a document or a set of documents, for use in full-text search. It is the most common data structure in document retrieval systems.

Inverted index analysis:

Take English as an example. The text to be indexed is as follows:
T0 = "it is what it is"
T1 = "what is it"
T2 = "it is a banana"
Indexing these texts yields, for each word, a set of pairs, each consisting of a document number and the position of the word within that document. Both the document numbers and the word positions start from 0. Therefore, "banana": {(2, 3)} means that "banana" occurs in the third document (T2), where it is the fourth word (position 3).
"a": {(2, 2)}
"banana": {(2, 3)}
"is": {(0, 1), (0, 4), (1, 1), (2, 1)}
"it": {(0, 0), (0, 3), (1, 2), (2, 0)}
"what": {(0, 2), (1, 0)}
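The listing above can be reproduced with a short sketch. It assumes that words are separated by whitespace; the function name build_index is made up for this illustration and does not come from the original code.

```python
from collections import defaultdict

def build_index(texts):
    """Map each word to a list of (document number, word position) pairs."""
    index = defaultdict(list)
    for doc_id, text in enumerate(texts):
        # Positions count whitespace-separated words, starting from 0.
        for pos, word in enumerate(text.split()):
            index[word].append((doc_id, pos))
    return dict(index)

texts = ["it is what it is", "what is it", "it is a banana"]
index = build_index(texts)
for word in sorted(index):
    print(word, index[word])
```

Each text is read once, so the cost grows with the total number of words rather than with the vocabulary size.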
If we search for the phrase "what is it", we first retrieve the position lists of every word in the phrase. All three words occur in both document 0 and document 1, but only in document 1 do they appear consecutively and in the right order, so only document 1 matches the phrase.
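A phrase lookup of this kind might be sketched as follows. This is a minimal illustration under the same whitespace-splitting assumption; build_index, docs_with_all_words and phrase_search are hypothetical helper names, not part of the original program.

```python
from collections import defaultdict

def build_index(texts):
    """Map each word to a list of (document number, word position) pairs."""
    index = defaultdict(list)
    for doc_id, text in enumerate(texts):
        for pos, word in enumerate(text.split()):
            index[word].append((doc_id, pos))
    return index

def docs_with_all_words(index, phrase):
    """Documents that contain every word of the phrase, in any order."""
    doc_sets = [{doc for doc, _ in index.get(w, ())} for w in phrase.split()]
    return sorted(set.intersection(*doc_sets)) if doc_sets else []

def phrase_search(index, phrase):
    """Documents in which the words occur consecutively and in order."""
    words = phrase.split()
    if not words:
        return []
    postings = [set(index.get(w, ())) for w in words]
    hits = set()
    for doc_id, pos in postings[0]:
        # The k-th word of the phrase must sit k positions after the first.
        if all((doc_id, pos + k) in postings[k] for k in range(1, len(words))):
            hits.add(doc_id)
    return sorted(hits)

index = build_index(["it is what it is", "what is it", "it is a banana"])
print(docs_with_all_words(index, "what is it"))
print(phrase_search(index, "what is it"))
```

The first call reports the documents that merely contain all the words, and the second keeps only those where the positions line up.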

Python Implementation of inverted indexes:

[Python]
#-------------------------------------------------------------------------------
# Name: InvertedIndex
# Purpose: inverted index
# Created: 02/04/2013
# Copyright: (c) neng2013
# Licence: <your licence>
#-------------------------------------------------------------------------------
import re

# Process 199801.txt to remove part-of-speech tags, dates and some impurities
# (the paragraph structure is retained).
# Input: 199801.txt
# Output: 1998020.new.txt
def pre_file(filename):
    print("Reading the corpus file %r..." % filename)
    with open(filename) as f:
        src_data = f.read()
    # Remove part-of-speech tags, '2017-01-001-001', and impurities such as '[' and ']nt'
    des_data = re.compile(r'(\/\w+)|(\d+\-\S+)|(\[)|(\]\S+)').sub('', src_data)
    des_filename = "1998020.new.txt"
    print("Writing file %r..." % des_filename)
    with open(des_filename, 'w') as f:
        f.write(des_data)
    print("Processing completed!")

# Create the inverted index
# Input: 1998020.new.txt
# Output: my_index.txt, one line per word (numbering starts from 0):
#         word[(paragraph number, position in paragraph), ...]
def create_inverted_index(filename):
    print("Reading file %r..." % filename)
    with open(filename) as f:
        src_data = f.read()
    # Variable description
    sub_list = []  # words of the paragraph currently being scanned
    word = []      # vocabulary (word table)
    result = {}    # output result {word: index}

    # Build the vocabulary
    sp_data = src_data.split()
    set_data = set(sp_data)  # deduplicate
    word = list(set_data)    # convert the set to a list so it can be accessed by position

    src_list = src_data.split("\n")  # split the text into paragraphs
    # Create the index
    for w in range(0, len(word)):
        index = []  # [(paragraph number, position), (paragraph number, position), ...]
        for i in range(0, len(src_list)):  # traverse all paragraphs
            sub_list = src_list[i].split()
            for j in range(0, len(sub_list)):  # traverse all words in a paragraph
                if sub_list[j] == word[w]:
                    index.append((i, j))
        result[word[w]] = index

    des_filename = "my_index.txt"
    print("Writing file %r..." % des_filename)
    with open(des_filename, 'w') as writefile:
        for k in result.keys():
            writefile.write(str(k) + str(result[k]) + "\n")
    print("Processing completed!")

# Main function
def main():
    # pre_file("199801.txt")  # initial processing of the corpus; writes 1998020.new.txt
    # The full preprocessed file is large and indexing it takes too long,
    # so only an excerpt, saved as 1998020.test.txt, is indexed here.
    create_inverted_index("1998020.test.txt")

# Run
if __name__ == '__main__':
    main()
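To see what the cleanup pattern in pre_file does, the sketch below applies it to a single line. The sample line is made up for illustration and is only assumed to resemble the tagged format that the comments describe (a leading ID, word/tag pairs, and a bracketed compound closed by ]nt); it is not copied from 199801.txt.

```python
import re

# Same pattern as in pre_file: part-of-speech tags, ID tokens, '[' and ']tag'.
CLEANUP = re.compile(r'(\/\w+)|(\d+\-\S+)|(\[)|(\]\S+)')

# Hypothetical sample line, assumed to resemble the corpus format.
sample = "19980101-01-001-001/m  迈向/v  [中国/ns  政府/n]nt  的/u  新/a  世纪/n"

cleaned = CLEANUP.sub('', sample)
tokens = cleaned.split()
print(tokens)
```

After the substitution only the words themselves remain, separated by whitespace, which is the form that create_inverted_index expects.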

Main Algorithm Description:

Input: the file to be indexed (1998020.test.txt)

Output: the index file (my_index.txt)

The algorithm scans the full text and, for every distinct word, records the paragraph number and the position within the paragraph of each occurrence.
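The program above rescans every paragraph once per vocabulary word, which is why indexing the full corpus takes so long. As an alternative sketch (not the author's method), the same mapping can be collected in a single pass and formatted in the same "word followed by its list of pairs" layout. The function names index_paragraphs and format_index are assumptions introduced for this example.

```python
from collections import defaultdict

def index_paragraphs(text):
    """Build {word: [(paragraph number, position), ...]} in one pass."""
    index = defaultdict(list)
    for para_no, paragraph in enumerate(text.split("\n")):
        for pos, word in enumerate(paragraph.split()):
            index[word].append((para_no, pos))
    return index

def format_index(index):
    """One output line per word: the word followed by its list of pairs."""
    return [str(word) + str(pairs) for word, pairs in index.items()]

# The three example texts, one per line, stand in for a corpus excerpt.
sample = "it is what it is\nwhat is it\nit is a banana"
lines = format_index(index_paragraphs(sample))
print("\n".join(lines))
```

Because each word is visited exactly once, the running time depends on the length of the text and no longer on the size of the vocabulary.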

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
