NLP Chinese Information Processing: Inverted Index

Source: Internet
Author: User

An inverted index, also known as a postings file or inverted file, is an index method used in full-text search to store the mapping from a word to its locations in a document or a group of documents. It is the most common data structure in document retrieval systems.

 


Inverted index analysis:


Take English as an example. The text to be indexed is as follows:
T0 = "it is what it is"
T1 = "what is it"
T2 = "it is a banana"
For these texts, we get the index below, in which each entry is a pair consisting of a document number and the position of the word within that document. Both the document numbers and the word positions are counted from zero. Therefore, "banana": {(2, 3)} means that "banana" occurs in the third document (T2), where it is the fourth word (position 3).
"a": {(2, 2)}
"banana": {(2, 3)}
"is": {(0, 1), (0, 4), (1, 1), (2, 1)}
"it": {(0, 0), (0, 3), (1, 2), (2, 0)}
"what": {(0, 2), (1, 0)}
If we perform the phrase search "what is it", we first look up every word of the phrase. The documents that contain all of these words are document 0 and document 1. However, the words occur consecutively, in the order of the phrase, only in document 1.
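The example above can be reproduced with a short Python sketch. It is only an illustration, not the implementation given later in this article; the function names `build_index`, `docs_with_all_words` and `phrase_search` are assumptions introduced here.

```python
from collections import defaultdict


def build_index(texts):
    """Map each word to its (document number, position) pairs, counted from 0."""
    index = defaultdict(list)
    for doc_no, text in enumerate(texts):
        for pos, word in enumerate(text.split()):
            index[word].append((doc_no, pos))
    return dict(index)


def docs_with_all_words(index, phrase):
    """Documents that contain every word of the phrase, in any order."""
    doc_sets = [{doc_no for doc_no, _ in index.get(w, [])} for w in phrase.split()]
    return sorted(set.intersection(*doc_sets)) if doc_sets else []


def phrase_search(index, phrase):
    """Documents in which the words of the phrase occur consecutively."""
    words = phrase.split()
    if not words:
        return []
    hits = set()
    # Treat every occurrence of the first word as a possible start of the phrase.
    for doc_no, start in index.get(words[0], []):
        if all((doc_no, start + k) in index.get(w, []) for k, w in enumerate(words)):
            hits.add(doc_no)
    return sorted(hits)


texts = ["it is what it is", "what is it", "it is a banana"]
index = build_index(texts)
print(index["banana"])                           # [(2, 3)]
print(docs_with_all_words(index, "what is it"))  # [0, 1]
print(phrase_search(index, "what is it"))        # [1]
```

Storing positions, not just document numbers, is what makes the phrase check possible: document 0 contains all three words, but not at consecutive positions.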

 

Python Implementation of inverted indexes:


[Python]
#-------------------------------------------------------------------------------
# Name: InvertedIndex
# Purpose: inverted index
# Created: 02/04/2013
# Copyright: (c) neng2013
# Licence: <your licence>
#-------------------------------------------------------------------------------
import re


# Preprocess 199801.txt: remove part-of-speech tags, dates and some impurities
# (the paragraph structure is retained).
# Input: 199801.txt
# Output: 1998020.new.txt
def pre_file(filename):
    print("Reading corpus file %r..." % filename)
    with open(filename) as src_file:
        src_data = src_file.read()
    # Remove part-of-speech tags, tokens such as '2017-01-001-001',
    # and impurities such as '[' and ']nt'.
    des_data = re.compile(r'(\/\w+)|(\d+\-\S+)|(\[)|(\]\S+)').sub('', src_data)
    des_filename = "1998020.new.txt"
    print("Writing file %r..." % des_filename)
    with open(des_filename, 'w') as des_file:
        des_file.write(des_data)
    print("Processing completed!")


# Create the inverted index.
# Input: 1998020.new.txt
# Output: my_index.txt, one line per word (numbering starts from 0):
#         word[(paragraph number, position in paragraph), ...]
def create_inverted_index(filename):
    print("Reading file %r..." % filename)
    with open(filename) as src_file:
        src_data = src_file.read()
    # Variable description
    sub_list = []  # words of the paragraph currently being scanned
    word = []      # word table (all distinct words)
    result = {}    # output result {word: index}

    # Create the word table
    sp_data = src_data.split()
    set_data = set(sp_data)  # deduplicate
    word = list(set_data)    # convert the set to a list so it can be indexed

    src_list = src_data.split("\n")  # split the text into paragraphs
    # Create the index
    for w in range(0, len(word)):
        index = []  # [(paragraph number, position), (paragraph number, position), ...]
        for i in range(0, len(src_list)):  # traverse all paragraphs
            sub_list = src_list[i].split()
            for j in range(0, len(sub_list)):  # traverse all words in a paragraph
                if sub_list[j] == word[w]:
                    index.append((i, j))
        result[word[w]] = index

    des_filename = "my_index.txt"
    print("Writing file %r..." % des_filename)
    with open(des_filename, 'w') as writefile:
        for k in result.keys():
            writefile.write(str(k) + str(result[k]) + "\n")
    print("Processing completed!")


# Main function
def main():
    # pre_file("199801.txt")  # initial preprocessing of the corpus; produces the cleaned file
    # Indexing the whole cleaned file takes too long because the file is too large,
    # so only a small excerpt (the test file) is indexed here.
    create_inverted_index("1998020.test.txt")


# Run
if __name__ == '__main__':
    main()
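To see what the cleaning pattern in pre_file removes, it can be tried on a single line. The sample line below is hypothetical: it is an assumption about what a tagged corpus line may look like, not text taken from 199801.txt, and the name `TAG_PATTERN` is introduced only for this sketch.

```python
import re

# Same pattern as in pre_file: part-of-speech tags, digit-and-hyphen tokens,
# '[' and ']' followed by a tag.
TAG_PATTERN = re.compile(r'(\/\w+)|(\d+\-\S+)|(\[)|(\]\S+)')

# Hypothetical sample line (an assumption, not taken from the corpus file).
sample = "19980101-01-001-002/m  中国/ns  [中央/n  电视台/n]nt  新年/t"

cleaned = TAG_PATTERN.sub('', sample)
print(cleaned.split())  # ['中国', '中央', '电视台', '新年']
```

After the substitution only the words and the original whitespace remain, which is why create_inverted_index can rely on a plain split() to obtain the words of each paragraph.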


 

Main Algorithm Description:

Input: file to be retrieved (1998020.test.txt)

Output: index file (my_index.txt)

The algorithm scans the full text and, for every word, records the number of each paragraph in which the word occurs and the position of the word within that paragraph.
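The nested loops in create_inverted_index rescan the whole text once for every distinct word, which is likely one reason why indexing a large file takes so long. As a sketch of an alternative (not the code above), the same (paragraph number, position) lists can be collected in a single pass over the text. The names `build_index_single_pass` and `format_index` and the sample text are assumptions made for illustration.

```python
from collections import defaultdict


def build_index_single_pass(src_data):
    """Return {word: [(paragraph number, position), ...]}, numbering from 0."""
    result = defaultdict(list)
    for i, paragraph in enumerate(src_data.split("\n")):  # traverse all paragraphs
        for j, token in enumerate(paragraph.split()):     # traverse all words in a paragraph
            result[token].append((i, j))
    return dict(result)


def format_index(result):
    """Render the index one word per line: the word followed by its list of pairs."""
    return "".join(str(k) + str(v) + "\n" for k, v in result.items())


# Hypothetical two-paragraph sample of already segmented text.
sample = "迈向 充满 希望 的 新 世纪\n新 世纪 的 希望"
index = build_index_single_pass(sample)
print(index["希望"])  # [(0, 2), (1, 3)]
print(index["新"])    # [(0, 4), (1, 0)]
```

Each word is visited once, so the work grows with the length of the text instead of with the length of the text multiplied by the number of distinct words.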

 


 
