Python full-text search engine details
Recently, I have been exploring how to implement Baidu-style keyword search in Python. Keyword search naturally brings regular expressions to mind: they underlie many search tasks, and Python's built-in re module is dedicated to regular expression matching. However, regular expressions alone do not make for good retrieval.
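For contrast, here is what bare re matching gives you (a minimal sketch with a made-up sample string): literal pattern hits, with no word segmentation, no index, and no relevance ranking.

# -*- coding: utf-8 -*-
import re

text = u"Whoosh is a pure Python full-text search library"
# re only reports literal pattern occurrences; there is no notion
# of documents, scoring, or word segmentation
print re.findall(u"Python", text)   # [u'Python']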
Python has a package dedicated to full-text search: whoosh. Whoosh is little used in China, and its performance is not as mature as Sphinx/Coreseek; unlike those, however, it is a pure Python library, which makes it more convenient for Python enthusiasts. The specific code is as follows:
Install
At the command line, enter pip install whoosh. The Chinese tokenizer below also depends on the jieba word-segmentation package, which can be installed the same way (pip install jieba).
The packages to be imported include:
from whoosh.index import create_in, open_dir
from whoosh.fields import *
from whoosh.analysis import RegexAnalyzer
from whoosh.analysis import Tokenizer, Token
# the code below also needs these
from whoosh.compat import text_type
import jieba
import os
Chinese tokenizer
class ChineseTokenizer(Tokenizer):
    """Chinese word-segmentation tokenizer"""
    def __call__(self, value, positions=False, chars=False,
                 keeporiginal=True, removestops=True,
                 start_pos=0, start_char=0, mode='', **kwargs):
        assert isinstance(value, text_type), "%r is not unicode" % value
        t = Token(positions, chars, removestops=removestops, mode=mode, **kwargs)
        # let jieba split the text into search-oriented segments
        list_seg = jieba.cut_for_search(value)
        for w in list_seg:
            t.original = t.text = w
            t.boost = 0.5
            if positions:
                t.pos = start_pos + value.find(w)
            if chars:
                t.startchar = start_char + value.find(w)
                t.endchar = start_char + value.find(w) + len(w)
            yield t

def chinese_analyzer():
    return ChineseTokenizer()
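To sanity-check the tokenizer on its own, the analyzer can be called directly on a unicode string (a minimal sketch; the sample sentence is arbitrary):

analyzer = chinese_analyzer()
for token in analyzer(u"我的好朋友是李明"):
    print token.text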
Index building function
@staticmethod
def create_index(document_dir):
    analyzer = chinese_analyzer()
    # "titel" is the field name used throughout this article (title, stored for retrieval)
    schema = Schema(titel=TEXT(stored=True, analyzer=analyzer),
                    path=ID(stored=True),
                    content=TEXT(stored=True, analyzer=analyzer))
    # write the index files into the current directory
    ix = create_in("./", schema)
    writer = ix.writer()
    for parents, dirnames, filenames in os.walk(document_dir):
        for filename in filenames:
            title = filename.replace(".txt", "").decode('utf8')
            print title
            content = open(document_dir + '/' + filename, 'r').read().decode('utf-8')
            path = u"/b"
            writer.add_document(titel=title, path=path, content=content)
    writer.commit()
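After committing, it is easy to verify how many documents made it into the index by reopening it (a minimal check; open_dir and doc_count are standard Whoosh calls):

ix = open_dir("./")
print ix.doc_count()   # should equal the number of .txt files indexed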
Search function
@staticmethod
def search(search_str):
    title_list = []
    print 'here'
    # open the index that create_index wrote to the current directory
    ix = open_dir("./")
    searcher = ix.searcher()
    print search_str, type(search_str)
    # parse search_str against the "content" field and run the query
    results = searcher.find("content", search_str)
    for hit in results:
        print hit['titel']
        print hit.score
        # show up to 10 highlighted fragments from the matching document
        print hit.highlights("content", top=10)
        title_list.append(hit['titel'])
    print 'tt', title_list
    return title_list
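The @staticmethod decorators suggest these two functions belong to a class that the article does not show; for a standalone test, the decorators can be dropped and the functions defined at module level. Assuming a hypothetical ./documents directory of .txt files and an example query string, an end-to-end run looks like this:

create_index('./documents')    # hypothetical directory of .txt files
titles = search(u'李明')        # example query; returns the matching titles
print titles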
Thank you for reading this article; I hope it helps you. Thanks for your support of this site!