Python full-text search engine details

Recently, I have been exploring how to implement Baidu-style keyword search in Python. Keyword search immediately brings regular expressions to mind, and regular expressions are indeed the foundation of searching; Python ships the re module specifically for regular-expression matching. Regular expressions alone, however, make for a weak retrieval function.
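To see why, here is a minimal sketch of what a purely regex-based search amounts to (the sample documents and pattern are invented for illustration): a linear scan for literal matches, with no word segmentation, no relevance ranking, and no index.

import re

# Naive keyword search with re: a linear scan for literal pattern matches.
documents = [u"Whoosh is a pure Python search library.",
             u"Regular expressions match literal patterns."]
pattern = re.compile(u"search")
hits = [doc for doc in documents if pattern.search(doc)]
print(hits)  # substring hits only: no segmentation, no scoring, no index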

Python, however, has a package dedicated to full-text search engines: Whoosh.

Whoosh is rarely used in China, and its performance is not as mature as that of Sphinx/Coreseek. Unlike those, however, it is a pure Python library, which makes it more convenient for Python enthusiasts. The specific code is as follows:

Install

At the command line, run pip install whoosh. The Chinese tokenizer below also depends on the jieba segmentation package, which installs the same way (pip install jieba).

The packages to be imported include:

from whoosh.index import create_in, open_dir
from whoosh.fields import Schema, TEXT, ID
from whoosh.analysis import Tokenizer, Token
from whoosh.compat import text_type  # unicode on Python 2, str on Python 3
import jieba
import os

Chinese Word Segmenter

class ChineseTokenizer(Tokenizer):
    """A Whoosh tokenizer that segments Chinese text with jieba."""

    def __call__(self, value, positions=False, chars=False,
                 keeporiginal=True, removestops=True,
                 start_pos=0, start_char=0, mode='', **kwargs):
        assert isinstance(value, text_type), "%r is not unicode" % value
        t = Token(positions, chars, removestops=removestops, mode=mode, **kwargs)
        # cut_for_search gives the fine-grained segmentation used by search engines.
        for w in jieba.cut_for_search(value):
            t.original = t.text = w
            t.boost = 0.5
            if positions:
                t.pos = start_pos + value.find(w)
            if chars:
                t.startchar = start_char + value.find(w)
                t.endchar = start_char + value.find(w) + len(w)
            yield t

def chinese_analyzer():
    return ChineseTokenizer()
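As a quick smoke test (the sample sentence below is arbitrary and chosen only for illustration), you can call the analyzer directly on a string and print the tokens it yields:

# Print each segment jieba produces for a sample sentence.
analyzer = chinese_analyzer()
for token in analyzer(u"全文搜索引擎详解"):
    print(token.text)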

Index building function

def create_index(document_dir):
    """Build a Whoosh index over every .txt file under document_dir."""
    analyzer = chinese_analyzer()
    schema = Schema(title=TEXT(stored=True, analyzer=analyzer),
                    path=ID(stored=True),
                    content=TEXT(stored=True, analyzer=analyzer))
    ix = create_in("./", schema)  # write the index files to the current directory
    writer = ix.writer()
    for dirpath, dirnames, filenames in os.walk(document_dir):
        for filename in filenames:
            title = filename.replace(".txt", "")
            print(title)
            with open(os.path.join(dirpath, filename), encoding='utf-8') as f:
                content = f.read()
            writer.add_document(title=title, path=u"/b", content=content)
    writer.commit()
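Indexing is then a single call; the directory name below is only an example, assumed to contain UTF-8 encoded .txt files:

# Hypothetical usage: index every .txt file under ./documents.
create_index("./documents")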

Search function

def search(search_str):
    """Return the titles of documents whose content matches search_str."""
    title_list = []
    ix = open_dir("./")  # open the index created above
    with ix.searcher() as searcher:
        results = searcher.find("content", search_str)
        for hit in results:
            print(hit['title'])
            print(hit.score)
            print(hit.highlights("content", top=10))  # highlighted excerpts
            title_list.append(hit['title'])
    return title_list
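Putting the two functions together (the query string below is just an example):

# Hypothetical usage: build the index once, then query it.
create_index("./documents")
for title in search(u"全文搜索"):
    print(title)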

Thank you for reading; I hope this article helps you, and thank you for supporting this site!
