A Self-Built Website Full-Text Retrieval System

Source: Internet
Author: User

This article is a report I wrote for a database-related assignment; I am posting it here.

1. Overview
1.1. The problem

If you run a large website with a lot of content, visitors often have trouble finding what they need. A site search helps them find the information they want faster.

1.2. Solution

Build your own full-text retrieval system.
1.2.1. What is full-text search
Full-text search is a text retrieval method that matches the query against all of the text in a file. A full-text retrieval system is a software system built on full-text retrieval theory to provide a full-text retrieval service.
The largest search engines today, Google and Baidu, both use full-text search technology. Google, of course, built its huge data processing platform on its three core technologies (GFS, MapReduce, BigTable). Full-text retrieval is an important part of search engine technology.
1.2.2. Data classification

    • Structured data: data that has a fixed format or a finite length, such as databases and metadata.

    • Unstructured data: data that has no fixed format or has an indefinite length, such as emails and Word documents.

Unstructured data is also called full-text data.
1.2.3. Search is likewise divided into two kinds by data type

    • Search for structured data:

for example, searching a database with SQL statements, or searching in Windows by file name, type, or modified time.

    • Search for unstructured data:

for example, searching file content in Windows, the grep command on Linux, or Google and Baidu searching huge amounts of content.
Searching unstructured data is also called searching full-text data, and full-text search technology is what makes a good search experience possible here. There are two main ways to search full-text data:

    1. Sequential scan: to find the files that contain a given string, each document is read from beginning to end, one after another.

    2. Index scan: part of the information in the unstructured data is extracted and reorganized so that it has a structure; this extracted data is called the index, and searches run against it instead of the original text.
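
The difference between the two approaches can be sketched in a few lines of Python. This is only an illustration and not part of the system described here: the sample documents and the function names sequential_scan, build_index and index_scan are made up for the example.

```python
# Hypothetical sample data: document id -> text.
DOCS = {
    1: "linux grep command searches file content",
    2: "windows search finds files by name and type",
    3: "full-text search on linux with an index",
}

def sequential_scan(docs, word):
    """Read every document from beginning to end, the way grep does."""
    return sorted(doc_id for doc_id, text in docs.items() if word in text.split())

def build_index(docs):
    """Extract the words once and reorganize them as word -> set of document ids."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.split():
            index.setdefault(word, set()).add(doc_id)
    return index

def index_scan(index, word):
    """Answer the query from the index without reading the documents again."""
    return sorted(index.get(word, set()))

INDEX = build_index(DOCS)
print(sequential_scan(DOCS, "linux"))
print(index_scan(INDEX, "linux"))
```

Both calls return the same document ids; the sequential scan pays the cost of reading all text on every query, while the index scan pays it once when the index is built.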

The full-text search provided by a database alone can no longer meet a website's search requirements. In the big data era, where the needed information must be found in massive amounts of data, building a powerful full-text retrieval system is necessary.

2. Full-text retrieval system design and implementation strategy
2.1. Architecture of the system

The following picture illustrates the architecture:

2.2. Module design

Full-text search is divided into two processes:
index creation (indexing) and index search.
Index creation: the process of extracting information from the data and building the index.
Index search: the process of receiving the user's query, searching the created index, and returning the results.
2.2.1. Information Processing

The core of the information processing module is index creation.
What does the index store?
The index generally stores the following information:

Suppose there are 100 documents, numbered 1 to 100.
Dictionary: stores a series of strings.
Inverted table: each string points to the list of documents that contain it (its posting list).
How is the index created?
Creating a full-text index generally involves the following steps:
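
A minimal sketch of these two structures in Python, assuming a sorted list for the dictionary and plain lists of document ids for the inverted table. The three strings and their document ids are invented for illustration; a real engine uses far more compact on-disk structures.

```python
from bisect import bisect_left

# Hypothetical postings: which of the 100 documents (ids 1..100) contain each string.
postings = {
    "lucene": [3, 7, 15, 30, 35, 67],
    "solr": [3, 9, 10, 30, 64],
    "hadoop": [4, 7, 9, 15, 57, 64],
}

# Dictionary: the strings in sorted order.
dictionary = sorted(postings)
# Inverted table: one posting list per dictionary entry, at the same position.
inverted_table = [postings[term] for term in dictionary]

def lookup(term):
    """Binary-search the dictionary, then follow the position to the posting list."""
    pos = bisect_left(dictionary, term)
    if pos < len(dictionary) and dictionary[pos] == term:
        return inverted_table[pos]
    return []

print(dictionary)
print(lookup("solr"))
```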

    • Start with the documents to be indexed.

    • Pass the original documents to the tokenizer component (Tokenizer).

    • Pass the resulting tokens (Token) to the language processing component (Linguistic Processor).

    • Pass the resulting terms (Term) to the index component (Indexer).

    • Merge identical terms into a document inverted list (Posting List).
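
The steps above can be sketched as a small Python pipeline. Everything here is an assumption made for illustration: the two sample sentences, the stop-word set, and the ROOT_FORMS lookup table, which stands in for real stemming or lemmatization. Coreseek's Chinese word segmentation works differently and is not shown.

```python
import re
from collections import defaultdict

# Hypothetical documents to be indexed.
DOCS = {
    1: "Students should be allowed to go out with their friends.",
    2: "My friend Jerry went to school to see his students.",
}
STOP_WORDS = {"to", "be", "with", "their", "my", "his", "should", "out"}
# Toy replacement for real stemming/lemmatization (assumption for this sketch).
ROOT_FORMS = {"students": "student", "friends": "friend", "allowed": "allow", "went": "go"}

def tokenize(text):
    """Tokenizer: split into word tokens, dropping punctuation and stop words."""
    return [t for t in re.findall(r"[A-Za-z]+", text) if t.lower() not in STOP_WORDS]

def linguistic_process(tokens):
    """Linguistic processor: lowercase each token and reduce it to a root form."""
    terms = []
    for token in tokens:
        lowered = token.lower()
        terms.append(ROOT_FORMS.get(lowered, lowered))
    return terms

def build_index(docs):
    """Indexer: collect (term, doc id) pairs, then merge equal terms into posting lists."""
    pairs = []
    for doc_id, text in docs.items():
        for term in linguistic_process(tokenize(text)):
            pairs.append((term, doc_id))
    index = defaultdict(list)
    for term, doc_id in sorted(set(pairs)):  # sorted by term, then by document id
        index[term].append(doc_id)
    return dict(index)

index = build_index(DOCS)
print(index["student"])
print(index["go"])
```

Because "went" is reduced to "go" during language processing, both documents end up in the posting list for "go", which is what lets a query for one word form find the other.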

(Index structure illustration)

Document Frequency: how many files contain this term.
Frequency: the term frequency, that is, how many times this term occurs in this file.
Note: creating the index is the core task of the system. It involves topic dictionary processing, deduplication of information, document modeling, document analysis and filtering, and building the inverted index. Stop words (words that carry little meaning) also need to be handled.
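
As a small illustration of these two numbers, the following Python sketch stores the term frequency inside each posting and derives the document frequency from the length of the posting list. The documents and function names are hypothetical.

```python
from collections import Counter

# Hypothetical documents, already reduced to terms.
DOCS = {
    1: ["linux", "search", "linux", "index"],
    2: ["search", "engine"],
    3: ["linux", "kernel"],
}

def build_postings(docs):
    """Return term -> {doc id: number of occurrences of the term in that document}."""
    postings = {}
    for doc_id, terms in docs.items():
        for term, freq in Counter(terms).items():
            postings.setdefault(term, {})[doc_id] = freq
    return postings

postings = build_postings(DOCS)

def document_frequency(term):
    """How many documents contain the term."""
    return len(postings.get(term, {}))

def term_frequency(term, doc_id):
    """How many times the term occurs in one document."""
    return postings.get(term, {}).get(doc_id, 0)

print(document_frequency("linux"))
print(term_frequency("linux", 1))
```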
2.2.2. Query Service

The focus of the query service is searching the index.
How to search the index
Searching mainly consists of the following steps:

    • Step 1: the user enters a query.

    • Step 2: the query goes through lexical analysis, syntax analysis, and language processing.

    • Step 3: the index is searched for the documents that match the syntax tree.

    • Step 4: the results are sorted by the relevance of each document to the query.
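
The four steps can be sketched end to end in Python. The scoring used here, term frequency multiplied by a smoothed inverse document frequency, is an assumption chosen because it is a common textbook scheme; it is not necessarily how Sphinx or Coreseek rank results. The documents and names are made up, and the query is treated as a simple OR of its words.

```python
import math
from collections import Counter

# Hypothetical documents.
DOCS = {
    1: "linux full text search with sphinx",
    2: "mysql database search",
    3: "linux kernel notes about linux scheduling",
}

def analyze(text):
    """Steps 1-2 (simplified): turn a document or query string into terms."""
    return text.lower().split()

# Term counts per document, used both for matching and for scoring.
TERMS = {doc_id: Counter(analyze(text)) for doc_id, text in DOCS.items()}

def idf(term):
    """Rarer terms get a higher weight; unknown terms get zero."""
    df = sum(1 for counts in TERMS.values() if term in counts)
    return math.log(len(TERMS) / df) + 1.0 if df else 0.0

def search(query):
    """Steps 3-4: find documents containing any query term, best score first."""
    scores = {}
    for term in analyze(query):
        weight = idf(term)
        for doc_id, counts in TERMS.items():
            if term in counts:
                scores[doc_id] = scores.get(doc_id, 0.0) + counts[term] * weight
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

print(search("sphinx search"))
print(search("linux"))
```

Document 1 ranks first for "sphinx search" because it matches both words and "sphinx" is the rarer one; for "linux", document 3 ranks first because the word occurs twice in it.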

View Google search:

2.3. Overall system operation process

2.3.1. Indexing process:

    1. Start with a series of files to be indexed.

    2. The files are parsed and linguistically processed into a series of terms.

    3. Index creation produces a dictionary and an inverted index table.

    4. The index store writes the index to the hard disk.
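
A minimal Python sketch of this process, assuming the index store is nothing more than a JSON file. Real engines such as Sphinx use their own binary index formats; the file name index.json and the helper names are hypothetical.

```python
import json
import os
import tempfile

def build_index(docs):
    """Steps 1-3: turn documents into a dictionary of term -> posting list."""
    index = {}
    for doc_id, text in docs.items():
        for term in sorted(set(text.lower().split())):
            index.setdefault(term, []).append(doc_id)
    return index

def save_index(index, path):
    """Step 4: write the index to disk."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(index, f, ensure_ascii=False, sort_keys=True)

def load_index(path):
    """Read the index back into memory, as the search side would."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

docs = {1: "Linux search", 2: "MySQL search"}
path = os.path.join(tempfile.mkdtemp(), "index.json")
save_index(build_index(docs), path)
loaded = load_index(path)
print(loaded["search"])
```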

2.3.2. Search process:
a) The user enters a query.
b) Lexical analysis and language processing of the query produce a series of terms.
c) Syntax analysis produces a query tree.
d) The index store reads the index into memory.
e) The query tree is used to search the index: the document list of each term is fetched, and the lists are combined by intersection, union, and difference to get the result documents.
f) The result documents are sorted by their relevance to the query.
g) The query results are returned to the user.
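
Step e) is the least obvious one, so here is a small Python sketch of it. It assumes the query has already been parsed into a tree of nested tuples (the parser itself is omitted), and that the operators are named AND, OR and NOT; the index contents are invented.

```python
# Hypothetical in-memory index: term -> set of document ids.
INDEX = {
    "lucene": {1, 2, 4},
    "learned": {2, 3, 4},
    "hadoop": {4, 5},
}

def evaluate(node, index):
    """Recursively evaluate a query tree against the index."""
    if isinstance(node, str):                  # leaf: a single term
        return set(index.get(node, set()))
    op, left, right = node
    left_docs = evaluate(left, index)
    right_docs = evaluate(right, index)
    if op == "AND":
        return left_docs & right_docs          # intersection
    if op == "OR":
        return left_docs | right_docs          # union
    if op == "NOT":
        return left_docs - right_docs          # difference
    raise ValueError(f"unknown operator: {op}")

# Query tree for: lucene AND learned NOT hadoop
tree = ("NOT", ("AND", "lucene", "learned"), "hadoop")
print(sorted(evaluate(tree, INDEX)))
```

Documents 2 and 4 contain both "lucene" and "learned", and document 4 is then removed because it also contains "hadoop", leaving document 2 as the only result.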

3. Experiment / System Execution
3.1. Objectives of the experiment

The goal is for the experimental search engine to work correctly through both phases, indexing and retrieval. Through this experiment, the author verifies the feasibility of the design.
The experimental system is built on the open source search engine Coreseek. Coreseek is a Chinese full-text retrieval system based on Sphinx, with good support for Chinese word segmentation. Its principles and configuration are similar to those of Sphinx.
Sphinx is an SQL-based full-text search engine that can be combined with MySQL or PostgreSQL for full-text search. It provides more specialized search features than the database itself, making it easier for applications to implement professional full-text search.

(Image from open source China)

3.2. Experimental steps

1) Preparation
Download Coreseek from the official website: http://www.coreseek.cn
Set up a PHP + MySQL environment.
2) Create a data source
Create a new MySQL database named fulltext, create a test data table, and add test data.
The target table structure is as follows:

3) Install and test Coreseek
Configure the data source:

Index configuration and search service configuration:

Test the configuration and build the index:

Command-line search test:
Search for the keyword Linux

Command-line Chinese search test
Because the Win32 command line does not support UTF-8 input, the generic search command cannot test Chinese directly; use the Chinese-capable iconv command provided by Coreseek to test Chinese:

4) Build the PHP web full-text search
Start the search service first:

Interface design:
index.php

Search handler: search.php

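<?php
// The start of this script is missing from the source; the four statements
// before SetServer() are assumed (including the 'keyword' field name).
session_start();
require 'sphinxapi.php';  // PHP client shipped with Sphinx/Coreseek
$keyword = isset($_POST['keyword']) ? $_POST['keyword'] : '';
$cl = new SphinxClient();
$cl->SetServer('127.0.0.1', 9312);
$cl->SetConnectTimeout(3);
$cl->SetArrayResult(true);
$cl->SetMatchMode(SPH_MATCH_ANY);
$res = $cl->Query($keyword, "*");
$info = array();
$htmlres = array();
if ($res != null) {
    $info['id'] = $res['id'];
    $info['total'] = $res['total'];
    $info['time'] = $res['time'];
    $idarr = array();
    $matches = isset($res['matches']) ? $res['matches'] : array();
    foreach ($matches as $doc) {
        $idarr[] = $doc['id'];
    }
    if ($idarr) {  // skip the SQL query when nothing matched
        // Retrieve the matching rows from MySQL by id.
        // The mysql_* functions need PHP 5; on PHP 7+ use mysqli or PDO.
        $ids = join(',', $idarr);
        mysql_connect("localhost", "root", "");
        mysql_select_db("fulltext");
        $sql = "select * from documents where id in ({$ids})";
        mysql_query("set names utf8");
        $resdb = mysql_query($sql);
        // Highlighting properties; the markup values are missing in the
        // source, so set the tags that should wrap each match here.
        $opts = array('before_match' => '', 'after_match' => '');
        while ($row = mysql_fetch_assoc($resdb)) {
            $res2 = $cl->BuildExcerpts($row, "mysql", $keyword, $opts);
            $id = $res2[0];
            $title = $res2[1];
            $content = $res2[2];
            $htmlres[] = array('id' => $id, 'title' => $title, 'content' => $content);
        }
    }
}
$_SESSION['info'] = $info;
$_SESSION['htmlres'] = $htmlres;
// Nothing may be printed before header(), or the redirect fails.
header("Location: index.php");
?>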

Example search for the keyword Linux:

Full support for Chinese:

4. Summary

This experiment verifies that it is feasible to build a website's full-text retrieval system independently.
I originally considered building a small search engine that, like Baidu, crawls web pages, indexes them, and provides search. But once I started on the system architecture design, I found that a complete search engine has far too many tasks to handle. Earlier, while learning PHP, I had a lot of fun with the file_get_contents() function and wrote a simple script to download pictures of cute girls from Baidu Tieba O(∩_∩)O~, so the idea of fetching content from the whole Internet directly seemed really cool.
Later I realized that Baidu is itself a full-text search engine, and that besides crawling web pages, processing the content of those pages is a major technology in its own right, as is natural language processing for search. So I decided to focus only on full-text retrieval.

