Use the full-text search function of Microsoft SQL Server 2000 to build a web search application

Source: Internet
Author: User
Tags how to use sql server sql server books how to use sql
Introduction

The full-text search function of Microsoft SQL Server 2000 allows you to quickly and flexibly query indexes generated on unstructured text data. A common full-text search tool is the search engine of a website. To help readers understand the best use of full-text search, this article introduces a large number of abstract concepts, and optimizes full-text indexing and query to achieve maximum throughput and optimal performance, tips and tips are provided.

Full-text search

The full-text search function is introduced in SQL Server 7.0. The full-text search Core Engine is built on Microsoft Search (MSSEARCH) technology. Microsoft Exchange, Microsoft SharePoint Portal Server, and other products also use this technology.

SQL Server 7.0 full-text search provides basic text search functions and uses earlier versions of MSSEARCH; the full-text search Implementation of SQL Server 2000 includes a set of reliable indexing and query functions, and several enhancements are added on the basis of SQL Server 7.0. These enhancements include: Microsoft cluster service fully supports cluster operations, filtering and indexing documents stored in the image column, and providing improved language support, it also improves performance, scalability, and reliability.

MSSEARCH generates, maintains, and queries full-text indexes stored in the file system (instead of in SQL Server. The logical and physical storage units used by MSSEARCH for full-text indexing are directories. The full-text directory contains one or more full-text indexes in each database-you can create a full-text index for each table in SQL Server, and the index can contain one or more columns in the table. Each table can belong to only one directory, and each table can only create one index. We will briefly introduce the best solution for organizing full-text directories and indexes-but first, let's take a look at the working principles of full-text search.

Configure full-text search

To create a full-text index for text data stored in SQL Server, complete the following steps. The first step is to enable the database containing the text data to be indexed in full text (if you have not performed this operation ).

Note:Execute the following statement to discard and recreate all full-text directories of the database to enable full-text search. Unless you want to recreate the full-text directory, make sure that no full-text directory is created in the specific database that you want to enable.

If you are a member of the SysAdmin role or db_owner of the database, you can continue and issue the following statement:

use Northwind exec sp_fulltext_database 'enable'

Next, you need to create a full-text directory to store full-text indexes. As mentioned above, the data in this directory is stored in the file system (rather than in SQL Server). Therefore, you should carefully select the storage location of the full-text directory. Unless other locations are specified, the full-text directory is stored in the sub-directory of the ftdata directory (in the Microsoft SQL Server/MSSQL storage location. The following describes how to create a full-text directory in a non-default location:

exec sp_fulltext_catalog 'Cat_Desc', 'create', 'f:/ft'

In this example, the full-text directory is created as a sub-directory of "F:/FT". If you view this part of the file system, you will see that it has its own directory. The naming rules for full-text directories used by MSSEARCH are as follows:

SQL+dbid+catalogID

The directory ID starts from 00005 and increases by 1 every time a new directory is created.

If possible, it is best to create a full-text directory on its physical drive. If the process that generates the full-text index requires a lot of I/O operations (specifically, it reads data from SQL server and then writes the index to the file system ), i/O subsystem should be avoided.

How big is the full-text directory? Generally, the system overhead of the full-text directory is about 30% higher than the data stored in SQL Server (full-text indexing). However, this rule depends on the distribution of unique words (or primary keys) in the data, and the distribution of words that you consider as interfering words. Interfering words (or terminologies) are words that are excluded from full-text indexing and queries (because they are not the search term you are interested in and are frequently used, so it will only make the index very large, but will not have the actual effect ). Later, we will introduce the precautions for the selection of interference words and how to optimize the interference words to improve the query performance.

If you have not performed this operation, create a unique, single column, non-null index, on each table to generate a full-text index. This unique index is used to map each row in the table to a unique compress primary key used in MSSEARCH. Next, you need to let MSSEARCH know that you want to create a full-text index for the table. Run the following statement on the table to add the table to the selected full-text directory (in this example, it is the "cat_desc" we created earlier "):

exec sp_fulltext_table 'Categories', 'create', 'Cat_Desc',   'PK_Categories'

The next step is to add columns to this full-text index. You can select a language for each column. If the column type is image, you must specify another column to indicate the document type stored in each row of the image column.

There are some important but unwritten considerations for column language selection. These considerations are related to the text marking method and the text indexing method of MSSEARCH. The indexed text is provided by a component called the word separator (used as the word boundary mark. In English, word delimiters are usually space or some form of punctuation. In other languages (such as German), words or characters can be combined. Therefore, the selected column language indicates the language of the row to be stored in the column. If you are not sure, the best way is to use a neutral word separator (only space and punctuation are used for marking ). Another benefit of selecting a column language is "root tracing ". In full-text query, root sourcing refers to the process of searching for all forms of change of a word in a specific language.

Another consideration for language selection is related to the data representation method. For non-image column data, special filtering operations are not required, while text usually needs to pass the word separation component as is. Word delimiters are mainly used to process written text. Therefore, if the text contains any type of tags (such as HTML), the language accuracy will not be very high during indexing and searching. In this case, you have two options-the preferred method is to store text data only in the image column and specify its document type for filtering. If you do not select this method, you can consider using a neutral word separator. If possible, Add Tag data (such as "Br" in HTML) to the interference word list "). You cannot perform any language-based root tracing in the column with specified neutral language, but some environments may require you to select this method.

After you know the column options, you can add one or two columns to the full-text index by issuing the following statement:

exec sp_fulltext_column 'Categories', 'Description', 'add'

You may notice that no language is specified here-in this case, the default full-text language is used. You can use the system stored procedure "sp_configure" to set the default full-text language for the server.

After adding all columns to the full-text index, you can perform the fill operation. There are too many filling methods, which are not described in detail here. In this example, you only need to fully fill the table and wait for it to finish executing:

exec sp_fulltext_table 'Categories', 'start_full'

You may want to use the fulltextcatalogproperty or objectproperty function to monitor the fill status. To obtain the directory fill status, run the following command:

select FULLTEXTCATALOGPROPERTY('Cat_Desc', 'Populatestatus')

Normally, if the full filling is in progress, the returned result is "1 ". For more information about how to use fulltextcatalogproperty and objectproperty, see SQL Server books online.

Full-text Query

The full-text index query is slightly different from the standard relational query in SQL Server. Because indexes are stored and managed outside the SQL Server, full-text query processing is mostly done by MSSEARCH (therefore, some relational and full-text-based queries will be processed separately ), this may sometimes damage performance.

In essence, when a full-text query is executed, the query term is passed to MSSEARCH, which traverses its internal data structure (INDEX) and returns the primary key and rank value to SQL Server. If you perform a contains or freetext query, the primary key or qualifying value is usually not visible. However, if you perform a containstable or freetexttable query, these values are obtained, and these values are usually merged with the base table. The process of merging primary keys with the base table requires high system overhead-later, we will introduce you to some clever methods to minimize or completely avoid such merging.

If you have a preliminary understanding of how to return data in full-text queries, You can infer that only the contains/freetext query executes the containstable/freetexttable query and merges it with the base table. With this understanding, you should avoid using these types of queries, unless you do not have a higher overhead. In Web Search applications, using containstable and freetexttable is much better than using similar functions without table.

Until now, you know that full-text query is a special method used to access data from MSSEARCH indexes stored outside SQL Server. If you blindly Merge data with the base table, you will have trouble. Another important thing to understand is the essential difference between the contains style query and the freetext style query.

Contains query is used to perform a full match query on all the words in the query. Whether you only search for a single word or all words starting with "orange", the system returns only results containing all search words. Therefore, the contains query speed is very fast, because they usually return very few results and do not need to execute too much additional processing. Disadvantages of contains query include annoying interference word filtering. Experienced developers and database administrators who used full-text search in the past try to match words or phrases that only contain a single interfering word, I have encountered an astonishing error like "your query only contains interfering words. To avoid this error, filter out the interfering words before performing full-text queries. It is impossible to return results to the ins query that contains interfering words, because such queries only return results that exactly match the entire query string. Because the interfering word is not a full-text index, no rows are returned for contains queries containing the interfering word.

The freetext query eliminates all occasional warnings in the contains query. When a freetext query is sent, the Root query is actually sent. Therefore, when you search for "root beer", "root" and "beer" contain all their forms (root tracing is related to language; the language used is determined by the full-text column language specified when the index is generated, and must be the same among all queried columns), and the system returns all rows that match at least one of these words.

The negative impact of freetext queries is that they usually consume more CPU than contains queries-because more complex ranking calculations are required to trace the root and return more results. However, freetext-based queries are extremely flexible and fast. It is the best choice for web-based search applications.

Ranking and Optimization

I often see users who use full-text search. They ask me what the ranking number means and how to convert the ranking number to a value that some users can understand. The answer to this question can be long or short. Here I will give a brief answer. In short, these ranking numbers are not as important as the order returned by the results. That is to say, when you sort the results by ranking, the results with the highest degree of association are always returned first. The ranking value itself often changes-full-text search uses the probability ranking algorithm, that is, the relevance of each returned document is directly affected by any or all other documents in the full-text index.

Some people think that a technique that helps increase the row ranking is to repeat frequently-used search keywords in the full-text index columns of these rows. Although, to some extent, this method may increase the probability that these rows will return first due to certain keywords, in other cases, it may be counterproductive-and there are also risks of reducing the word query performance. A better solution is to implement the "Best Choice" system for the search application (see the following example), so that you can ensure that some documents are returned first. Repeated use of keywords will expand the full-text index of these specific keywords, and it will take time for MSSEARCH to find the correct rows and calculate the ranking. If the full-text index has a large amount of data and you try this method, you may find that some full-text queries are time-consuming. If you can implement a more detailed (or more accurate) "Best Choice" system, you will find that it significantly improves query performance.

Another problem with repeated data is related to common techniques used to combine relational queries and full-text queries. Many users who use full-text search suffer from this problem. This problem occurs whenever they try to apply a filter to the results returned by full-text search. As mentioned above, full-text query returns a primary key and a ranking for each matching row-to collect any detailed information about these rows, it must be merged with its base table. As an unlimited full-text query may return any number of results, merging may require a large amount of system overhead. One effective way to avoid merging is to add only the data to be filtered in the full-text index (if possible ). In other words, if you want to search for the keyword "Ichiro" from the text of all the articles in the newspaper and only want to return the articles in the newspaper's sports column, the query statement is usually as follows:

-- [Method 1:] -- maximum Overhead: select all first, then merge and filter select articles_tbl.author, articles_tbl.body, articles_tbl.dateline, ft_tbl. [Rank] From freetexttable (articles, body, 'ichiro ') as ft_tblinner join articles as articles_tblon ft_tbl. [Key] = articles_tbl.articleidwhere articles_tbl.category = 'Sports '-- [Method 2:] -- can be used, but it may cause unexpected results to slow down or return inaccurate results: -- perform full-text filtering and extract only the primary key and the ranking -- (processing completed on the Web server) select [Key], [Rank] From containstable (articles ,*, 'formsof (inflectional ('ichiro ') and "Sports "')

These two queries either consume a large amount of system overhead or may return incorrect results (in the second query, "Sports" is likely to appear in all types of articles ). There are other variants of these two technologies, but they are two very simple models. If feasible, I usually recommend that you divide the data horizontally. That is, each possible value in the "category" column is a self-contained column (or table), and the searchable keywords related to this Article are only stored in this column. Instead of using a "body" column and a "category" column, you can remove the "category" column, use the "body _ <Category>" column that stores the searchable keywords. As shown in the following example:

-- If you can adjust the architecture, this is very effective-each category-is your own column (or table) and requires fewer hit-full-text indexes. This obviously requires some explanation ...... Select [Key], [Rank] From freetexttable (articles, body_sports, 'ichiro ')

The performance of a system that contains a large amount of data that can adapt to this architecture (perhaps the primary architecture) changes will be significantly improved. However, there are obvious restrictions on when to apply multiple filters or not apply filters. Of course, there are other ways to solve these problems. Through the above example, you will understand a method to abstract certain search conditions to the architecture-in fact, it is "spoofing" the optimization program (more specifically "becoming" the optimization program ), because there is no local optimization in the full-text query of SQL server itself.

Other performance skills

When chatting, people often ask me another question: how to display the full-text query results by page. In other words, if I want to issue a "root beer" query, 40 results will be displayed on a web page at a time, and only 40 results on the page will be returned (for example, if I want to return only 81st to 120th results on the third page ).

I have seen multiple methods for Displaying results by page, but none of them can be effective. The recommended method can minimize the number of full-text query executions (in fact, only one execution is required for each result set to be displayed by PAGE ), the Web server is used as a simple cache. At a higher level, you only need to retrieve a complete set of primary key and qualifying value rows in full-text queries (if necessary, you can use the best choice in the architecture and extract common filters ), and store it in the memory of the Web server (depending on your application and load, imagine adding the <32-byte typical primary key size and the <4-byte ranking size [equal to <36 byte], then multiplying by the result set usually returned <1000 rows, and finally equal to <35 K. Assume that an active cache set in the <1000 active query result sets is returned at any given time, you will find that the active cache set occupies less than 35 MB of memory on the Web server-this is acceptable ).

To display results by page, the process only traverses the arrays stored in the memory of the Web server, and issues a SELECT statement to the SQL Server to display only the rows and columns to be displayed. This returns to the concept that full-text queries only return the primary key and rank.-select (or even many such query statements) is much faster than full-text queries. You can retain more CPU cycles on the SQL server computer by using select instead of combining multiple rows with the base table and combining multiple other policies, and more effective and cost-effective use of the Web field.

Another method that can replace the Web server cache is to cache the result set in the SQL server itself and define multiple methods for browsing the results. Although this article focuses on the application design at the web server (ASP) level, the programmable function of SQL Server also provides a powerful framework for generating high-performance Web Search applications.

Summary

Microsoft SQL Server 2000 provides a reliable, fast, and flexible way to index and query unstructured text data stored in a database. To widely apply this fast and accurate search function to a variety of applications, it is necessary to make full use of its speed and accuracy to implement a full-text search solution. By distributing computing loads and organizing data in some clever ways, you can save money to purchase other hardware and software to get rid of the troubles caused by unnecessary slow queries. Many factors and precautions should be taken into account when developing excellent search applications, we hope that the information and examples provided in this article will help you learn how to use SQL Server 2000 to generate excellent web search applications.

Appendix A: best choice for full-text search

One feasible way to improve the performance and effectiveness of full-text query is to implement the "Best Choice" system. This system is a simple method to ensure that some rows matching a specific query expression are returned before other rows. The best choice is that there is no complex pre-programming logic (for example, the SharePoint Portal Server contains such logic). Therefore, it is usually the first choice.

In this example, select the best choice and store the unique primary key and some keywords in a separate table. The freetexttable query is performed on the (very small) Best Selection table, and any results returned from the query are returned together with the freetexttable query results of the base table. Given these search conditions, all "Best Choice" rows are first returned, followed by rows that are most correlated by MSSEARCH (returned in descending order ).

The following is a very simple example script for creating the best selection system.

Use mydbcreate table documenttable (ftkey int not null, document ntext) create unique index dtftkey_idx on documenttable (ftkey)/* Insert a document here (all documents to generate a full-text index) */-- create full-text directories and indexes for all document tables exec sp_fulltext_catalog 'documents _ cat', 'create', 'f:/ftcats' exec sp_fulltext_table 'documenttable', 'create ', 'documents _ cat', 'dtftkey _ idx' exec sp_fulltext_column 'documenttable', 'document', 'add' exec sp_fulltext_table 'documenttable ', 'Start _ change_tracking' exec sp_fulltext_table 'documenttable', 'start _ background_updateindex'/* create the best selection table and index (Add the document that should always be returned first) */create table bestbets (ftkey int not null, keywords ntext) create unique index bbftkey_idx on bestbets (ftkey) /* Insert the best choice here */-- create the full-text directory and index exec sp_fulltext_catalog 'bestbets _ cat', 'create', 'f: /ftcats 'exec sp_fulltext_table 'bestbets', 'create', 'bestbets _ cat', 'bbftkey _ idx' exec sp_fulltext_column 'bestbets', 'keyword ', 'add' exec sp_fulltext_table 'bestbets', 'start _ change_tracking' exec sp_fulltext_table 'bestbets', 'start _ background_updateindex'

First, a general "all documents" table is created to store all documents to be indexed in full text. Generally, the document table contains other columns, but in this article, it only contains two columns-primary key index and the document itself. Full-text directories and indexes are created for the document table.

The "Best Choice" table is created to store the special documents that are first returned from all full-text queries. This table only needs to have the full-text primary key columns and the document itself (the policy to use some documents as the query target is optimized, including adding other keywords to documents not included in this document itself ). Full-text directories and indexes are created for the optimal table selection.

The best choice table and document table can share the document (the best choice document is also stored in the general document table, they share the same primary key value ), they can also be mutually exclusive (the best choice document is only stored in the best choice table ). To facilitate retrieval, it is easier to make the best selection table and the document table mutually exclusive-in this way, you do not need to delete the sharing operation from the best selection and the returned normal search result row set. On the other hand, it may be difficult to maintain the document using this method, because in this method, you need to add logic in the query to delete the shared documents between the returned row sets.

If the preceding table is specified, you can create two stored procedures to search for the best selection table and document table. You can use web server-Level Logic or other stored procedures to cache and display the desired results (when used together with the best choice, see the following complete and valid examples for caching, display, and paging ).

First, create a stored procedure for retrieving the best selection rows (if any:

create procedure BBSearch @searchTerm varchar(1024) asselect [key], [rank] from freetexttable(bestBets, keywords, @searchTerm) order by [rank] desc

Make sure that incoming search strings are cleaned up to avoid arbitrary T-SQL on the server and to ensure that they are enclosed in single quotes. In this case, using freetexttable is better than using containstable, because freetexttable will use the root tracing function and find the best choice that matches any search term.

Next, the second Stored Procedure searches for documents that match the general search criteria (if any ):

create procedure FTSearch @searchTerm varchar(1024) asselect [key], [rank] from freetexttable(documentTable, keywords, @searchTerm) order by [rank] desc

In addition, make sure that you have cleared the input search string and enclose it in single quotes.

When executing these stored procedures, you should input the same search term in the two stored procedures. First, perform the best selection search and then perform the common full-text search. The next section describes how to use the best choice with other full-text search technologies when building a web search application.

Appendix B: sample applications that use the best choice, result paging, and valid full-text query Logic

In this example, we have implemented a web search application that uses almost all the optimization solutions described in this article. We use a simple search engine solution for the Directory of online retailers, and assume that all users expect results in a very short response time when the traffic is high. This example uses the best selection table and stored procedure in the previous section.

This application is just a simple example of an advanced policy that can be used to achieve optimal full-text search performance. In this example, ASP is used. You can also use ISAPI, ASP. NET, or other platforms to implement similar solutions with their respective advantages and disadvantages. Session objects are not necessarily applicable to all applications. improper use may pose a certain level of risk. In this example, we use session objects to implement a fast and effective cache mechanism-of course there are many other ways to implement this function to varying degrees.

The following is the general code of the ASP page:

<% @ Language = "VBScript" %> <% response. buffer = true %> <HTML> 

As shown in the preceding two code examples, creating a web application that can execute valid full-text queries (with the best choice) and cache and display results by page does not require much effort. With the lowest system overhead, you can add the logic used to provide other data, enhance the appearance of the best choice, and navigate to the search results (In addition, we strongly recommend that you implement other strict logic for error handling, security setting, and clearing incoming data ).

With the above advanced recommendations and examples, it is easy to use SQL Server 2000 full-text search to design and implement a fast and scalable Web search application.

Appendix C: Resources

Full-text search deployment)

It is the best reference for users who are first engaged in full-text search. This section describes filling methods, hardware and software requirements, and provides tips, tips, and other documents for using SQL Server 2000 full-text search.

Full-text search for public news groups (Microsoft. Public. sqlserver. Fulltext)

Find answers to full-text search questions and useful tips and tips. The full-text search newsgroup is a frequent place for SQL Server development teams and knowledgeable Microsoft MVP members.

Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.