Small pits in full-text indexing of SQL Server (reproduced)

Source: Internet
Author: User

First, the business scenario

We encountered the need to retrieve subtree data for a parent-child relationship in a real-world production environment

It is estimated that you have encountered a similar scenario, most typical of which is the provincial and municipal data, where the path field is the administrative path generated by the hierarchical relationship:

If we know of a city name, want to find out the same class and high-level provincial names, such as Baoding and superior Hebei Province, then we have what way to achieve?

Give everyone 10 seconds, quick answer

.

.

.

Time to

There are probably several ways to do this:

Newly added hierarchical data types in a.2008

B.cte Recursive method

C. Direct program processing

D. The more silly way: full-text Index ...

Hierarchical relationships are the best way to implement this kind of business logic, and the classic way is recursive, but we don't talk about both methods content and implementation, because today's topic is: The pit in the full-text index, because this is the case that I actually encountered

Our business logic is this, known one or several IDs, but do not know the hierarchy of the ID, the required data is to extract all levels of the corresponding ID, for example, I passed ID in (9,910) then all the data in Hebei province, if I pass ID in (910), Then the data is Baoding and all the subordinate districts.

Second, the problem arises

To simplify the presentation, I reduced the data to the table Pathtest, the data as the form

According to the business requirements described in the previous paragraph, the path requires a full-text index for easy retrieval, a process that I believe everyone knows how to build, not to describe the process.

Let's take a look at the actual number of data rows that contain 66. (or id=66 and the following sub-tree data sets)

SELECT * from pathtest where path like '%,66,% '

(49 rows affected)

The result is omitted, but can be seen as 49 rows

Then I run the following query, the query path contains 66 of the data set, that is, id=66 and sub-tree below

SELECT * from Pathtest where FREETEXT (path, ' "66")

(17 rows affected)

Display is 17 rows of results??

So much worse? That is, the full-text index is not accurate, this is too much worse

OK, let's go.

SELECT * from Pathtest where FREETEXT (path, ' "66,480")

Also, logically, this query retrieves the subtree under the 66-level ID (PS: Here 480 is under 66 subtrees)

But:

This is more convenient to see, there is no data, what is the matter?

Three, the problem research

Some of the experience of the students must have thought that this is the full-text indexing when the participle of the problem, then we look at

Sys.dm_fts_parser This function is a keyword split for a full-text index of a string, see MSDN for more information on using

I take two path strings respectively

1.path= ', 66, 73, '

2.path= ', 66,480, '

Let's take a look at the results of the split

SELECT * from Sys.dm_fts_parser (' ", 66, 73," ', 1033,0,1)

Well, split the 66,73 into the string keyword "66", "73" and the value (NN start) of 66, 731 total four records, no problem

SELECT * from Sys.dm_fts_parser (' ", 66,480," ', 1033,0,1)

The problem arises, and the string does not break down into "66" and "480" keywords as we would like, but instead generates two keywords of "66,480" and 66480.

So we return a record that satisfies the condition when we search for freetext (path, ' "66,480"), and FREETEXT (path, ' 480 ") does not satisfy the keyword's record.

Here is a problem with a full-text index keyword splitting algorithm, whichSQL Server treats like 66,480 as a pure number with a thousand-bit delimiter, so it is no longer split

But a string like 66,73, which is not a decimal separator representation of numbers, is split

If we rewrite the 66,480 string to 66,480, 1, the string will be split as follows:

SELECT * from Sys.dm_fts_parser (' ", 66,480, 1," ', 1033,0,1)

At the same time, we know that freetext this function is also to establish a full-text index of the search term after the split and the full-text index comparison, so in the where FREETEXT (path, ' "66,480") retrieval, but also the string is processed, However, it is also possible to retrieve a path= ' 66,480 ' record by separating the string with a thousand-digit separator number, but in the case where FREETEXT (path, ' "480") is in full-text indexing, And did not split 480 of this string keyword, so there is no record to meet the criteria

If we rewrite the retrieval string to where FREETEXT (path, ' "66 480"), then for this logic it is to retrieve the result set that satisfies the 66 or 480 keyword in the full-text index, in this case, the result set is 17, that is, with the Where FREETEXT ( Path, ' "66" ') is the same (because the space character is broken to break characters to "66" and "480", the table satisfies 66 result set is 17, 480 is 0, so the total number of results is 17)

Iv. Solutions

Now that we know it's "," the string that caused the mistake, we'll replace it.

Update pathtest set Path=replace (Path, ', ', '; ')

When full-text index repopulate is complete, execute

SELECT * from Pathtest where FREETEXT (path, ' "66")

(49 rows affected)

Good! Consistent with the results of direct like

Then we run the SELECT * from Pathtest where FREETEXT (path, ' "66,480")

(0 rows affected)

!!! What's the situation!?

We said before, "66,480" will be mistaken for split, so the search term here can not be so written, you may write a space, you can also write a semicolon, anyway, it is impossible to use a comma, the query instead of

SELECT * from Pathtest where FREETEXT (path, ' "66 480" ')

(49 rows affected)

SELECT * from Pathtest where FREETEXT (path, ' "480")

(1 rows affected)

Okay, this is consistent with the expected results.

Five, PostScript

This problem was resolved before an episode, the "," replaced with "/", but in the split when the/66/480/split into "66" and "480/", and later will/join the stoplist, this problem solved, this test I did not reproduce the situation, Should be set at that time hyphenation language caused, interested students can play for themselves.

Remember that if you want to build a full-text index with a similar field, clear the comma problem, while in the FREETEXT search, also pay attention to "," the problem.

In addition, we see that the full-text index of SQL Server is actually filtering the data by hitting the keywords searched in the query criteria, and SQL Server matches the keywords searched in the query statement with the keywords collected in the full-text catalog, returning the data rows in the data table where the keyword matches successfully. However, this article can be seen to match the search for full-text catalog keywords, not the same as the LIKE keyword 100% return to meet the criteria of the data, if the full-text catalog is not enough or incorrect collection of keywords, full-text index matching will fail, resulting in incorrect query results, So use full-text indexing in SQL Server and be careful!

Original link

Small pits in full-text indexing of SQL Server (reproduced)

Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.