First, the business scenario
We encountered the need to retrieve subtree data for a parent-child relationship in a real-world production environment
It is estimated that you have encountered a similar scenario, most typical of which is the provincial and municipal data, where the path field is the administrative path generated by the hierarchical relationship:
If we know of a city name, want to find out the same class and high-level provincial names, such as Baoding and superior Hebei Province, then we have what way to achieve?
Give everyone 10 seconds, quick answer
.
.
.
Time to
There are probably several ways to do this:
Newly added hierarchical data types in a.2008
B.cte Recursive method
C. Direct program processing
D. The more silly way: full-text Index ...
Hierarchical relationships are the best way to implement this kind of business logic, and the classic way is recursive, but we don't talk about both methods content and implementation, because today's topic is: The pit in the full-text index, because this is the case that I actually encountered
Our business logic is this, known one or several IDs, but do not know the hierarchy of the ID, the required data is to extract all levels of the corresponding ID, for example, I passed ID in (9,910) then all the data in Hebei province, if I pass ID in (910), Then the data is Baoding and all the subordinate districts.
Second, the problem arises
To simplify the presentation, I reduced the data to the table Pathtest, the data as the form
According to the business requirements described in the previous paragraph, the path requires a full-text index for easy retrieval, a process that I believe everyone knows how to build, not to describe the process.
Let's take a look at the actual number of data rows that contain 66. (or id=66 and the following sub-tree data sets)
SELECT * from pathtest where path like '%,66,% '
(49 rows affected)
The result is omitted, but can be seen as 49 rows
Then I run the following query, the query path contains 66 of the data set, that is, id=66 and sub-tree below
SELECT * from Pathtest where FREETEXT (path, ' "66")
(17 rows affected)
Display is 17 rows of results??
So much worse? That is, the full-text index is not accurate, this is too much worse
OK, let's go.
SELECT * from Pathtest where FREETEXT (path, ' "66,480")
Also, logically, this query retrieves the subtree under the 66-level ID (PS: Here 480 is under 66 subtrees)
But:
This is more convenient to see, there is no data, what is the matter?
Three, the problem research
Some of the experience of the students must have thought that this is the full-text indexing when the participle of the problem, then we look at
Sys.dm_fts_parser This function is a keyword split for a full-text index of a string, see MSDN for more information on using
I take two path strings respectively
1.path= ', 66, 73, '
2.path= ', 66,480, '
Let's take a look at the results of the split
SELECT * from Sys.dm_fts_parser (' ", 66, 73," ', 1033,0,1)
Well, split the 66,73 into the string keyword "66", "73" and the value (NN start) of 66, 731 total four records, no problem
SELECT * from Sys.dm_fts_parser (' ", 66,480," ', 1033,0,1)
The problem arises, and the string does not break down into "66" and "480" keywords as we would like, but instead generates two keywords of "66,480" and 66480.
So we return a record that satisfies the condition when we search for freetext (path, ' "66,480"), and FREETEXT (path, ' 480 ") does not satisfy the keyword's record.
Here is a problem with a full-text index keyword splitting algorithm, whichSQL Server treats like 66,480 as a pure number with a thousand-bit delimiter, so it is no longer split
But a string like 66,73, which is not a decimal separator representation of numbers, is split
If we rewrite the 66,480 string to 66,480, 1, the string will be split as follows:
SELECT * from Sys.dm_fts_parser (' ", 66,480, 1," ', 1033,0,1)
At the same time, we know that freetext this function is also to establish a full-text index of the search term after the split and the full-text index comparison, so in the where FREETEXT (path, ' "66,480") retrieval, but also the string is processed, However, it is also possible to retrieve a path= ' 66,480 ' record by separating the string with a thousand-digit separator number, but in the case where FREETEXT (path, ' "480") is in full-text indexing, And did not split 480 of this string keyword, so there is no record to meet the criteria
If we rewrite the retrieval string to where FREETEXT (path, ' "66 480"), then for this logic it is to retrieve the result set that satisfies the 66 or 480 keyword in the full-text index, in this case, the result set is 17, that is, with the Where FREETEXT ( Path, ' "66" ') is the same (because the space character is broken to break characters to "66" and "480", the table satisfies 66 result set is 17, 480 is 0, so the total number of results is 17)
Iv. Solutions
Now that we know it's "," the string that caused the mistake, we'll replace it.
Update pathtest set Path=replace (Path, ', ', '; ')
When full-text index repopulate is complete, execute
SELECT * from Pathtest where FREETEXT (path, ' "66")
(49 rows affected)
Good! Consistent with the results of direct like
Then we run the SELECT * from Pathtest where FREETEXT (path, ' "66,480")
(0 rows affected)
!!! What's the situation!?
We said before, "66,480" will be mistaken for split, so the search term here can not be so written, you may write a space, you can also write a semicolon, anyway, it is impossible to use a comma, the query instead of
SELECT * from Pathtest where FREETEXT (path, ' "66 480" ')
(49 rows affected)
SELECT * from Pathtest where FREETEXT (path, ' "480")
(1 rows affected)
Okay, this is consistent with the expected results.
Five, PostScript
This problem was resolved before an episode, the "," replaced with "/", but in the split when the/66/480/split into "66" and "480/", and later will/join the stoplist, this problem solved, this test I did not reproduce the situation, Should be set at that time hyphenation language caused, interested students can play for themselves.
Remember that if you want to build a full-text index with a similar field, clear the comma problem, while in the FREETEXT search, also pay attention to "," the problem.
In addition, we see that the full-text index of SQL Server is actually filtering the data by hitting the keywords searched in the query criteria, and SQL Server matches the keywords searched in the query statement with the keywords collected in the full-text catalog, returning the data rows in the data table where the keyword matches successfully. However, this article can be seen to match the search for full-text catalog keywords, not the same as the LIKE keyword 100% return to meet the criteria of the data, if the full-text catalog is not enough or incorrect collection of keywords, full-text index matching will fail, resulting in incorrect query results, So use full-text indexing in SQL Server and be careful!
Original link
Small pits in full-text indexing of SQL Server (reproduced)