Test Methods and Tips for HubbleDotNet Index Word Segmentation

Last Update:2013-11-18 Source: Internet

Author: User

Word segmentation is a recurring concern. This article covers how to test index word segmentation in HubbleDotNet, along with a few word segmentation tips.

In Chinese search, word segmentation is a key technology. We often encounter cases where a keyword fails to match the corresponding document, and the cause is usually unsatisfactory word segmentation in the index. By the nature of inverted indexes, if the query keyword is not among the terms produced when the document was segmented for the index, the document cannot be found. To help users analyze such problems, HubbleDotNet provides several stored procedures for checking how text is segmented in the index.
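The point about inverted indexes can be illustrated with a minimal sketch. This is a toy index written for this article, not HubbleDotNet code, and the token list is a hypothetical segmentation result rather than real analyzer output.

```python
from collections import defaultdict

def build_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for token in tokens:
            index[token].add(doc_id)
    return index

def lookup(index, keyword):
    """Return the ids of documents whose indexed tokens include the keyword."""
    return sorted(index.get(keyword, set()))

# Hypothetical segmentation of one record (assumed, not real analyzer output).
docs = {1: ["six-party", "talks", "working", "group", "meeting"]}
index = build_index(docs)

print(lookup(index, "working"))        # the token exists, so record 1 is found
print(lookup(index, "working group"))  # never produced as a token, so nothing is found
```

The second lookup fails even though the record's text contains the phrase, because the lookup only ever sees the tokens that segmentation produced.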

Test Method of Word Segmentation
First, find the original text to test.
We often find that some records contain the query keyword but are not returned. In this case, we first need to find the original text of the problematic record. There are many ways to do this: you can search by docid, by id, or by other conditions.

The following is an example of an id-based search. If we find that the title of the record with id = 1 does not match, we can execute

Select * from table where id = 1

to find the original text of the record.

After finding the original text, we have two ways to view the word segmentation of records in the index.

Method 1: SP_TestAnalyzer
SP_TestAnalyzer: This stored procedure is used to test the Tokenize method of the analyzer on the server side.

This stored procedure has two parameters. The first is the name of the tokenizer (here we enter 'pangusegment'), and the second is the sentence to be tested.

Run the following statement to see the result:

SP_TestAnalyzer 'pangusegment', 'six-Party Talks Working Group Meeting on the nuclear-free will be held in Shenyang'

After execution, the word segmentation result is displayed. It shows some problems with how the original text was segmented. For example, a search for "Working Group" cannot match this record. We need to add the words "nuclear-free" and "Working Group" to the PanGu dictionary and test again. Once the segmentation is correct, rebuilding the index solves the problem.

Method 2: SP_FieldAnalyze
SP_FieldAnalyze: This stored procedure is used for word segmentation of specified fields in a specified table.

It has four parameters. Parameter 1 is the table name, parameter 2 is the field name, parameter 3 is the sentence to be segmented, and parameter 4 specifies whether to use the Tokenize function or the TokenizedForSqlClient function for word segmentation. The fourth parameter is optional: if it is omitted, the Tokenize function is used; if you enter 'sqlclient', the TokenizedForSqlClient function is used.
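As a small sketch of the parameter layout described above, the helper below assembles an SP_FieldAnalyze statement as text. The helper names are hypothetical, and doubling embedded single quotes is an assumed escaping rule borrowed from standard SQL; it has not been checked against HubbleDotNet.

```python
def sql_quote(value):
    """Wrap a value in single quotes, doubling embedded quotes (assumed escaping rule)."""
    return "'" + value.replace("'", "''") + "'"

def field_analyze_statement(table, field, text, sqlclient=False):
    """Build an SP_FieldAnalyze statement; the fourth parameter is added only on request."""
    args = [table, field, text]
    if sqlclient:
        args.append("sqlclient")  # selects TokenizedForSqlClient instead of Tokenize
    return "SP_FieldAnalyze " + ", ".join(sql_quote(a) for a in args)

print(field_analyze_statement("vnews", "title", "some sentence"))
print(field_analyze_statement("vnews", "title", "some sentence", sqlclient=True))
```

The function only builds the statement text; sending it to the server is left to whatever client you already use.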

First, we run the default case, which uses the Tokenize function for word segmentation.

SP_FieldAnalyze 'vnews', 'title', 'six-Party Talks Working Group Meeting on the nuclear-free will be held in Shenyang'

This statement uses the tokenizer of the Title field in the VNews table to segment the sentence 'six-Party Talks Working Group Meeting on the nuclear-free will be held in Shenyang'. The word segmentation result is the same as that of method 1.

Next, let's add the SqlClient parameter and see the effect:

SP_FieldAnalyze 'vnews', 'title', 'six-Party Talks Working Group Meeting on the nuclear-free will be held in Shenyang', 'sqlclient'

The word segmentation result differs once SqlClient is added, because SqlClient uses different word segmentation parameters. For PanGu word segmentation, when SqlClient is specified the server segments text using the program/hubbledotnet/default/PanGuSqlClient.xml configuration file. Without the SqlClient parameter, it uses the program/hubbledotnet/default/PanGu.xml configuration file.

The SqlClient option mainly serves to segment query strings. The GetKeywordAnalyzerStringFromServer function in the HubbleCommand class segments query strings, and the Hubble sample code calls it as well. Internally, this function calls the SP_FieldAnalyze stored procedure with the SqlClient parameter. Of course, real projects do not have to call this function: users can segment query strings with their own programs.

Word Segmentation Tips
In search, recall and precision pull against each other. To balance the two as far as possible, we can apply a few techniques at indexing time and at query time.

Tip 1.

Maximize word segmentation during indexing. If you use PanGu word segmentation for indexing, enable multi-gram segmentation and forced single-character (unigram) segmentation at indexing time, so that as many candidate terms as possible are indexed. For the query string, however, use exact segmentation to keep query precision high.
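The idea behind this tip can be sketched with a toy segmenter. This is not PanGu's algorithm: the exact mode below is plain forward maximum matching, the maximal mode simply emits every dictionary word plus every single character, and the dictionary and sample phrase are hypothetical.

```python
def exact_segment(text, dictionary):
    """Forward maximum matching: take the longest dictionary word at each position."""
    tokens, i = [], 0
    max_len = max(len(word) for word in dictionary)
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens

def maximal_segment(text, dictionary):
    """Emit every dictionary word found at any position, plus every single character."""
    tokens = set(text)  # forced single-character tokens
    for i in range(len(text)):
        for j in range(i + 2, len(text) + 1):
            if text[i:j] in dictionary:
                tokens.add(text[i:j])
    return tokens

# Hypothetical dictionary and sample phrase ("working group meeting").
dictionary = {"工作", "工作组", "会议"}
text = "工作组会议"

exact_index = set(exact_segment(text, dictionary))  # {"工作组", "会议"}
maximal_index = maximal_segment(text, dictionary)   # also holds "工作" and single characters

query = exact_segment("工作", dictionary)            # exact segmentation for the query
print(all(term in exact_index for term in query))    # False: "工作" was never indexed
print(all(term in maximal_index for term in query))  # True
```

Indexing with the maximal mode lets the shorter query term match, while segmenting the query exactly keeps it from being broken into loosely related fragments.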

Tip 2.

Handle synonyms at query time. We do not want to expand synonyms in the index. Instead, add the synonym expansion to the query string, which makes querying more flexible. In addition, you can give the original words and their synonyms different weights in the query to influence score ranking.
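A minimal sketch of query-time synonym expansion with weights follows. The synonym table, the weight values, and the scoring function are all assumptions made for illustration; HubbleDotNet's actual query syntax and scoring are not shown here.

```python
def expand_query(terms, synonyms, synonym_weight=0.5):
    """Return (term, weight) pairs: original terms at 1.0, synonyms at a lower weight."""
    weighted = []
    for term in terms:
        weighted.append((term, 1.0))
        for alt in synonyms.get(term, []):
            weighted.append((alt, synonym_weight))
    return weighted

def score(doc_tokens, weighted_terms):
    """Sum the weights of the query terms that occur among the document's tokens."""
    present = set(doc_tokens)
    return sum(weight for term, weight in weighted_terms if term in present)

# Hypothetical synonym table; the index itself stores no synonyms.
synonyms = {"laptop": ["notebook"]}
weighted = expand_query(["laptop"], synonyms)

print(score(["cheap", "laptop"], weighted))    # 1.0, matched on the original word
print(score(["cheap", "notebook"], weighted))  # 0.5, matched on the synonym only
```

Because the expansion lives in the query, the synonym table and weights can change without rebuilding the index.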

Tip 3.

Use the index component to perform substring matching, such as like '%xxx%'. Because the index component knows the position of each word in the original text, it can in theory implement a function similar to like '%xxx%'. This is especially useful for short-text search, where you then do not need to worry about Chinese word segmentation. HubbleDotNet will provide this fast solution in a future version. The current version provides like '*xxx*', but that feature is incomplete and slow.
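To show why stored positions make this possible, here is a toy sketch, assuming a positional index over single characters. It does not reflect how HubbleDotNet implements like, and the sample phrase is hypothetical.

```python
from collections import defaultdict

def build_positional_index(text):
    """Map each character to the list of positions where it occurs."""
    positions = defaultdict(list)
    for pos, ch in enumerate(text):
        positions[ch].append(pos)
    return positions

def contains(positions, pattern):
    """Emulate like '%pattern%': look for consecutive positions matching the pattern."""
    if not pattern:
        return True
    for start in positions.get(pattern[0], []):
        if all(start + k in positions.get(ch, []) for k, ch in enumerate(pattern)):
            return True
    return False

# Hypothetical short text ("working group meeting").
index = build_positional_index("工作组会议")

print(contains(index, "组会"))  # True: the characters are adjacent in the text
print(contains(index, "会组"))  # False: both characters exist, but not in this order
```

The pattern "组会" crosses a word boundary, so a dictionary-based segmenter would not emit it as a term, yet the positional check still finds it.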

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
