Does the scws_get_words function in SCWS have a bug?

Source: Internet
Author: User
Tags idf
SCWS is an excellent Chinese word segmentation library developed in China, and its PHP extension makes Chinese word segmentation easy. I have run into a problem with one of its functions, scws_get_words. This function returns the segmentation results, and its second parameter specifies which results to return. Here is the description of the function from the C API documentation (the PHP version is similar):

· scws_top_t scws_get_words(scws_t s, char *xattr);
Description: Returns the table of keywords with the specified parts of speech. Words are inserted into the list in the order in which they appear. The xattr parameter specifies the parts of speech to exclude from or include in the statistics; multiple parts of speech are separated by commas. When it is prefixed with ~, the results exclude those parts of speech; otherwise only those parts of speech are included. NULL means all parts of speech are counted.
Return Value: Returns the head pointer of the linked list of the word table. The word table must be released by calling scws_free_tops.
Error: None

In other words, I only need to pass a comma-separated string as the second parameter. For example, passing '~Ag,~a,~ad,~b,~c,~Dg,~d,~e' should mean that I want words with these parts of speech filtered out of the results.

In practice, however, no matter how many filter conditions I add, none of them takes effect. Conversely, if I pass only a single condition such as '~a', that is, with no comma, the corresponding results are filtered out. So I wondered whether there was a bug here. The C implementation of the function is attached below; let's take a look.

// get words by attr (rand order)
scws_top_t scws_get_words(scws_t s, char *xattr)
{
    int off, cnt, xmode = SCWS_NA;
    xtree_t xt;
    scws_res_t res, cur;
    scws_top_t top, tail, base;
    char *word;
    word_attr *at = NULL;

    if (!s || !s->txt || !(xt = xtree_new(0,1)))
        return NULL;

    __PARSE_XATTR__;

    // save the offset.
    off = s->off;
    s->off = 0;
    base = tail = NULL;
    while ((cur = res = scws_get_result(s)) != NULL)
    {
        do
        {
            /* check attribute filter */
            if (at != NULL)
            {
                if ((xmode == SCWS_NA) && !_attr_belong(cur->attr, at))
                    continue;
                if ((xmode == SCWS_YEA) && _attr_belong(cur->attr, at))
                    continue;
            }

            /* put to the stats */
            if (!(top = xtree_nget(xt, s->txt + cur->off, cur->len, NULL)))
            {
                top = (scws_top_t) malloc(sizeof(struct scws_topword));
                top->weight = cur->idf;
                top->times = 1;
                top->next = NULL;
                top->word = (char *)_mem_ndup(s->txt + cur->off, cur->len);
                strncpy(top->attr, cur->attr, 2);
                // add to the chain
                if (tail == NULL)
                    base = tail = top;
                else
                {
                    tail->next = top;
                    tail = top;
                }
                xtree_nput(xt, top, sizeof(struct scws_topword), s->txt + cur->off, cur->len);
            }
            else
            {
                top->weight += cur->idf;
                top->times++;
            }
        }
        while ((cur = cur->next) != NULL);
        scws_free_result(res);
    }

    // free at & xtree
    if (at != NULL)
        free(at);
    xtree_free(xt);

    // restore the offset
    s->off = off;
    return base;
}
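To make the attribute check in the loop easier to follow, here is a minimal sketch of the include/exclude decision on its own. The names keep_word, attr_in_list, MODE_INCLUDE and MODE_EXCLUDE are hypothetical; attr_in_list is only a stand-in for the library's internal _attr_belong, and it assumes exact string matching against a list that ends with an empty entry, which may not be how SCWS actually matches.

```c
#include <stddef.h>
#include <string.h>

typedef char word_attr[4];

/* Hypothetical stand-ins for SCWS_NA (no '~' prefix, keep only the listed
 * tags) and SCWS_YEA ('~' prefix, drop the listed tags). */
enum { MODE_INCLUDE, MODE_EXCLUDE };

/* Assumed behavior: exact match against a list terminated by an empty
 * entry. The real _attr_belong may work differently. */
static int attr_in_list(const char *attr, word_attr *list)
{
    for (; (*list)[0] != '\0'; list++) {
        if (strcmp(attr, *list) == 0)
            return 1;
    }
    return 0;
}

/* Returns 1 if a word with this part of speech should be counted. */
static int keep_word(const char *attr, word_attr *list, int mode)
{
    int listed;

    if (list == NULL)       /* no filter: count every part of speech */
        return 1;
    listed = attr_in_list(attr, list);
    return (mode == MODE_INCLUDE) ? listed : !listed;
}
```

Under these assumptions a list entry matches only if it equals the word's tag exactly, so an entry that was parsed with a stray character in it would never match anything.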

I found that its __PARSE_XATTR__ macro has some problems. The definition of the word_attr type is attached along with it.

/* macro to parse xattr -> xmode, at */
#define __PARSE_XATTR__ do { \
    if (xattr == NULL) break; \
    if (*xattr == '~') { xattr++; xmode = SCWS_YEA; } \
    if (*xattr == '\0') break; \
    cnt = ((strlen(xattr)/2) + 2) * sizeof(word_attr); \
    at = (word_attr *) malloc(cnt); \
    memset(at, 0, cnt); \
    cnt = 0; \
    for (cnt = 0; (word = strchr(xattr, ',')); cnt++) { \
        strncpy(at[cnt], xattr, 2); \
        xattr = word + 1; \
    } \
    strncpy(at[cnt], xattr, 2); \
} while (0)

typedef char word_attr[4];

As written, xattr is parsed correctly only when every part of speech is two characters long, because the macro uses strncpy(at[cnt], xattr, 2);. This is too careless. The part-of-speech table contains plenty of one-character tags, and for those the copy also picks up the comma.
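The effect is easy to reproduce outside the library. The sketch below wraps the same copy loop in a function; parse_xattr_original is a hypothetical name, and the allocation is simplified to one slot per comma plus two spare slots rather than the macro's own size formula.

```c
#include <stdlib.h>
#include <string.h>

typedef char word_attr[4];

/* Same copy steps as the original macro body. Returns a zero-filled
 * array ending with an empty entry; the caller must free() it. */
static word_attr *parse_xattr_original(const char *xattr)
{
    const char *p, *comma;
    size_t cnt, slots = 2;
    word_attr *at;

    for (p = xattr; *p != '\0'; p++) {
        if (*p == ',')
            slots++;
    }
    at = (word_attr *) calloc(slots, sizeof(word_attr));
    if (at == NULL)
        return NULL;

    for (cnt = 0; (comma = strchr(xattr, ',')) != NULL; cnt++) {
        /* Always copies two bytes, so a one-character tag such as "a"
         * is stored together with the comma that follows it. */
        strncpy(at[cnt], xattr, 2);
        xattr = comma + 1;
    }
    strncpy(at[cnt], xattr, 2);   /* last item: no comma follows it */
    return at;
}
```

For the input "a,ad,b" the first entry comes out as "a," while "ad" and the final "b" are stored correctly, which matches the observation that a single condition without a comma works.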

I tried filtering with only two-character parts of speech, and that works fine. Now to think about how this should be changed.

Reply content:

After I discussed this with the author, hightman, he provided a patch that modifies the macro definition.

diff -c -r1.28 -r1.29
*** libscws/scws.c  5 Aug 2011 04:39:33 -0000   1.28
--- libscws/scws.c  26 Oct 2011 08:41:44 -0000  1.29
***************
*** 1278,1284 ****
    memset(at, 0, cnt);                                 \
    cnt = 0;                                            \
    for (cnt = 0; (word = strchr(xattr, ',')); cnt++) { \
!       strncpy(at[cnt], xattr, 2);                     \
        xattr = word + 1;                               \
    }                                                   \
    strncpy(at[cnt], xattr, 2);                         \
--- 1278,1285 ----
    memset(at, 0, cnt);                                 \
    cnt = 0;                                            \
    for (cnt = 0; (word = strchr(xattr, ',')); cnt++) { \
!       at[cnt][0] = *xattr++;                          \
!       at[cnt][1] = xattr == word ? '\0' : *xattr;     \
        xattr = word + 1;                               \
    }                                                   \
    strncpy(at[cnt], xattr, 2);                         \
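To check what the patched loop does with mixed one- and two-character tags, here is a small standalone sketch. parse_xattr_patched is a hypothetical wrapper name, the allocation is simplified to one slot per comma plus two spare slots, and the input is assumed to be well formed (no empty items between commas).

```c
#include <stdlib.h>
#include <string.h>

typedef char word_attr[4];

/* Copy loop from the 1.29 patch, wrapped in a function. Returns a
 * zero-filled array ending with an empty entry; the caller must
 * free() it. */
static word_attr *parse_xattr_patched(const char *xattr)
{
    const char *p, *comma;
    size_t cnt, slots = 2;
    word_attr *at;

    for (p = xattr; *p != '\0'; p++) {
        if (*p == ',')
            slots++;
    }
    at = (word_attr *) calloc(slots, sizeof(word_attr));
    if (at == NULL)
        return NULL;

    for (cnt = 0; (comma = strchr(xattr, ',')) != NULL; cnt++) {
        at[cnt][0] = *xattr++;
        /* Take a second character only if the comma has not been
         * reached yet, so one-character tags stay one character. */
        at[cnt][1] = (xattr == comma) ? '\0' : *xattr;
        xattr = comma + 1;
    }
    strncpy(at[cnt], xattr, 2);   /* last item: no comma follows it */
    return at;
}
```

With the input "a,ad,b,Ag" the entries come out as "a", "ad", "b" and "Ag", so the comma is no longer copied into one-character tags.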
