Does the scws_get_words function of SCWS have a bug?

Last Update:2018-05-25 Source: Internet

Author: User



SCWS is an excellent Chinese word segmentation library developed in China, and its PHP extension makes Chinese word segmentation easy. I have run into a problem with one of its functions, scws_get_words. This function returns the word segmentation results, and its second parameter specifies which results to return. Here is the C API documentation for the function (the PHP version is similar):

· scws_top_t scws_get_words(scws_t s, char *xattr);
Description: Returns the table of keywords that have the specified parts of speech. Words are added to the list in the order in which they appear. The xattr parameter names the parts of speech to exclude from, or include in, the statistics. Multiple parts of speech are separated by commas. When xattr starts with ~, the results do not contain these parts of speech; otherwise the results contain only these parts of speech. NULL means that all parts of speech are counted.
Return value: The head pointer of the linked list that holds the word table. The word table must be released by calling scws_free_tops.
Errors: None

In other words, I only need to pass a comma-separated list as the second parameter. For example, passing '~Ag,~a,~ad,~b,~c,~Dg,~d,~e' should mean that I want to filter these parts of speech out of the result.

In practice, however, no matter how many filter conditions I add, none of them takes effect. On the contrary, if I add only one filter condition, such as '~a' (that is, with no comma), the corresponding results are filtered out. So I wondered whether there is a bug here. The C implementation of the function is attached below; let's take a look.

// get words by attr (rand order)
scws_top_t scws_get_words(scws_t s, char *xattr)
{
    int off, cnt, xmode = SCWS_NA;
    xtree_t xt;
    scws_res_t res, cur;
    scws_top_t top, tail, base;
    char *word;
    word_attr *at = NULL;

    if (!s || !s->txt || !(xt = xtree_new(0,1)))
        return NULL;

    __PARSE_XATTR__;

    // save the offset.
    off = s->off;
    s->off = 0;
    base = tail = NULL;
    while ((cur = res = scws_get_result(s)) != NULL)
    {
        do
        {
            /* check attribute filter */
            if (at != NULL)
            {
                if ((xmode == SCWS_NA) && !_attr_belong(cur->attr, at))
                    continue;
                if ((xmode == SCWS_YEA) && _attr_belong(cur->attr, at))
                    continue;
            }

            /* put to the stats */
            if (!(top = xtree_nget(xt, s->txt + cur->off, cur->len, NULL)))
            {
                top = (scws_top_t) malloc(sizeof(struct scws_topword));
                top->weight = cur->idf;
                top->times = 1;
                top->next = NULL;
                top->word = (char *)_mem_ndup(s->txt + cur->off, cur->len);
                strncpy(top->attr, cur->attr, 2);
                // add to the chain
                if (tail == NULL)
                    base = tail = top;
                else
                {
                    tail->next = top;
                    tail = top;
                }
                xtree_nput(xt, top, sizeof(struct scws_topword), s->txt + cur->off, cur->len);
            }
            else
            {
                top->weight += cur->idf;
                top->times++;
            }
        }
        while ((cur = cur->next) != NULL);
        scws_free_result(res);
    }

    // free at & xtree
    if (at != NULL)
        free(at);
    xtree_free(xt);

    // restore the offset
    s->off = off;
    return base;
}

I found that its __PARSE_XATTR__ macro has a problem. The macro is shown below, together with the definition of the word_attr type.

/* macro to parse xattr -> xmode, at */
#define __PARSE_XATTR__ do { \
    if (xattr == NULL) break; \
    if (*xattr == '~') { xattr++; xmode = SCWS_YEA; } \
    if (*xattr == '\0') break; \
    cnt = ((strlen(xattr)/2) + 2) * sizeof(word_attr); \
    at = (word_attr *) malloc(cnt); \
    memset(at, 0, cnt); \
    cnt = 0; \
    for (cnt = 0; (word = strchr(xattr, ',')); cnt++) { \
        strncpy(at[cnt], xattr, 2); \
        xattr = word + 1; \
    } \
    strncpy(at[cnt], xattr, 2); \
} while (0)

typedef char word_attr[4];

As written, xattr is parsed correctly only when every part of speech is exactly two characters long, because the macro uses strncpy(at[cnt], xattr, 2);. This is too sloppy. The part-of-speech table contains plenty of single-character tags, and for those the copy picks up the comma as well.

I tried filtering with only two-character parts of speech, and that worked fine. So the question is how to change the code here.

Reply content:

After I discussed this with the author, hightman, he provided a patch that modifies the macro definition.

diff -c -r1.28 -r1.29
*** libscws/scws.c  5 Aug 2011 04:39:33 -0000   1.28
--- libscws/scws.c  26 Oct 2011 08:41:44 -0000  1.29
***************
*** 1278,1284 ****
    memset(at, 0, cnt);                                 \
    cnt = 0;                                            \
    for (cnt = 0; (word = strchr(xattr, ',')); cnt++) { \
!       strncpy(at[cnt], xattr, 2);                     \
        xattr = word + 1;                               \
    }                                                   \
    strncpy(at[cnt], xattr, 2);                         \
--- 1278,1285 ----
    memset(at, 0, cnt);                                 \
    cnt = 0;                                            \
    for (cnt = 0; (word = strchr(xattr, ',')); cnt++) { \
!       at[cnt][0] = *xattr++;                          \
!       at[cnt][1] = xattr == word ? '\0' : *xattr;     \
        xattr = word + 1;                               \
    }                                                   \
    strncpy(at[cnt], xattr, 2);                         \

