Is there a bug in SCWS's scws_get_words function?

Last Update:2016-06-06 Source: Internet

Author: User

Tags idf

SCWS is a very good Chinese word segmentation library, and its PHP extension makes word segmentation easy. I have run into a problem with one of its functions, scws_get_words. This function returns the keywords found by segmentation, and its second parameter specifies which parts of speech to include in or exclude from the result. Here is the description from the C API documentation (the PHP version is similar):

scws_top_t scws_get_words(scws_t s, char *xattr);
Description: Returns the keyword list for the specified parts of speech; words are inserted into the list in the order in which they appear. The parameter xattr describes which parts of speech to exclude from or include in the statistics, with multiple parts of speech separated by commas. If it starts with ~, these parts of speech are excluded from the results; otherwise only these parts of speech are included. Passing NULL counts all parts of speech.
Return value: Returns the head pointer of the keyword list, which must be released by calling scws_free_tops.
Errors: None
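
To make the documented rules concrete, here is a minimal sketch of the include/exclude decision as I read it from the description above. The helper names attr_in_list and xattr_accepts are hypothetical and are not part of SCWS. It also assumes, as the macro quoted later in this post suggests, that the ~ prefix appears only once, at the very start of the string.

```c
#include <string.h>

/* Hypothetical helper (not part of SCWS): returns 1 if `attr` equals one
 * of the comma-separated entries in `list`, 0 otherwise. */
static int attr_in_list(const char *attr, const char *list)
{
    size_t alen = strlen(attr);
    const char *p = list;

    while (*p != '\0') {
        const char *comma = strchr(p, ',');
        size_t len = comma ? (size_t)(comma - p) : strlen(p);

        if (len == alen && strncmp(p, attr, len) == 0)
            return 1;
        if (comma == NULL)
            break;
        p = comma + 1;
    }
    return 0;
}

/* Hypothetical helper (not part of SCWS): returns 1 if a word whose part
 * of speech is `attr` should be counted under the documented xattr rules. */
static int xattr_accepts(const char *attr, const char *xattr)
{
    if (xattr == NULL || *xattr == '\0')
        return 1;                               /* NULL: count everything */
    if (*xattr == '~')
        return !attr_in_list(attr, xattr + 1);  /* exclusion list */
    return attr_in_list(attr, xattr);           /* inclusion list */
}
```

Under this reading, xattr_accepts("ad", "~a,ad,b") is 0 (excluded) and xattr_accepts("n", "~a,ad,b") is 1 (kept).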

In other words, I only need to pass a comma-separated string as the second argument, for example '~Ag,~a,~ad,~b,~c,~Dg,~d,~e', to filter those parts of speech out of the results.

In practice, however, a filter with several entries has no effect no matter how many are listed, whereas a filter with a single entry and no comma, such as '~a', does remove the matching results. That made me suspect a bug. The C implementation of the function is below; please take a look.

// get words by attr (rand order)
scws_top_t scws_get_words(scws_t s, char *xattr)
{
    int off, cnt, xmode = SCWS_NA;
    xtree_t xt;
    scws_res_t res, cur;
    scws_top_t top, tail, base;
    char *word;
    word_attr *at = NULL;

    if (!s || !s->txt || !(xt = xtree_new(0, 1)))
        return NULL;

    __PARSE_XATTR__;

    // save the offset.
    off = s->off;
    s->off = 0;
    base = tail = NULL;
    while ((cur = res = scws_get_result(s)) != NULL)
    {
        do
        {
            /* check attribute filter */
            if (at != NULL)
            {
                if ((xmode == SCWS_NA) && !_attr_belong(cur->attr, at))
                    continue;
                if ((xmode == SCWS_YEA) && _attr_belong(cur->attr, at))
                    continue;
            }

            /* put to the stats */
            if (!(top = xtree_nget(xt, s->txt + cur->off, cur->len, NULL)))
            {
                top = (scws_top_t) malloc(sizeof(struct scws_topword));
                top->weight = cur->idf;
                top->times = 1;
                top->next = NULL;
                top->word = (char *) _mem_ndup(s->txt + cur->off, cur->len);
                strncpy(top->attr, cur->attr, 2);
                // add to the chain
                if (tail == NULL)
                    base = tail = top;
                else
                {
                    tail->next = top;
                    tail = top;
                }
                xtree_nput(xt, top, sizeof(struct scws_topword), s->txt + cur->off, cur->len);
            }
            else
            {
                top->weight += cur->idf;
                top->times++;
            }
        }
        while ((cur = cur->next) != NULL);
        scws_free_result(res);
    }

    // free at & xtree
    if (at != NULL)
        free(at);
    xtree_free(xt);

    // restore the offset
    s->off = off;
    return base;
}
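
Apart from the filter, the function above accumulates statistics: the first time a word is seen it is appended to a linked list, and later occurrences only bump its count and add to its weight. The sketch below illustrates that idea in isolation. It is not the SCWS code: struct top_entry, top_add and top_free are hypothetical names, and a plain linear search stands in for the xtree lookup.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for struct scws_topword. */
struct top_entry {
    char *word;
    float weight;            /* sum of the idf of every occurrence */
    int times;               /* number of occurrences */
    struct top_entry *next;
};

/* Records one occurrence of `word`. A new word is appended at the tail so
 * the list keeps first-appearance order. Returns 0 on success, -1 if an
 * allocation fails. */
static int top_add(struct top_entry **base, struct top_entry **tail,
                   const char *word, float idf)
{
    struct top_entry *e;

    /* Linear search; the real code uses an xtree for this lookup. */
    for (e = *base; e != NULL; e = e->next) {
        if (strcmp(e->word, word) == 0) {
            e->weight += idf;
            e->times++;
            return 0;
        }
    }

    e = malloc(sizeof *e);
    if (e == NULL)
        return -1;
    e->word = malloc(strlen(word) + 1);
    if (e->word == NULL) {
        free(e);
        return -1;
    }
    strcpy(e->word, word);
    e->weight = idf;
    e->times = 1;
    e->next = NULL;

    if (*tail == NULL) {
        *base = *tail = e;
    } else {
        (*tail)->next = e;
        *tail = e;
    }
    return 0;
}

/* Releases every entry in the list. */
static void top_free(struct top_entry *base)
{
    while (base != NULL) {
        struct top_entry *next = base->next;
        free(base->word);
        free(base);
        base = next;
    }
}
```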

I found a problem in the __PARSE_XATTR__ macro. Here it is, along with the definition of the word_attr type:

/* macro to parse xattr -> xmode, at */
#define __PARSE_XATTR__ do {                                \
    if (xattr == NULL) break;                               \
    if (*xattr == '~') { xattr++; xmode = SCWS_YEA; }       \
    if (*xattr == '\0') break;                              \
    cnt = ((strlen(xattr) / 2) + 2) * sizeof(word_attr);    \
    at = (word_attr *) malloc(cnt);                         \
    memset(at, 0, cnt);                                     \
    cnt = 0;                                                \
    for (cnt = 0; (word = strchr(xattr, ',')); cnt++) {     \
        strncpy(at[cnt], xattr, 2);                         \
        xattr = word + 1;                                   \
    }                                                       \
    strncpy(at[cnt], xattr, 2);                             \
} while (0)

typedef char word_attr[4];

This way of parsing xattr only works when every part of speech is exactly 2 characters long, because each entry is copied with strncpy(at[cnt], xattr, 2);. That is too sloppy: the part-of-speech table contains plenty of single-character tags, and for those the copy also picks up the trailing comma.
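
The problem can be reproduced outside SCWS. The sketch below lifts the loop from the macro into a standalone function; parse_unpatched is a hypothetical name, and the caller is assumed to pass a zero-filled array with room for every entry, as the macro's memset provides.

```c
#include <string.h>

typedef char word_attr[4];

/* Same loop as the unpatched macro body, lifted into a function.
 * `at` must be zero-filled and large enough; returns the entry count. */
static int parse_unpatched(const char *xattr, word_attr *at)
{
    int cnt;
    const char *word;

    for (cnt = 0; (word = strchr(xattr, ',')) != NULL; cnt++) {
        /* Always copies 2 bytes, so a 1-character tag drags its comma along. */
        strncpy(at[cnt], xattr, 2);
        xattr = word + 1;
    }
    strncpy(at[cnt], xattr, 2);   /* last entry: no trailing comma to copy */
    return cnt + 1;
}
```

For the input "a,ad,b" the three stored entries are "a,", "ad" and "b". The first one can never equal a real part-of-speech tag, which matches the behaviour described above: 2-character tags and the final entry work, single-character tags followed by a comma do not.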

I tried a filter made up entirely of 2-character parts of speech and, sure enough, it worked. Now to think about how to fix this.

Reply content:

I got in touch with the author, and hightman provided a patch that modifies the macro definition.

diff -c -r1.28 -r1.29
*** libscws/scws.c 5 04:39:33-0000 1.28
--- libscws/scws.c 2011 Oct 08:41:44-0000 1.29
***************
*** 1278,1284 ****
      memset(at, 0, cnt);                                     \
      cnt = 0;                                                \
      for (cnt = 0; (word = strchr(xattr, ',')); cnt++) {     \
!         strncpy(at[cnt], xattr, 2);                         \
          xattr = word + 1;                                   \
      }                                                       \
      strncpy(at[cnt], xattr, 2);                             \
--- 1278,1285 ----
      memset(at, 0, cnt);                                     \
      cnt = 0;                                                \
      for (cnt = 0; (word = strchr(xattr, ',')); cnt++) {     \
!         at[cnt][0] = *xattr++;                              \
!         at[cnt][1] = xattr == word ? '\0' : *xattr;         \
          xattr = word + 1;                                   \
      }                                                       \
      strncpy(at[cnt], xattr, 2);                             \
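
To check the effect of the patch in isolation, the patched loop can be lifted into a standalone function in the same way. This is only a sketch: parse_patched is a hypothetical name, and the caller is again assumed to supply a zero-filled array that is large enough.

```c
#include <string.h>

typedef char word_attr[4];

/* Same loop as the patched macro body, lifted into a function.
 * `at` must be zero-filled and large enough; returns the entry count. */
static int parse_patched(const char *xattr, word_attr *at)
{
    int cnt;
    const char *word;

    for (cnt = 0; (word = strchr(xattr, ',')) != NULL; cnt++) {
        at[cnt][0] = *xattr++;
        /* If the comma comes right after the first character, the tag has
         * one character: store a terminator instead of the comma. */
        at[cnt][1] = (xattr == word) ? '\0' : *xattr;
        xattr = word + 1;
    }
    strncpy(at[cnt], xattr, 2);
    return cnt + 1;
}
```

With this version "a,ad,b" yields "a", "ad" and "b". Note that, judging from the macro, a ~ is only consumed once at the very start of xattr, so in a filter such as '~Ag,~a' the second entry would be stored as "~a" and would presumably still not match anything; '~Ag,a' appears to be the intended form.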



This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
