Is there a bug in SCWS's scws_get_words function?

Source: Internet
Author: User
Tags: idf
SCWS is a very good Chinese word segmentation library, and its PHP extension makes word segmentation easy. I have now run into a problem with one of its functions, scws_get_words. This function returns the segmentation results, and its second parameter specifies which results should be returned. Here is the description from the C API documentation (the PHP version is similar):

scws_top_t scws_get_words(scws_t s, char *xattr);
Description: Returns the keyword list for the specified parts of speech; the system inserts words into the list in the order in which they occur. The parameter xattr describes which parts of speech to exclude from or include in the statistics, with multiple parts of speech separated by commas. If it starts with ~, the listed parts of speech are excluded from the results; otherwise only the listed parts of speech are included. Passing NULL counts all parts of speech.
Return value: The head pointer of the word list, which must be released by calling scws_free_tops.
Errors: None

In other words, I only need to pass a comma-separated string as the second argument, for example '~Ag,~a,~ad,~b,~c,~Dg,~d,~e', to filter those parts of speech out of the results.

In practice, however, the filter does not work no matter how many entries I add. If I pass only a single entry such as '~a', with no comma, the corresponding results are filtered out correctly. So I suspect there is a bug here. The C implementation of the function is attached below; please take a look.

    /* get words by attr (rand order) */
    scws_top_t scws_get_words(scws_t s, char *xattr)
    {
        int off, cnt, xmode = SCWS_NA;
        xtree_t xt;
        scws_res_t res, cur;
        scws_top_t top, tail, base;
        char *word;
        word_attr *at = NULL;

        if (!s || !s->txt || !(xt = xtree_new(0, 1)))
            return NULL;

        __PARSE_XATTR__;

        // save the offset.
        off = s->off;
        s->off = 0;
        base = tail = NULL;
        while ((cur = res = scws_get_result(s)) != NULL)
        {
            do
            {
                /* check attribute filter */
                if (at != NULL)
                {
                    if ((xmode == SCWS_NA) && !_attr_belong(cur->attr, at))
                        continue;
                    if ((xmode == SCWS_YEA) && _attr_belong(cur->attr, at))
                        continue;
                }

                /* put to the stats */
                if (!(top = xtree_nget(xt, s->txt + cur->off, cur->len, NULL)))
                {
                    top = (scws_top_t) malloc(sizeof(struct scws_topword));
                    top->weight = cur->idf;
                    top->times = 1;
                    top->next = NULL;
                    top->word = (char *) _mem_ndup(s->txt + cur->off, cur->len);
                    strncpy(top->attr, cur->attr, 2);
                    // add to the chain
                    if (tail == NULL)
                        base = tail = top;
                    else
                    {
                        tail->next = top;
                        tail = top;
                    }
                    xtree_nput(xt, top, sizeof(struct scws_topword), s->txt + cur->off, cur->len);
                }
                else
                {
                    top->weight += cur->idf;
                    top->times++;
                }
            }
            while ((cur = cur->next) != NULL);
            scws_free_result(res);
        }

        // free at & xtree
        if (at != NULL)
            free(at);
        xtree_free(xt);

        // restore the offset
        s->off = off;
        return base;
    }

I found that the problem lies in the __PARSE_XATTR__ macro. Here is the macro, together with the definition of the word_attr type:

    /* macro to parse xattr -> xmode, at */
    #define __PARSE_XATTR__ do { \
        if (xattr == NULL) break; \
        if (*xattr == '~') { xattr++; xmode = SCWS_YEA; } \
        if (*xattr == '\0') break; \
        cnt = ((strlen(xattr) / 2) + 2) * sizeof(word_attr); \
        at = (word_attr *) malloc(cnt); \
        memset(at, 0, cnt); \
        cnt = 0; \
        for (cnt = 0; (word = strchr(xattr, ',')); cnt++) { \
            strncpy(at[cnt], xattr, 2); \
            xattr = word + 1; \
        } \
        strncpy(at[cnt], xattr, 2); \
    } while (0)

    typedef char word_attr[4];

This way of parsing xattr only handles parts of speech that are exactly two characters long, because it always calls strncpy(at[cnt], xattr, 2). That is too sloppy: the part-of-speech table contains plenty of one-character parts of speech, and for those the copy picks up the trailing comma as well.
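To make the effect concrete, here is a minimal C sketch that mirrors the parsing loop of the macro as quoted above. The function name parse_xattr_original and its excluded out-parameter are hypothetical and not part of SCWS. The table is also sized by counting commas rather than with the macro's own formula, purely to keep the sketch simple.

```c
#include <stdlib.h>
#include <string.h>

typedef char word_attr[4];

/* Hypothetical helper (not part of SCWS): mirrors the copy loop of the
 * original __PARSE_XATTR__ macro. Returns a zero-filled table that the
 * caller must free(), or NULL if there is nothing to parse or malloc
 * fails. If excluded is non-NULL, it is set to 1 when the list starts
 * with '~' and to 0 otherwise. */
static word_attr *parse_xattr_original(const char *xattr, int *excluded)
{
    const char *comma, *p;
    word_attr *at;
    size_t n = 2; /* one entry plus a zeroed terminator entry */
    int cnt;

    if (excluded != NULL) *excluded = 0;
    if (xattr == NULL) return NULL;
    if (*xattr == '~') {
        xattr++;
        if (excluded != NULL) *excluded = 1;
    }
    if (*xattr == '\0') return NULL;

    /* Assumption of this sketch: size the table as commas + 2. */
    for (p = xattr; *p != '\0'; p++)
        if (*p == ',') n++;
    at = calloc(n, sizeof(word_attr));
    if (at == NULL) return NULL;

    for (cnt = 0; (comma = strchr(xattr, ',')) != NULL; cnt++) {
        strncpy(at[cnt], xattr, 2); /* always 2 bytes, even for "a," */
        xattr = comma + 1;
    }
    strncpy(at[cnt], xattr, 2);
    return at;
}
```

Under these assumptions, parsing "~a,ad,b" stores the entries "a,", "ad" and "b": the one-character entry keeps its comma, so it would presumably never match a real part of speech such as "a". The sketch also suggests that a leading ~ is consumed only once, so in a list like "~Ag,~a" the second entry would be stored as "~a" rather than "a".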

I tried filtering with only two-character parts of speech and, sure enough, it worked. Let's think about how to fix this.

Reply content:

After I contacted the author, hightman provided a patch that modifies the macro definition:

    diff -c -r1.28 -r1.29
    *** libscws/scws.c 5 04:39:33 -0000 1.28
    --- libscws/scws.c 2011 Oct 08:41:44 -0000 1.29
    ***************
    *** 1278,1284 ****
          memset(at, 0, cnt); \
          cnt = 0; \
          for (cnt = 0; (word = strchr(xattr, ',')); cnt++) { \
    !         strncpy(at[cnt], xattr, 2); \
              xattr = word + 1; \
          } \
          strncpy(at[cnt], xattr, 2); \
    --- 1278,1285 ----
          memset(at, 0, cnt); \
          cnt = 0; \
          for (cnt = 0; (word = strchr(xattr, ',')); cnt++) { \
    !         at[cnt][0] = *xattr++; \
    !         at[cnt][1] = xattr == word ? '\0' : *xattr; \
              xattr = word + 1; \
          } \
          strncpy(at[cnt], xattr, 2); \
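The patched loop can be tried in isolation with a short C sketch like the one below. The function name parse_xattr_patched is hypothetical and not part of SCWS, and the table is sized by counting commas rather than with the macro's own formula. The sketch assumes every entry in the list is non-empty.

```c
#include <stdlib.h>
#include <string.h>

typedef char word_attr[4];

/* Hypothetical helper (not part of SCWS): mirrors the patched copy loop
 * from the diff. Returns a zero-filled table that the caller must
 * free(), or NULL if there is nothing to parse or malloc fails.
 * Assumes each comma-separated entry is non-empty. */
static word_attr *parse_xattr_patched(const char *xattr)
{
    const char *comma, *p;
    word_attr *at;
    size_t n = 2; /* one entry plus a zeroed terminator entry */
    int cnt;

    if (xattr == NULL) return NULL;
    if (*xattr == '~') xattr++; /* a leading '~' only selects the mode */
    if (*xattr == '\0') return NULL;

    /* Assumption of this sketch: size the table as commas + 2. */
    for (p = xattr; *p != '\0'; p++)
        if (*p == ',') n++;
    at = calloc(n, sizeof(word_attr));
    if (at == NULL) return NULL;

    for (cnt = 0; (comma = strchr(xattr, ',')) != NULL; cnt++) {
        at[cnt][0] = *xattr++; /* the first character is always copied */
        /* Copy a second character only if the comma has not been reached. */
        at[cnt][1] = (xattr == comma) ? '\0' : *xattr;
        xattr = comma + 1;
    }
    strncpy(at[cnt], xattr, 2); /* last entry ends at the string's NUL */
    return at;
}
```

With this version, parsing "~a,ad,b" stores "a", "ad" and "b", so one-character parts of speech no longer pick up the comma.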