SCWS is a very good Chinese word segmentation library, and its PHP extension makes word segmentation easy. I have run into a problem with one of its functions, scws_get_words.
This function returns the segmentation results, and its second parameter specifies which results to return. Here is the description from the C API documentation (the PHP version is similar):
scws_top_t scws_get_words(scws_t s, char *xattr);
Description: Returns a keyword table for the specified parts of speech; the system inserts words into the list in the order in which they occur. The parameter xattr describes which parts of speech to exclude from or include in the statistics, with multiple parts of speech separated by commas. If it starts with ~, the listed parts of speech are excluded from the results; otherwise only the listed parts of speech are included. Passing NULL means all parts of speech are counted.
Return value: Returns the head pointer of the word list, which must be released by calling scws_free_tops.
Error: None
In other words, I should only need to pass a comma-separated string as the second argument, such as '~Ag,~a,~ad,~b,~c,~Dg,~d,~e', to filter the results.
In practice, though, no matter how many filters I add, none of them takes effect. If I pass only a single filter such as '~a', with no comma, the corresponding results are filtered out correctly. So I suspect there is a bug here. The C implementation of this function is attached below; please take a look.
/* get words by attr (rand order) */
scws_top_t scws_get_words(scws_t s, char *xattr)
{
    int off, cnt, xmode = SCWS_NA;
    xtree_t xt;
    scws_res_t res, cur;
    scws_top_t top, tail, base;
    char *word;
    word_attr *at = NULL;

    if (!s || !s->txt || !(xt = xtree_new(0, 1)))
        return NULL;

    __PARSE_XATTR__;

    // save the offset.
    off = s->off;
    s->off = 0;
    base = tail = NULL;
    while ((cur = res = scws_get_result(s)) != NULL)
    {
        do
        {
            /* check attribute filter */
            if (at != NULL)
            {
                if ((xmode == SCWS_NA) && !_attr_belong(cur->attr, at))
                    continue;
                if ((xmode == SCWS_YEA) && _attr_belong(cur->attr, at))
                    continue;
            }

            /* put to the stats */
            if (!(top = xtree_nget(xt, s->txt + cur->off, cur->len, NULL)))
            {
                top = (scws_top_t) malloc(sizeof(struct scws_topword));
                top->weight = cur->idf;
                top->times = 1;
                top->next = NULL;
                top->word = (char *) _mem_ndup(s->txt + cur->off, cur->len);
                strncpy(top->attr, cur->attr, 2);
                // add to the chain
                if (tail == NULL)
                    base = tail = top;
                else
                {
                    tail->next = top;
                    tail = top;
                }
                xtree_nput(xt, top, sizeof(struct scws_topword), s->txt + cur->off, cur->len);
            }
            else
            {
                top->weight += cur->idf;
                top->times++;
            }
        }
        while ((cur = cur->next) != NULL);
        scws_free_result(res);
    }

    // free at & xtree
    if (at != NULL)
        free(at);
    xtree_free(xt);

    // restore the offset
    s->off = off;
    return base;
}
I found some problems with the __PARSE_XATTR__ macro. Here is the macro, together with the definition of the word_attr type:
/* macro to parse xattr -> xmode, at */
#define __PARSE_XATTR__ do { \
    if (xattr == NULL) break; \
    if (*xattr == '~') { xattr++; xmode = SCWS_YEA; } \
    if (*xattr == '\0') break; \
    cnt = ((strlen(xattr) / 2) + 2) * sizeof(word_attr); \
    at = (word_attr *) malloc(cnt); \
    memset(at, 0, cnt); \
    cnt = 0; \
    for (cnt = 0; (word = strchr(xattr, ',')); cnt++) { \
        strncpy(at[cnt], xattr, 2); \
        xattr = word + 1; \
    } \
    strncpy(at[cnt], xattr, 2); \
} while (0)

typedef char word_attr[4];
This way of parsing xattr only works when every part of speech is exactly two characters long, because each item is copied with strncpy(at[cnt], xattr, 2). That is too sloppy: the part-of-speech table contains plenty of single-character tags, and for those the copy picks up the trailing comma as well.
I tried a filter made up entirely of two-character parts of speech and, sure enough, that confirmed it. Now to think about how to fix this.
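To make the failure concrete, here is a minimal sketch that isolates the copy loop from the macro quoted above. The function name parse_xattr_naive and the max parameter are hypothetical, introduced only for this illustration; they are not part of SCWS. Reading the macro as quoted, only a single leading ~ is stripped before the loop runs, so the sketch takes the list that follows that prefix.

```c
#include <assert.h>
#include <string.h>

typedef char word_attr[4];

/* Hypothetical helper (not part of SCWS): mimics the copy loop of the
 * quoted macro, which copies exactly 2 bytes per comma-separated item.
 * Fills `at` (capacity `max` entries) and returns the number of entries. */
static int parse_xattr_naive(const char *xattr, word_attr *at, int max)
{
    const char *comma;
    int cnt = 0;

    assert(xattr != NULL && at != NULL && max > 0);
    memset(at, 0, (size_t) max * sizeof(word_attr));

    while (cnt < max - 1 && (comma = strchr(xattr, ',')) != NULL) {
        /* For a 1-character tag such as "a", this also copies the comma. */
        strncpy(at[cnt], xattr, 2);
        xattr = comma + 1;
        cnt++;
    }
    strncpy(at[cnt], xattr, 2); /* last item: no comma follows it */
    return cnt + 1;
}
```

Called with the list "a,ad,b", this stores "a," in the first slot, so a word tagged "a" cannot match it by string comparison, while the two-character "ad" and the final "b" are stored intact. That is consistent with a single filter working and a mixed list failing.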
Reply content:
I got in touch with the author, hightman, who provided a patch that changes the macro definition.
diff -c -r1.28 -r1.29
*** libscws/scws.c 5 04:39:33 -0000 1.28
--- libscws/scws.c 2011 Oct 08:41:44 -0000 1.29
***************
*** 1278,1284 ****
      memset(at, 0, cnt); \
      cnt = 0; \
      for (cnt = 0; (word = strchr(xattr, ',')); cnt++) { \
!         strncpy(at[cnt], xattr, 2); \
          xattr = word + 1; \
      } \
      strncpy(at[cnt], xattr, 2); \
--- 1278,1285 ----
      memset(at, 0, cnt); \
      cnt = 0; \
      for (cnt = 0; (word = strchr(xattr, ',')); cnt++) { \
!         at[cnt][0] = *xattr++; \
!         at[cnt][1] = xattr == word ? '\0' : *xattr; \
          xattr = word + 1; \
      } \
      strncpy(at[cnt], xattr, 2); \
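The effect of the patch can be checked in isolation with a small sketch that applies the same two assignments outside the macro. The function name parse_xattr_patched and the max parameter are hypothetical and exist only for this illustration. The sketch assumes a well-formed list with no empty items.

```c
#include <assert.h>
#include <string.h>

typedef char word_attr[4];

/* Hypothetical helper (not part of SCWS): same loop as the patched macro.
 * The first byte is always copied; the second byte is copied only when it
 * is not the comma that ends the item. Returns the number of entries. */
static int parse_xattr_patched(const char *xattr, word_attr *at, int max)
{
    const char *comma;
    int cnt = 0;

    assert(xattr != NULL && at != NULL && max > 0);
    memset(at, 0, (size_t) max * sizeof(word_attr));

    while (cnt < max - 1 && (comma = strchr(xattr, ',')) != NULL) {
        at[cnt][0] = *xattr++;
        at[cnt][1] = (xattr == comma) ? '\0' : *xattr;
        xattr = comma + 1;
        cnt++;
    }
    strncpy(at[cnt], xattr, 2); /* last item is unchanged by the patch */
    return cnt + 1;
}
```

With the list "a,ad,b", the entries come out as "a", "ad", and "b", so one-character and two-character tags can be mixed in the same filter.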