SCWS is an excellent Chinese word segmentation library written by a Chinese developer, and its PHP extension makes Chinese word segmentation easy. I have run into a problem with one of its functions, scws_get_words. This function returns the word segmentation results, and its second parameter specifies which results to return. Here is the description of the function from the C API documentation (the PHP version is similar):
· scws_top_t scws_get_words(scws_t s, char *xattr);
Description: returns the keyword table for the specified parts of speech; words are inserted into the list in the order in which they appear. The xattr parameter describes the parts of speech to exclude from or include in the statistics, with multiple parts of speech separated by commas. When it begins with ~, the results do not contain these parts of speech; otherwise the results contain only these parts of speech. NULL means all parts of speech are counted.
Return value: the head pointer of the linked list of the word table. The word table must be released by calling scws_free_tops.
Errors: none
That is, I only need to pass a comma-separated string as the second parameter. For example, I pass '~Ag,~a,~ad,~b,~c,~Dg,~d,~e' to indicate that I want to filter these parts of speech out of the results.
However, in practice nothing is filtered out, no matter how many filter conditions I add. By contrast, if I pass only one condition, such as '~a', with no comma, the corresponding results are filtered out correctly. So I suspected a bug. The C implementation of the function is below; let's take a look.
    // get words by attr (rand order)
    scws_top_t scws_get_words(scws_t s, char *xattr)
    {
        int off, cnt, xmode = SCWS_NA;
        xtree_t xt;
        scws_res_t res, cur;
        scws_top_t top, tail, base;
        char *word;
        word_attr *at = NULL;

        if (!s || !s->txt || !(xt = xtree_new(0,1)))
            return NULL;

        __PARSE_XATTR__;

        // save the offset.
        off = s->off;
        s->off = 0;
        base = tail = NULL;
        while ((cur = res = scws_get_result(s)) != NULL)
        {
            do
            {
                /* check attribute filter */
                if (at != NULL)
                {
                    if ((xmode == SCWS_NA) && !_attr_belong(cur->attr, at))
                        continue;
                    if ((xmode == SCWS_YEA) && _attr_belong(cur->attr, at))
                        continue;
                }

                /* put to the stats */
                if (!(top = xtree_nget(xt, s->txt + cur->off, cur->len, NULL)))
                {
                    top = (scws_top_t) malloc(sizeof(struct scws_topword));
                    top->weight = cur->idf;
                    top->times = 1;
                    top->next = NULL;
                    top->word = (char *)_mem_ndup(s->txt + cur->off, cur->len);
                    strncpy(top->attr, cur->attr, 2);
                    // add to the chain
                    if (tail == NULL)
                        base = tail = top;
                    else
                    {
                        tail->next = top;
                        tail = top;
                    }
                    xtree_nput(xt, top, sizeof(struct scws_topword), s->txt + cur->off, cur->len);
                }
                else
                {
                    top->weight += cur->idf;
                    top->times++;
                }
            }
            while ((cur = cur->next) != NULL);
            scws_free_result(res);
        }

        // free at & xtree
        if (at != NULL)
            free(at);
        xtree_free(xt);

        // restore the offset
        s->off = off;
        return base;
    }
I found that its __PARSE_XATTR__ macro has a problem. Here is the macro, together with the definition of the word_attr type:
    /* macro to parse xattr -> xmode, at */
    #define __PARSE_XATTR__     do { \
        if (xattr == NULL) break; \
        if (*xattr == '~') { xattr++; xmode = SCWS_YEA; } \
        if (*xattr == '\0') break; \
        cnt = ((strlen(xattr)/2) + 2) * sizeof(word_attr); \
        at = (word_attr *) malloc(cnt); \
        memset(at, 0, cnt); \
        cnt = 0; \
        for (cnt = 0; (word = strchr(xattr, ',')); cnt++) { \
            strncpy(at[cnt], xattr, 2); \
            xattr = word + 1; \
        } \
        strncpy(at[cnt], xattr, 2); \
    } while (0)

    typedef char word_attr[4];
So a multi-item xattr is parsed correctly only when every part of speech before a comma is exactly 2 characters long, because each entry is filled with strncpy(at[cnt], xattr, 2);. This is too sloppy: the part-of-speech table contains plenty of 1-character tags, and for those the copy picks up the comma as well.
I tried filtering only on parts of speech with 2 characters, and that works fine... Let's think about how to change this.
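To make the problem concrete, here is a minimal standalone sketch that reproduces only the parsing loop of the macro, outside of SCWS. The function name parse_xattr_original and the count output parameter are my own additions for illustration; the allocation formula and the two strncpy calls follow the macro quoted above. It assumes the leading ~ has already been stripped.

```c
#include <stdlib.h>
#include <string.h>

typedef char word_attr[4];

/* Hypothetical helper (not part of SCWS): mirrors the loop in the original
 * __PARSE_XATTR__ macro, where every entry is filled with strncpy(..., 2)
 * regardless of how long the tag really is.
 * Returns a zero-filled array that the caller must free, and stores the
 * number of parsed entries in *count. */
static word_attr *parse_xattr_original(const char *xattr, int *count)
{
    size_t bytes = ((strlen(xattr) / 2) + 2) * sizeof(word_attr);
    word_attr *at = malloc(bytes);
    const char *comma;
    int cnt;

    if (at == NULL)
        return NULL;
    memset(at, 0, bytes);

    for (cnt = 0; (comma = strchr(xattr, ',')) != NULL; cnt++) {
        strncpy(at[cnt], xattr, 2);   /* a 1-char tag drags its comma along */
        xattr = comma + 1;
    }
    strncpy(at[cnt], xattr, 2);       /* last tag: strncpy stops at the NUL */

    *count = cnt + 1;
    return at;
}
```

Calling parse_xattr_original("Ag,a,ad,b", &n) stores the entries "Ag", "a,", "ad" and "b". The second entry carries the comma, so it presumably never compares equal to the real tag "a" (I am assuming here that _attr_belong, which is not shown, compares the stored characters against the word's attr). The last entry is fine because there is no comma after it, which matches the observation that a single condition works.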
Reply content:
After I discussed this with the author, hightman, he provided a patch that modifies the macro definition:
    diff -c -r1.28 -r1.29
    *** libscws/scws.c  5 Aug 2011 04:39:33 -0000   1.28
    --- libscws/scws.c  26 Oct 2011 08:41:44 -0000  1.29
    ***************
    *** 1278,1284 ****
          memset(at, 0, cnt); \
          cnt = 0; \
          for (cnt = 0; (word = strchr(xattr, ',')); cnt++) { \
    !         strncpy(at[cnt], xattr, 2); \
              xattr = word + 1; \
          } \
          strncpy(at[cnt], xattr, 2); \
    --- 1278,1285 ----
          memset(at, 0, cnt); \
          cnt = 0; \
          for (cnt = 0; (word = strchr(xattr, ',')); cnt++) { \
    !         at[cnt][0] = *xattr++; \
    !         at[cnt][1] = xattr == word ? '\0' : *xattr; \
              xattr = word + 1; \
          } \
          strncpy(at[cnt], xattr, 2); \
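To check what the patched loop stores, here is a small standalone sketch of the same logic outside of SCWS. The function name parse_xattr_patched and the count output parameter are hypothetical and only for illustration; the two assignments inside the loop are taken from the patch. As in the macro, it assumes the single leading ~ has already been stripped.

```c
#include <stdlib.h>
#include <string.h>

typedef char word_attr[4];

/* Hypothetical helper (not part of SCWS): same parsing as the patched
 * __PARSE_XATTR__ loop. The first character of each tag is always copied;
 * the second is copied only if the tag does not end right there.
 * Returns a zero-filled array that the caller must free, and stores the
 * number of parsed entries in *count. */
static word_attr *parse_xattr_patched(const char *xattr, int *count)
{
    size_t bytes = ((strlen(xattr) / 2) + 2) * sizeof(word_attr);
    word_attr *at = malloc(bytes);
    const char *comma;
    int cnt;

    if (at == NULL)
        return NULL;
    memset(at, 0, bytes);

    for (cnt = 0; (comma = strchr(xattr, ',')) != NULL; cnt++) {
        at[cnt][0] = *xattr++;
        /* if we are already at the comma, the tag had one character */
        at[cnt][1] = (xattr == comma) ? '\0' : *xattr;
        xattr = comma + 1;
    }
    strncpy(at[cnt], xattr, 2);       /* last tag: strncpy stops at the NUL */

    *count = cnt + 1;
    return at;
}
```

With this version, parse_xattr_patched("Ag,a,ad,b", &n) stores "Ag", "a", "ad" and "b", so 1-character tags no longer pick up the comma. One thing to keep in mind, judging only from the macro as quoted: it strips a single ~ at the very start of the string, so a string that repeats ~ before every tag (for example "Ag,~a") would still store "~a" as the second entry. Writing the ~ once, as in '~Ag,a,ad,b', avoids that.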