Tags: strings, algorithms
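Tries, suffix trees, and suffix automata show up in application scenarios such as:
Showing a candidate list in an AJAX search box that responds to user input in real time.
Counting keyword occurrences in a search engine.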
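Both scenarios can be illustrated with a plain trie. The following is a minimal sketch, not taken from the original post: the names `trie_insert` and `trie_count_prefix`, the fixed `MAX_NODES` capacity, and the lowercase a-z alphabet are all assumptions made for illustration.

```c
#include <assert.h>

#define MAX_NODES 1024            /* assumed capacity for this sketch */

static int child[MAX_NODES][26];  /* child[n][c]: node index, 0 means no edge */
static int hits[MAX_NODES];       /* how many times the word ending here was inserted */
static int node_count = 1;        /* node 0 is the root */

/* Insert a lowercase word; inserting the same word again raises its count. */
void trie_insert(const char *word)
{
    int n = 0;
    for (; *word; ++word) {
        int c = *word - 'a';
        assert(c >= 0 && c < 26);
        if (!child[n][c]) {
            assert(node_count < MAX_NODES);
            child[n][c] = node_count++;
        }
        n = child[n][c];
    }
    ++hits[n];
}

/* Sum the counts of all words stored at or below node n. */
static int sum_below(int n)
{
    int c, total = hits[n];
    for (c = 0; c < 26; ++c)
        if (child[n][c])
            total += sum_below(child[n][c]);
    return total;
}

/* Total number of inserted words that start with prefix. */
int trie_count_prefix(const char *prefix)
{
    int n = 0;
    for (; *prefix; ++prefix) {
        int c = *prefix - 'a';
        if (c < 0 || c >= 26 || !child[n][c])
            return 0;
        n = child[n][c];
    }
    return sum_below(n);
}
```

A search box would walk to the prefix node in the same way and list the words below it instead of summing their counts.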
Suffix Tree: each path from the root to a leaf represents one suffix.
From this simple description alone, we can conceptually solve the following problems:
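P: Determine whether string o occurs in string S.
A: If o occurs in S, then o must be a prefix of some suffix of S. Build the suffix tree of S and search for o the same way you would search for a string in a trie.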
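The idea can be tried out with an uncompressed suffix trie, which is far simpler to build than a real suffix tree but takes O(|S|^2) time and space. This is only a sketch under assumptions: the function names, the `MAX_LEN` limit, and the lowercase a-z alphabet are made up for illustration.

```c
#include <assert.h>
#include <string.h>

#define MAX_LEN   64                                  /* assumed limit on |S| */
#define MAX_NODES (MAX_LEN * (MAX_LEN + 1) / 2 + 1)   /* worst case for a suffix trie */

static int child[MAX_NODES][26];  /* 0 means no edge; node 0 is the root */
static int node_count = 1;

/* Insert every suffix of s (lowercase a-z) into the trie. */
void build_suffix_trie(const char *s)
{
    size_t i, len = strlen(s);

    assert(len <= MAX_LEN);
    memset(child, 0, sizeof child);
    node_count = 1;
    for (i = 0; i < len; ++i) {
        int n = 0;
        const char *p;
        for (p = s + i; *p; ++p) {
            int c = *p - 'a';
            assert(c >= 0 && c < 26);
            if (!child[n][c])
                child[n][c] = node_count++;
            n = child[n][c];
        }
    }
}

/* Return 1 if o is a prefix of some suffix of the indexed string, else 0. */
int contains(const char *o)
{
    int n = 0;
    for (; *o; ++o) {
        int c = *o - 'a';
        if (c < 0 || c >= 26 || !child[n][c])
            return 0;
        n = child[n][c];
    }
    return 1;
}
```

After the index is built, each query costs time proportional to the length of o, independent of the length of S.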
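P: Count how many times a given string T occurs in string S.
A: If T occurs twice in S, then S has two suffixes with T as a prefix. The number of leaves below the node for T is the number of occurrences.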
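As a rough sketch of the counting argument, the uncompressed suffix trie below records for each node how many suffixes pass through it, which matches the number of leaves below the corresponding point in a suffix tree. The names `build_counting_trie`, `occurrences`, and `passes`, as well as the size limit and alphabet, are assumptions for this example only.

```c
#include <assert.h>
#include <string.h>

#define MAX_LEN   64                                  /* assumed limit on |S| */
#define MAX_NODES (MAX_LEN * (MAX_LEN + 1) / 2 + 1)

static int child[MAX_NODES][26];  /* 0 means no edge; node 0 is the root */
static int passes[MAX_NODES];     /* number of suffixes whose path goes through the node */
static int node_count = 1;

/* Insert every suffix of s (lowercase a-z), counting visits per node. */
void build_counting_trie(const char *s)
{
    size_t i, len = strlen(s);

    assert(len <= MAX_LEN);
    memset(child, 0, sizeof child);
    memset(passes, 0, sizeof passes);
    node_count = 1;
    for (i = 0; i < len; ++i) {
        int n = 0;
        const char *p;
        for (p = s + i; *p; ++p) {
            int c = *p - 'a';
            assert(c >= 0 && c < 26);
            if (!child[n][c])
                child[n][c] = node_count++;
            n = child[n][c];
            ++passes[n];
        }
    }
}

/* Number of (possibly overlapping) occurrences of non-empty t in the indexed string. */
int occurrences(const char *t)
{
    int n = 0;
    if (!*t)
        return 0;   /* the empty pattern is not counted in this sketch */
    for (; *t; ++t) {
        int c = *t - 'a';
        if (c < 0 || c >= 26 || !child[n][c])
            return 0;
        n = child[n][c];
    }
    return passes[n];
}
```

For S = "banana", the suffixes "anana" and "ana" both start with "ana", so the count for "ana" is 2 even though the two occurrences overlap.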
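P: Find the longest repeated substring of S.
A: As above: find the deepest non-leaf node T, where depth is measured as string depth (the length of the path label from the root).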
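A small sketch of this search, again on an uncompressed suffix trie rather than a real suffix tree: a node reached by at least two suffixes plays the role of a non-leaf node, and the deepest such node gives the answer. The function name `longest_repeat`, the size limit, and the a-z alphabet are assumptions.

```c
#include <assert.h>
#include <string.h>

#define MAX_LEN   64                                  /* assumed limit on |S| */
#define MAX_NODES (MAX_LEN * (MAX_LEN + 1) / 2 + 1)

static int child[MAX_NODES][26];  /* 0 means no edge; node 0 is the root */
static int passes[MAX_NODES];     /* number of suffixes passing through the node */

/*
 * Returns the length of the longest substring that occurs at least twice
 * in s (occurrences may overlap) and stores one start offset in *start.
 */
size_t longest_repeat(const char *s, size_t *start)
{
    size_t i, len = strlen(s), best = 0;
    int node_count = 1;

    assert(len <= MAX_LEN);
    memset(child, 0, sizeof child);
    memset(passes, 0, sizeof passes);
    *start = 0;
    for (i = 0; i < len; ++i) {
        int n = 0;
        size_t depth;
        for (depth = 1; i + depth <= len; ++depth) {
            int c = s[i + depth - 1] - 'a';
            assert(c >= 0 && c < 26);
            if (!child[n][c])
                child[n][c] = node_count++;
            n = child[n][c];
            /* a node visited by two suffixes spells a repeated substring */
            if (++passes[n] >= 2 && depth > best) {
                best = depth;
                *start = i;
            }
        }
    }
    return best;
}
```

For "banana" this yields length 3 ("ana"); a string with no repeated character yields 0.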
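P: Find the longest common substring of two strings S1 and S2.
A: A generalized suffix tree stores all suffixes of _multiple_ strings. Insert the two strings S1# and S2$, each with its own terminator, into a generalized suffix tree, then proceed as above.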
(A longest substring common to s1 and s2 will be the path-label of an internal node with the
greatest string depth in the suffix tree which has leaves labelled with suffixes from both the
strings.)
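The same idea can be sketched with a generalized suffix trie (uncompressed, so quadratic in size). Instead of the terminators # and $, this sketch tags each node with the strings whose suffixes reach it, and the deepest node tagged by both strings gives the answer. The names `longest_common` and `owner`, the length limit, and the a-z alphabet are assumptions for illustration.

```c
#include <assert.h>
#include <string.h>

#define MAX_LEN   48                              /* assumed limit per string */
#define MAX_NODES (MAX_LEN * (MAX_LEN + 1) + 1)   /* suffixes of two strings, worst case */

static int child[MAX_NODES][26];  /* 0 means no edge; node 0 is the root */
static int owner[MAX_NODES];      /* bit 1: reached by a suffix of s1, bit 2: of s2 */
static int node_count;
static size_t best_len, best_start;

/* Insert all suffixes of s, tagging every visited node with bit. */
static void insert_suffixes(const char *s, int bit)
{
    size_t i, depth, len = strlen(s);

    assert(len <= MAX_LEN);
    for (i = 0; i < len; ++i) {
        int n = 0;
        for (depth = 1; i + depth <= len; ++depth) {
            int c = s[i + depth - 1] - 'a';
            assert(c >= 0 && c < 26);
            if (!child[n][c])
                child[n][c] = node_count++;
            n = child[n][c];
            owner[n] |= bit;
            /* a node tagged by both strings spells a common substring */
            if (owner[n] == 3 && depth > best_len) {
                best_len = depth;
                best_start = i;
            }
        }
    }
}

/* Returns the longest common substring length; *start2 is its offset in s2. */
size_t longest_common(const char *s1, const char *s2, size_t *start2)
{
    memset(child, 0, sizeof child);
    memset(owner, 0, sizeof owner);
    node_count = 1;
    best_len = best_start = 0;
    insert_suffixes(s1, 1);
    insert_suffixes(s2, 2);
    *start2 = best_start;
    return best_len;
}
```

Because s1 is inserted first, a node can only carry both tags while s2 is being inserted, so the recorded offset always refers to s2.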
Suffix Automaton: an auxiliary index structure that recognizes all substrings of a text.
The code below is a direct translation of Algorithm A in [1]:
/* Directed Acyclic Word Graph (DAWG) */
#include <stdlib.h>
#include <string.h>

typedef struct State {
    struct State *first[26];   /* primary edges */
    struct State *second[26];  /* secondary edges */
    struct State *suffix;      /* suffix pointer */
} State;

State *sink, *source;

State *new_state(void)
{
    /* calloc zero-fills the state; stop on allocation failure */
    State *s = calloc(1, sizeof *s);
    if (!s)
        abort();
    return s;
}

/*
 * state:
 *   parent    -- [x] with xa = tail(wa)
 *   child     -- [tail(wa)]
 *   new child -- [tail(wa)]_{wa}
 */
State *split(State *parent, int a)
{
    int i;
    /* current state, child, new child */
    State *cs, *c = parent->second[a], *nc = new_state();    /* S1 */

    parent->first[a] = nc;                  /* S2: the secondary edge to child */
    parent->second[a] = NULL;               /*     becomes primary, to new child */
    for (i = 0; i < 26; ++i)                /* S3: copy primary and secondary */
        nc->second[i] = c->first[i] ? c->first[i] : c->second[i];
    nc->suffix = c->suffix;                 /* S4 */
    c->suffix = nc;                         /* S5 */
    for (cs = parent; cs != source; ) {     /* S6, S7 */
        cs = cs->suffix;                    /* S7.a */
        if (cs->second[a] == c)
            cs->second[a] = nc;             /* S7.b */
        else
            break;                          /* S7.c */
    }
    return nc;                              /* S8 */
}

/* state: new sink -- [wa] */
void update(int a)
{
    /* suffix state, current state, new sink */
    State *ss = NULL, *cs = sink, *ns = new_state();         /* U1, U2 */

    sink->first[a] = ns;
    while (cs != source && ss == NULL) {    /* U3 */
        cs = cs->suffix;                    /* U3.a */
        if (!cs->first[a] && !cs->second[a])
            cs->second[a] = ns;             /* U3.b.1 */
        else if (cs->first[a])
            ss = cs->first[a];              /* U3.b.2 */
        else
            ss = split(cs, a);              /* U3.b.3 */
    }
    if (ss == NULL)
        ss = source;                        /* U4 */
    ns->suffix = ss;                        /* U5 */
    sink = ns;
}

/* Builds the DAWG of w (lowercase a-z only) and returns its source state. */
State *build_dawg(const char *w)
{
    sink = source = new_state();
    for (; *w; ++w)
        update(*w - 'a');
    return source;
}
I am still working on understanding it, and the code has not been tested.
[1] The smallest automaton recognizing the subwords of a text
https://cbse.soe.ucsc.edu/sites/default/files/smallest_automaton1985.pdf