String Suffix Automaton: Directed Acyclic Word Graph


Tags: string, algorithm

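Tries, suffix trees, and suffix automata have application scenarios such as the following:

Showing a candidate list in an AJAX search box that responds to user input in real time.
Counting keyword occurrences in a search engine.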


Suffix tree: each path from the root to a leaf represents one suffix.

From this simple description alone, we can conceptually solve the following problems:

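P: Determine whether string o occurs in string S.
A: If o occurs in S, then o must be a prefix of some suffix of S. Build the suffix tree of S and search for o the same way you would search for a string in a trie.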
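
The lookup above can be sketched with an uncompressed suffix trie, which is simpler to build than a real suffix tree but can need O(n^2) nodes. This is only an illustration under assumptions: the names TrieNode, suffix_trie_build, suffix_trie_contains, and suffix_trie_free are hypothetical, and the alphabet is assumed to be 'a'..'z'.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical node type: an uncompressed suffix trie over 'a'..'z'. */
typedef struct TrieNode {
    struct TrieNode *next[26];
} TrieNode;

static TrieNode *trie_new(void)
{
    TrieNode *n = calloc(1, sizeof *n);
    assert(n != NULL);              /* sketch: give up on allocation failure */
    return n;
}

/* Insert every suffix of s into the trie. */
TrieNode *suffix_trie_build(const char *s)
{
    TrieNode *root = trie_new();
    size_t n = strlen(s);
    for (size_t i = 0; i < n; ++i) {
        TrieNode *cur = root;
        for (size_t j = i; j < n; ++j) {
            int c = s[j] - 'a';
            assert(c >= 0 && c < 26);   /* assumed alphabet */
            if (!cur->next[c])
                cur->next[c] = trie_new();
            cur = cur->next[c];
        }
    }
    return root;
}

/* o occurs in s iff o is a prefix of some suffix, i.e. the walk never fails. */
int suffix_trie_contains(const TrieNode *root, const char *o)
{
    for (; *o; ++o) {
        int c = *o - 'a';
        if (c < 0 || c >= 26 || !root->next[c])
            return 0;
        root = root->next[c];
    }
    return 1;
}

void suffix_trie_free(TrieNode *n)
{
    if (!n)
        return;
    for (int c = 0; c < 26; ++c)
        suffix_trie_free(n->next[c]);
    free(n);
}
```

A query costs time proportional to the length of o, independent of the length of S, which is the property the suffix tree provides with far fewer nodes.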

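P: Count how many times a given string T occurs in string S.
A: If T occurs twice in S, then S has two suffixes that have T as a prefix. The number of leaf nodes below the node for T is therefore the number of occurrences.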
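
The equivalence used in this answer (occurrences of T = suffixes of S that start with T) can be checked with a brute-force sketch. It deliberately does not build a suffix tree, so it runs in O(|S|·|T|) time; the function name count_occurrences is hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Count the suffixes of s that have t as a prefix. Overlapping
   occurrences are counted separately, as leaves in a suffix tree would be. */
size_t count_occurrences(const char *s, const char *t)
{
    assert(s != NULL && t != NULL);
    size_t n = strlen(s), m = strlen(t), count = 0;
    if (m == 0 || m > n)
        return 0;
    for (size_t i = 0; i + m <= n; ++i) {
        if (strncmp(s + i, t, m) == 0)   /* suffix s[i..] starts with t */
            ++count;
    }
    return count;
}
```

With a suffix tree the same number is read off as the leaf count of one subtree, after a walk of only |T| steps.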

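P: Find the longest repeated substring of string S.
A: As above, find the deepest non-leaf node T (deepest by the length of its path label). A non-leaf node has at least two leaves below it, so its path label occurs at least twice.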
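
As a rough sketch of the same idea without a tree: if the suffixes are sorted, the deepest non-leaf node corresponds to the longest common prefix of two neighbouring suffixes. The code below swaps in this sorted-suffix technique (a naive suffix array built with qsort), and the function names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

static int cmp_suffix(const void *a, const void *b)
{
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

static size_t common_prefix_len(const char *a, const char *b)
{
    size_t k = 0;
    while (a[k] != '\0' && a[k] == b[k])
        ++k;
    return k;
}

/* Returns the length of the longest repeated substring of s and stores
   a pointer to one of its occurrences (inside s) in *start. */
size_t longest_repeated_substring(const char *s, const char **start)
{
    size_t n = strlen(s), best = 0;
    *start = s;
    if (n < 2)
        return 0;

    const char **suf = malloc(n * sizeof *suf);
    assert(suf != NULL);
    for (size_t i = 0; i < n; ++i)
        suf[i] = s + i;                 /* one pointer per suffix */
    qsort(suf, n, sizeof *suf, cmp_suffix);

    for (size_t i = 1; i < n; ++i) {    /* neighbours share the longest prefixes */
        size_t k = common_prefix_len(suf[i - 1], suf[i]);
        if (k > best) {
            best = k;
            *start = suf[i];
        }
    }
    free(suf);
    return best;
}
```

Sorting by strcmp is quadratic in the worst case; a suffix tree answers the same question in linear time.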

P: Find the longest common substring of two strings S1 and S2.
A: A generalized suffix tree stores all suffixes of _multiple_ strings. Insert the two strings S1# and S2$ into a generalized suffix tree, then proceed as above.
(A longest substring common to S1 and S2 is the path label of the internal node with the greatest string depth whose subtree has leaves labelled with suffixes from both strings.)
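
The "leaves from both strings" condition can be sketched on sorted suffixes of the concatenation S1#S2$: take the longest common prefix over neighbouring suffixes that start in different strings. This is a naive stand-in for the generalized suffix tree, assuming neither input contains '#' or '$'; all function names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

static int cmp_suffix_ptr(const void *a, const void *b)
{
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

static size_t shared_prefix_len(const char *a, const char *b)
{
    size_t k = 0;
    while (a[k] != '\0' && a[k] == b[k])
        ++k;
    return k;
}

/* Returns the length of the longest common substring of s1 and s2 and
   stores the offset of one occurrence within s1 in *pos_in_s1. */
size_t longest_common_substring(const char *s1, const char *s2,
                                size_t *pos_in_s1)
{
    size_t n1 = strlen(s1), n2 = strlen(s2), n = n1 + n2 + 2, best = 0;
    char *buf = malloc(n + 1);
    const char **suf = malloc(n * sizeof *suf);
    assert(buf != NULL && suf != NULL);
    assert(!strpbrk(s1, "#$") && !strpbrk(s2, "#$"));  /* assumed separators */

    memcpy(buf, s1, n1);                /* buf = s1 '#' s2 '$' */
    buf[n1] = '#';
    memcpy(buf + n1 + 1, s2, n2);
    buf[n - 1] = '$';
    buf[n] = '\0';
    for (size_t i = 0; i < n; ++i)
        suf[i] = buf + i;
    qsort(suf, n, sizeof *suf, cmp_suffix_ptr);

    *pos_in_s1 = 0;
    for (size_t i = 1; i < n; ++i) {
        size_t a = (size_t)(suf[i - 1] - buf), b = (size_t)(suf[i] - buf);
        int a_from_s1 = a <= n1, b_from_s1 = b <= n1;
        if (a_from_s1 == b_from_s1)
            continue;                   /* both suffixes from the same string */
        size_t k = shared_prefix_len(suf[i - 1], suf[i]);
        if (k > best) {
            best = k;
            *pos_in_s1 = a_from_s1 ? a : b;
        }
    }
    free(suf);
    free(buf);
    return best;
}
```

Because '#' occurs only in the S1 part, a shared prefix between suffixes of different origin can never run across the separator.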

Suffix automaton: an auxiliary index structure that recognizes all substrings of a text.
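
To make "recognizes all substrings" concrete, here is a minimal sketch of the commonly used online suffix automaton construction with len and link fields. It is not the formulation from [1], and all names (Sam, SamState, sam_build, sam_contains) are hypothetical; input is assumed to be 'a'..'z'.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

enum { ALPHA = 26 };

typedef struct {
    int len;            /* length of the longest string reaching this state */
    int link;           /* suffix link, -1 for the initial state */
    int next[ALPHA];    /* transitions, -1 when absent */
} SamState;

typedef struct {
    SamState *st;
    int size, last;
} Sam;

static int sam_new_state(Sam *sa, int len, int link, const int *next)
{
    SamState *s = &sa->st[sa->size];
    s->len = len;
    s->link = link;
    for (int c = 0; c < ALPHA; ++c)
        s->next[c] = next ? next[c] : -1;
    return sa->size++;
}

static void sam_extend(Sam *sa, int c)
{
    int cur = sam_new_state(sa, sa->st[sa->last].len + 1, -1, NULL);
    int p = sa->last;
    while (p != -1 && sa->st[p].next[c] == -1) {
        sa->st[p].next[c] = cur;
        p = sa->st[p].link;
    }
    if (p == -1) {
        sa->st[cur].link = 0;
    } else {
        int q = sa->st[p].next[c];
        if (sa->st[p].len + 1 == sa->st[q].len) {
            sa->st[cur].link = q;
        } else {
            /* q mixes strings of different lengths: split off a clone */
            int clone = sam_new_state(sa, sa->st[p].len + 1,
                                      sa->st[q].link, sa->st[q].next);
            while (p != -1 && sa->st[p].next[c] == q) {
                sa->st[p].next[c] = clone;
                p = sa->st[p].link;
            }
            sa->st[q].link = clone;
            sa->st[cur].link = clone;
        }
    }
    sa->last = cur;
}

Sam sam_build(const char *w)
{
    Sam sa;
    size_t n = strlen(w);
    sa.st = malloc((2 * n + 2) * sizeof *sa.st);   /* room for all states */
    assert(sa.st != NULL);
    sa.size = 0;
    sa.last = sam_new_state(&sa, 0, -1, NULL);     /* initial state, index 0 */
    for (; *w; ++w) {
        assert(*w >= 'a' && *w <= 'z');            /* assumed alphabet */
        sam_extend(&sa, *w - 'a');
    }
    return sa;
}

/* o is a substring of w iff the automaton can read o from the initial state. */
int sam_contains(const Sam *sa, const char *o)
{
    int s = 0;
    for (; *o; ++o) {
        int c = *o - 'a';
        if (c < 0 || c >= ALPHA || sa->st[s].next[c] == -1)
            return 0;
        s = sa->st[s].next[c];
    }
    return 1;
}
```

The caller releases the automaton with free(sa.st). Unlike the suffix trie, the number of states stays linear in the length of the text.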


The code below is a direct translation of Algorithm A in [1]:

/* Directed Acyclic Word Graph (DAWG), after Algorithm A in [1].
   The input is assumed to consist of the letters 'a'..'z' only. */
#include <stdlib.h>
#include <string.h>

typedef struct State {
    struct State *first[26];    /* primary edges */
    struct State *second[26];   /* secondary edges */
    struct State *suffix;       /* suffix pointer */
} State;

State *sink, *source;

State *new_state(void)
{
    /* calloc zero-fills, so every edge and the suffix pointer start as NULL */
    State *s = calloc(1, sizeof *s);
    if (!s)
        abort();                /* out of memory */
    return s;
}

/* state:
   parent    -- [x] with xa = tail(wa)
   child     -- [tail(wa)]
   new child -- [tail(wa)]_{wa} */
State *split(State *parent, int a)
{
    int i;
    /* current state, child, new child */
    State *cs, *c = parent->second[a], *nc = new_state();   /* S1 */

    parent->first[a] = nc;              /* S2: secondary edge becomes primary */
    parent->second[a] = NULL;
    for (i = 0; i < 26; ++i) {          /* S3: copy primary and secondary edges */
        nc->second[i] = c->first[i] ? c->first[i] : c->second[i];
    }
    nc->suffix = c->suffix;             /* S4 */
    c->suffix = nc;                     /* S5 */
    for (cs = parent; cs != source; ) { /* S6, S7 */
        cs = cs->suffix;                /* S7.a */
        if (cs->second[a] == c)
            cs->second[a] = nc;         /* S7.b */
        else
            break;                      /* S7.c */
    }
    return nc;                          /* S8 */
}

/* state: new sink -- [wa] */
void update(int a)
{
    /* suffix state, current state, new sink */
    State *ss = NULL, *cs = sink, *ns = new_state();    /* U1, U2 */

    sink->first[a] = ns;
    while (cs != source && ss == NULL) {    /* U3 */
        cs = cs->suffix;                    /* U3.a */
        if (!cs->first[a] && !cs->second[a]) {
            cs->second[a] = ns;             /* U3.b.1 */
        } else if (cs->first[a]) {
            ss = cs->first[a];              /* U3.b.2 */
        } else {
            ss = split(cs, a);              /* U3.b.3 */
        }
    }
    if (ss == NULL)
        ss = source;                        /* U4 */
    ns->suffix = ss;                        /* U5 */
    sink = ns;
}

/* Returns 0 on success, -1 if w contains a character outside 'a'..'z'. */
int build_dawg(const char *w)
{
    sink = source = new_state();
    for (; *w; ++w) {
        if (*w < 'a' || *w > 'z')
            return -1;
        update(*w - 'a');
    }
    return 0;
}


I am still working on understanding the algorithm, and this code has not been tested.


[1] A. Blumer, J. Blumer, D. Haussler, A. Ehrenfeucht, M. T. Chen, J. Seiferas. The smallest automaton recognizing the subwords of a text. 1985.

 https://cbse.soe.ucsc.edu/sites/default/files/smallest_automaton1985.pdf

