Ngram Model Chinese Corpus Experiment Step by Step (2): Ngram Model Data Structure Representation and Creation

An n-gram model is essentially a Trie tree structure.

State transitions happen level by level. Sunpinyin stores each level as a vector kept in sorted order and searches within a level by binary search. Sunpinyin also builds the ngram model this way: the input is the sequence of <ngram tuple, count> pairs sorted lexicographically by word id.
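As a rough sketch of preparing that input (the record type NGramCount, the helper sortNGramCounts and the typedefs are names I made up for illustration, not sunpinyin code), the <ngram tuple, count> records can be sorted lexicographically by word id before being fed to the builder:

    #include <algorithm>
    #include <vector>

    typedef unsigned int WordID;     // assumed typedefs, not from the original
    typedef unsigned int FREQ_TYPE;

    // hypothetical record: one n-gram tuple plus its corpus count
    struct NGramCount {
        std::vector<WordID> ids;     // the n word ids of the tuple
        FREQ_TYPE freq;              // how many times the tuple occurred
    };

    // sort lexicographically by word id, the order the builder below expects
    void sortNGramCounts(std::vector<NGramCount>& counts) {
        std::sort(counts.begin(), counts.end(),
                  [](const NGramCount& a, const NGramCount& b) {
                      return a.ids < b.ids;   // std::vector compares lexicographically
                  });
    }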

Sequential storage plus binary search should be the most space-efficient, though lookup speed inevitably suffers. Other trie implementations use a map (a hash_map costs somewhat more space); sunpinyin also has a map-based trie implementation, and SRILM does something similar with its own LHash. A double-array trie also fits this scenario, where the input is pre-sorted and no dynamic insertion or deletion is needed, but its space usage is larger. TODO: compare these later. For now I also store the levels sequentially, following sunpinyin.

Core Data Structure Representation
    struct Leaf {             // element of the bottom ngram level
        WordID id;            // word id
        union {
            FREQ_TYPE freq;   // raw count, needed only while building the model
            PR_TYPE pr;       // in the final model only the probability value is kept
        };
    };

    struct Node : Leaf {      // element of a non-bottom ngram level
        int child;            // index of the first child at the next level
        PR_TYPE bow;          // back-off weight; the bottom-layer leaves do not need it
    };

    typedef std::vector<Node> NodeLevel;
    typedef std::vector<Leaf> LeafLevel;

    /** Core data structures */
    vector<NodeLevel> m_nodeLevel;   // non-bottom levels: 0 = root, then 1-gram .. (n-1)-gram
    LeafLevel m_leafLevel;           // the bottom ngram level
To save memory, sunpinyin's representation distinguishes the bottom-level leaves from the upper-level nodes. The downside is that the code gets a little annoying, because the two cases have to be handled separately.
For example, in a trigram model, m_nodeLevel[0] holds the root, m_nodeLevel[1] and m_nodeLevel[2] hold the first two levels (unigrams and bigrams), and m_leafLevel holds the bottom level (the trigrams).
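For intuition, here is a hand-worked toy layout (my own illustration, not output from the actual builder) for a trigram model that has seen the tuples <1,2,3> and <1,2,4> once each:

    // m_nodeLevel[0] (root):    { id=0, child=0, freq=2 }   // total unigram count
    // m_nodeLevel[1] (unigram): { id=1, child=0, freq=2 }   // child -> index 0 of m_nodeLevel[2]
    // m_nodeLevel[2] (bigram):  { id=2, child=0, freq=2 }   // child -> index 0 of m_leafLevel
    // m_leafLevel    (trigram): { id=3, freq=1 }, { id=4, freq=1 }
    //
    // Each node's child field is the index of its first child in the next level's vector;
    // its children end where the children of the next node at the same level begin.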
Note that this model still leaves room for acceleration. Sunpinyin uses a threading trick: when computing P(ABC) and then P(BCD), the node reached for ABC is threaded directly to the level-1 node for B, so the level-1 binary search is eliminated.
In fact, I think a simpler method is to store level 1 as one large vector indexed directly by word id, giving O(1) access, because level 1 takes little memory while its binary search is the most time-consuming.
Better optimizations can be thought about later; this model is basically enough. I use it to store ngram information occupying MB of memory, and with a simple trigram model the probabilistic word segmentation can reach 4-5 M per second.
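A minimal sketch of that level-1 shortcut, assuming word ids are small and dense; the member m_level1Index and the helper buildLevel1Index are names I invented for illustration:

    // word id -> index into m_nodeLevel[1], or -1 if the unigram was never seen
    std::vector<int> m_level1Index;

    // build once, after the model (including its tail sentinels) is finalized
    void buildLevel1Index(size_t vocabSize) {
        m_level1Index.assign(vocabSize, -1);
        for (size_t i = 0; i < m_nodeLevel[1].size(); ++i)
            m_level1Index[m_nodeLevel[1][i].id] = static_cast<int>(i);
    }
    // looking up a unigram node then becomes a single array access instead of a binary search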
Ngram Model Creation Algorithm
    /** Initialize the resources for building the n-gram model */
    void init(int n) {
        // ------- set the order (n) of the model
        m_nLevel = n;
        // ------- initialize the core data structure space
        m_nodeLevel.resize(n);
        for (size_t i = 1; i < n; i++) {
            m_nodeLevel[i].reserve(kMemInitAllocSize);
        }
        m_leafLevel.reserve(kMemInitAllocSize);
        // m_level.resize(n + 1, NULL);
        // add the root node
        m_nodeLevel[0].push_back(Node(0, 0, 0));
        // ------- initialize the m_nr two-dimensional count array
        m_nr.resize(n + 1);
        for (size_t i = 0; i < n + 1; i++) {
            m_nr[i].resize(kngrammaxcsp3, 0);
        }
        // ------- initialize the pruning (cutoff) array
        m_cut.resize(n + 1, 0);
    }
    /**
     * Note that the input tuples are sorted lexicographically by word id.
     * If ngram[i] (0 < i <= n-1) is an excluded word, only the occurrences of ngram[0..i-1] are counted;
     * if ngram[i] (0 <= i <= n-1) is the sentence-break word id (we use 10 here), only ngram[0..i] is counted.
     * For a trigram: (9, x, y) is skipped entirely; (x, 9, y) only counts unigram(x);
     * (10, x, y) counts unigram(10), bigram(10, x) and trigram(10, x, y);
     * (x, 10, y) counts unigram(x) and bigram(x, 10).
     */
    void addNGram(const WordID* ngram, FREQ_TYPE fr) {
        int ch;
        bool brk = isExcludeID(*ngram);
        // if the first position is an id to ignore (e.g. the exclude id 9), collect nothing at all
        if (!brk) {
            m_nodeLevel[0][0].freq += fr;   // the 0-gram root accumulates the total 1-gram count
        } else {
            return;
        }
        reallocMem();   // manually grow the vectors if more space is needed
        bool branch = false;
        // i = 1: level 1 corresponds to ngram[i-1], i.e. ngram[0]; ngram[0] must not be the exclude id 9.
        // e.g. the tuple <100,200,300> adds, in order, <100>, <100,200>, <100,200,300>
        for (int i = 1; (!brk && i < m_nLevel); i++) {
            NodeLevel& pv = m_nodeLevel[i - 1];
            NodeLevel& v  = m_nodeLevel[i];
            // A new node must be pushed at this level when any of the following holds:
            // 1. the previous level already branched: once a new state appears, every deeper state is new.
            //    e.g. after <1,2,3>, the tuple <1,4,3> creates a new level-3 state even though the id 3
            //    matches, because level 2 went from 2 to 4;
            // 2. this is the very first addNGram call at i = 1: pv.back().child >= v.size();
            // 3. the id at this position changed: v.back().id != ngram[i-1].
            //    e.g. for 1 2 3, 1 2 4, 2 2 4, 2 4 4: at level 2 the third "2" is a new level-2 state
            //    (its level-1 parent changed), while the second "2" is not!
            branch = branch | (pv.back().child >= v.size()) | (v.back().id != ngram[i - 1]);
            if (branch) {
                if (i == m_nLevel - 1)
                    ch = m_leafLevel.size();
                else
                    ch = m_nodeLevel[i + 1].size();
                v.push_back(Node(ngram[i - 1], ch, fr));
            } else {
                v.back().freq += fr;
            }
            // decide whether ngram[i] at the next level (i+1) is still valid; note that the break id
            // itself is still counted in the unigram, bigram and trigram that end with it
            brk = (i > 1 && isBreakID(ngram[i - 1])) || isExcludeID(ngram[i]);
        }
        // finally, add the full n-tuple at the leaf level
        if (!brk) {
            if (fr > m_cut[m_nLevel]) {
                m_leafLevel.push_back(Leaf(ngram[m_nLevel - 1], fr));
            } else {
                // tuples pruned by the cutoff are still recorded in the count statistics
                m_nr[m_nLevel][0] += fr;
                m_nr[m_nLevel][fr] += fr;
            }
        }
    }
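A hedged usage sketch, assuming a trigram model and the hypothetical NGramCount records from the earlier sketch; the essential point is only that the tuples must arrive in lexicographic order:

    init(3);   // trigram model
    // 'counts' holds <trigram, count> records already sorted by sortNGramCounts above
    for (const NGramCount& rec : counts) {
        addNGram(rec.ids.data(), rec.freq);   // rec.ids holds exactly 3 word ids
    }
    // after the last tuple, append a tail (sentinel) node to each level so the
    // [iter->child, (iter+1)->child) ranges used for binary search are well defined (see below)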

This completes the creation of the ngram model. Note that a tail (sentinel) node must be appended to each level so that binary search ranges are well defined. For example, if iter points at the level-1 node for state A and we want to reach the state A->C, i.e. the level-2 node whose id is C under A, we only need to binary-search the range of m_nodeLevel[2] whose indexes lie in [iter->child, (iter+1)->child).
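To make that lookup concrete, here is a minimal helper sketch of my own (not sunpinyin code) built on the Node/NodeLevel structures defined above; it relies on the tail sentinel so that (iter + 1) is always valid:

    #include <algorithm>

    // Find the child of *iter with word id 'id' in the next (non-bottom) level.
    // Returns nullptr if no such child exists.
    const Node* findChild(const NodeLevel& nextLevel, const Node* iter, WordID id) {
        const Node* beg = nextLevel.data() + iter->child;
        const Node* end = nextLevel.data() + (iter + 1)->child;   // sentinel closes the range
        const Node* it  = std::lower_bound(beg, end, id,
                                           [](const Node& n, WordID w) { return n.id < w; });
        return (it != end && it->id == id) ? it : nullptr;
    }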

Test results for the simple ngram frequency model and probability model:

Here pr is the conditional probability P(END | AB...), and pr2 is the joint probability P(AB...END); a probability of 0 is printed as -1. This is a plain maximum-likelihood ngram model without any smoothing, so many ngrams end up with no probability at all; the smoothing methods for the model will be introduced next.

TestNgram/test_freq.input

line ------- beautiful day    freq ------- 0
line ------- the greatest     freq ------- 0
line ------- Xiaohong read    freq ------- 1
line ------- James Read       freq ------- 1
line ------- a book           freq ------- 0
line ------- B beautiful      freq ------- 3
line ------- beautiful        freq ------- 3
line ------- beautiful world  freq ------- 1
line ------- beautiful soul   freq ------- 2
line ------- beautiful        freq ------- 3
line ------- book b           freq ------- 2
line ------- book             freq ------- 2
line ------- one book b       freq ------- 2
line ------- mind b           freq ------- 2
line ------- B James Read     freq ------- 1
line ------- B b              freq ------- 0
line ------- B b              freq ------- 0
line ------- b                freq ------- 11
// note: the break id 10 occurs 11 times. The end of one sentence and the start of the next
// are counted as a single break id, and the end of the last sentence is not counted, so the
// number of break ids equals the number of sentence starts.

i ------- 2
i ------- 1
i ------- 0

line ------- beautiful day    pr ------- -1        pr2 ------- -1
line ------- the greatest     pr ------- -1        pr2 ------- -1
line ------- Xiaohong read    pr ------- 1         pr2 ------- 0.025
line ------- James Read       pr ------- 1         pr2 ------- 0.025
line ------- a Book           pr ------- -1        pr2 ------- -1
line ------- B beautiful      pr ------- 1         pr2 ------- 0.075
line ------- beautiful        pr ------- 1         pr2 ------- 0.075
line ------- beautiful world  pr ------- 0.333333  pr2 ------- 0.025
line ------- beautiful soul   pr ------- 0.666667  pr2 ------- 0.05
line ------- beautiful        pr ------- 0.075     pr2 ------- 0.075
line ------- this book b      pr ------- 1         pr2 ------- 0.05
line ------- this Book        pr ------- 0.05      pr2 ------- 0.05
line ------- a book b         pr ------- 1         pr2 ------- 0.05
line ------- spiritual b      pr ------- 1         pr2 ------- 0.05
line ------- B James Read     pr ------- 1         pr2 ------- 0.025
line ------- B b              pr ------- -1        pr2 ------- -1
line ------- B b              pr ------- -1        pr2 ------- -1
line ------- b                pr ------- 0.275     pr2 ------- 0.275
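As a worked check of my own (not part of the original output), the definitions above correspond to the usual maximum-likelihood estimates, and for instance the "beautiful soul" and "b" rows can be reproduced assuming a total of N = 40 unigram tokens:

    pr  = P(w_n | w_1 .. w_{n-1}) = c(w_1 .. w_n) / c(w_1 .. w_{n-1})
    pr2 = P(w_1 .. w_n)           = c(w_1 .. w_n) / N

    e.g. pr("beautiful soul")  = 2 / 3  = 0.666667   (c("beautiful") = 3)
         pr2("beautiful soul") = 2 / 40 = 0.05       (and pr2("b") = 11 / 40 = 0.275)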
 
 
