TreeSplitter: a tree-shaped word segmentation algorithm

Source: Internet
Author: User

Note: The idea is not original. My thanks go first to the author whose flash of inspiration it was.

I. Description:

There are many word segmentation algorithms and plenty of ready-made components, but it is hard to find one that does what I need. I only want a segmentation function that does exactly what its name says: it splits text into the words that are in the dictionary, and nothing more. The goals of intelligent segmentation are beyond question, but the difficulty rises with the level of intelligence, and it is not something you or I (well, certainly not I) can come up with on a whim while walking down the street. The mature segmentation methods are all dictionary-based. In keeping with the DRY principle, extended here to DRO (Don't Repeat Others), if you want to understand them please see here or just Google them. Their shortcomings have been stated very clearly by the author of the original idea. I have run into them, and you probably have too.

This problem bothered me for a long time. On the day before the Qingming holiday I searched for a few articles on segmentation and decided to use the three days off to understand how segmentation works. While reading I found that most of them describe the "maximum matching method", but the article on tree-shaped segmentation gave me hope: it was exactly what I needed. At first I wanted to stick to the DRO principle and searched the internet for source code, but nothing turned up, so the only option left was DIY (do it yourself).

II. Implementation:

1. Create the Node class

Node represents one node of the tree. Each node has a value, one parent node, and any number of child nodes. It also needs to know whether it ends a word, that is, whether the path from the root to this node spells a word (hereinafter "term"). The method GetTermValue walks back up to the root to recover that term.

Node class
    
    using System;
    using System.Collections.Generic;

    [Serializable]
    public class Node
    {
        public char Value { get; set; }
        public Node Parent { get; set; }
        public bool End { get; set; }
        public Dictionary<char, Node> Children { get; set; }
        public static Node Empty = new Node(' ', null);

        public Node(char value, Node parent)
        {
            this.Value = value;
            this.Parent = parent;
            Children = new Dictionary<char, Node>();
        }

        // Walk up to the root, collecting characters to rebuild the term
        public string GetTermValue()
        {
            List<char> chars = new List<char>(5);
            Node node = this;
            while (node.Parent != null)
            {
                chars.Insert(0, node.Value);
                node = node.Parent;
            }
            return new string(chars.ToArray());
        }
    }
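As a quick illustration of how GetTermValue recovers a term, the sketch below links three nodes by hand and reads the term back from the last one. It repeats a copy of Node so that it compiles on its own. The NodeDemo class and its AddChild helper are hypothetical names introduced only for this example, and the first-level node is assumed to hang off Node.Empty, as it does in the dictionary builder.

```csharp
using System;
using System.Collections.Generic;

public class Node
{
    public char Value { get; set; }
    public Node Parent { get; set; }
    public bool End { get; set; }
    public Dictionary<char, Node> Children { get; set; }
    public static Node Empty = new Node(' ', null);

    public Node(char value, Node parent)
    {
        Value = value;
        Parent = parent;
        Children = new Dictionary<char, Node>();
    }

    public string GetTermValue()
    {
        var chars = new List<char>(5);
        Node node = this;
        while (node.Parent != null)
        {
            chars.Insert(0, node.Value);
            node = node.Parent;
        }
        return new string(chars.ToArray());
    }
}

public static class NodeDemo
{
    // Hypothetical helper: adds a child under parent and returns it.
    public static Node AddChild(Node parent, char value, bool end)
    {
        var child = new Node(value, parent) { End = end };
        parent.Children.Add(value, child);
        return child;
    }

    public static string Run()
    {
        // The first-level node's parent is Node.Empty, whose own parent is null,
        // so the walk in GetTermValue stops right after the first character.
        var root = new Node('t', Node.Empty);
        var e = AddChild(root, 'e', false);
        var a = AddChild(e, 'a', true);
        return a.GetTermValue(); // "tea"
    }
}
```

Because Node.Empty is the only node whose Parent is null, the loop includes the first-level character and never emits the placeholder value of Node.Empty itself.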
    

2. Create the tree-shaped dictionary

If you can picture the structure of the tree in your head, this step is easy. As for the tree structure itself, I only finished high school and know my powers of explanation are modest, so please see the original author's article. The simplest way to store the tree is direct serialization, but because every parent node holds all of its child nodes, the first build gave me a real shock: the text dictionary, compiled from the Pangu segmenter's dictionary, has roughly 150,000 words and is about 1.5 MB, while the tree-shaped dictionary came out at 80 MB! At that point optimization was unavoidable:

1) Rename every Node property to a single letter. Testing showed how silly this was: the size shrank by only a little over 1 MB. Serialization does not store data the way MongoDB does; I really was overthinking it.

2) Can the storage structure of the tree be changed? Lookup speed matters a great deal for segmentation, so once the tree is loaded it must end up as Node objects. The question is whether it could be stored as a flat structure, or whether each parent could reference only its first child, the first child its next sibling, and so on. I considered several variants and not one test succeeded. You only regret not having read more when you need the knowledge: I never studied data structures, so my thinking here is limited, and a very good method may well exist.

3) Compress the stream. This is simple, because the .NET class library has it built in. Testing showed that it is not only simple but also effective: the tree-shaped dictionary shrank to a little over 20 MB, which I can accept.

4) Unmanaged code. I have always been in awe of C. There is a lot of it I do not understand, but it is interesting to look at. I had never used unsafe code in .NET either, so I took this opportunity to try it, and only for accessing the characters of a string. After Ctrl+F5 it really ran! Performance, however, was basically unchanged; safe code seems to be very efficient here as well. So this hardly counts as an optimization and mostly satisfied my own vanity.
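The flat layout considered in point 2 can be sketched roughly as follows. This is not the author's code (he reports that none of his attempts worked) but one possible shape, under the assumption that nodes live in a single list and refer to each other by index. The names FlatNode, FlatTree, Add and Contains are hypothetical.

```csharp
using System;
using System.Collections.Generic;

// One node of the flat tree; links are list indexes, -1 means "none".
public struct FlatNode
{
    public char Value;
    public bool End;
    public int FirstChild;
    public int NextSibling;
}

public class FlatTree
{
    private readonly List<FlatNode> nodes = new List<FlatNode>();

    public FlatTree()
    {
        // Index 0 is an artificial root that carries no character.
        nodes.Add(new FlatNode { Value = '\0', FirstChild = -1, NextSibling = -1 });
    }

    public int Count { get { return nodes.Count; } }

    // Scan the sibling chain under parent for value; returns -1 if absent.
    private int FindChild(int parent, char value)
    {
        for (int i = nodes[parent].FirstChild; i != -1; i = nodes[i].NextSibling)
            if (nodes[i].Value == value) return i;
        return -1;
    }

    public void Add(string term)
    {
        int current = 0;
        foreach (char c in term)
        {
            int child = FindChild(current, c);
            if (child == -1)
            {
                // Prepend the new node to the parent's sibling chain.
                child = nodes.Count;
                nodes.Add(new FlatNode { Value = c, FirstChild = -1, NextSibling = nodes[current].FirstChild });
                FlatNode parent = nodes[current]; // structs are copied, so write the copy back
                parent.FirstChild = child;
                nodes[current] = parent;
            }
            current = child;
        }
        if (current != 0)
        {
            FlatNode last = nodes[current];
            last.End = true;
            nodes[current] = last;
        }
    }

    public bool Contains(string term)
    {
        int current = 0;
        foreach (char c in term)
        {
            current = FindChild(current, c);
            if (current == -1) return false;
        }
        return current != 0 && nodes[current].End;
    }
}
```

A list of small structs like this has no per-node dictionary to serialize, which is where the size saving would come from. The price is that finding a child becomes a linear scan over siblings instead of a hash lookup, so whether it is fast enough for segmentation would have to be measured.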

For the tree type itself, see the source code. The creation method below is a static method of the tree type. With dictionary maintenance in mind, the build covers both creating and appending; appending relies on the LoadRootNode method, which is given next.

Create the tree-shaped dictionary

    public static void BuildDict(string[] lines, string dictPath)
    {
        var capitals = lines.Select(line => line[0]).Distinct().ToArray(); // first-level nodes
        foreach (var capital in capitals)
        {
            Node rootNode = LoadRootNode(capital, dictPath, true);
            if (rootNode == null) rootNode = new Node(capital, Node.Empty);
            #region Build the group of nodes whose first character is capital
            var terms = lines.Where(line => line[0] == capital);
            foreach (var term in terms)
            {
                int length = term.Length;
                if (length == 1) rootNode.End = true; // the first character is itself a term
                var parentNode = rootNode;
                unsafe
                {
                    fixed (char* cs = term)
                    {
                        int i = 1; char curChar;
                        while ((curChar = *(cs + i)) != char.MinValue)
                        {
                            Node curDictNode = null;
                            if (parentNode.Children.ContainsKey(curChar))
                                curDictNode = parentNode.Children[curChar];
                            else
                            {
                                curDictNode = new Node(curChar, parentNode);
                                parentNode.Children.Add(curChar, curDictNode);
                            }
                            if (i == length - 1) curDictNode.End = true;
                            parentNode = curDictNode;
                            i++;
                        }
                    }
                }
            }
            #endregion
            using (FileStream stream = new FileStream(GetDictFileName(capital, dictPath), FileMode.Create))
            using (GZipStream zip = new GZipStream(stream, System.IO.Compression.CompressionMode.Compress))
            {
                var formatter = new BinaryFormatter();
                formatter.FilterLevel = System.Runtime.Serialization.Formatters.TypeFilterLevel.Low;
                formatter.TypeFormat = System.Runtime.Serialization.Formatters.FormatterTypeStyle.TypesWhenNeeded;
                formatter.Serialize(zip, rootNode);
            }
            rootNode = null;
        }
        GC.Collect();
    }
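To make the node-building loop easier to follow on its own, here is a minimal in-memory sketch of the same idea in safe code, with no files, compression or serialization involved. It departs from the method above in a few labelled ways: it groups terms by first character in a single pass, it skips empty lines, and it returns the first-level nodes in a dictionary. InMemoryBuilder, Build and Terms are hypothetical names, and Node is a trimmed copy so that the example stands alone.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Trimmed copy of Node (fields instead of properties) so the sketch compiles alone.
public class Node
{
    public char Value;
    public Node Parent;
    public bool End;
    public Dictionary<char, Node> Children = new Dictionary<char, Node>();
    public static readonly Node Empty = new Node(' ', null);

    public Node(char value, Node parent) { Value = value; Parent = parent; }

    public string GetTermValue()
    {
        var chars = new List<char>();
        for (Node n = this; n.Parent != null; n = n.Parent) chars.Insert(0, n.Value);
        return new string(chars.ToArray());
    }
}

public static class InMemoryBuilder
{
    // Builds one first-level node per distinct first character; nothing is written to disk.
    public static Dictionary<char, Node> Build(IEnumerable<string> lines)
    {
        var roots = new Dictionary<char, Node>();
        foreach (string term in lines.Where(l => !string.IsNullOrEmpty(l)))
        {
            Node parent;
            if (!roots.TryGetValue(term[0], out parent))
            {
                parent = new Node(term[0], Node.Empty);
                roots.Add(term[0], parent);
            }
            for (int i = 1; i < term.Length; i++)
            {
                Node child;
                if (!parent.Children.TryGetValue(term[i], out child))
                {
                    child = new Node(term[i], parent);
                    parent.Children.Add(term[i], child);
                }
                parent = child;
            }
            parent.End = true; // the last node reached closes the term
        }
        return roots;
    }

    // Collects every stored term by visiting all nodes marked End.
    public static List<string> Terms(Node node)
    {
        var result = new List<string>();
        if (node.End) result.Add(node.GetTermValue());
        foreach (Node child in node.Children.Values) result.AddRange(Terms(child));
        return result;
    }
}
```

Calling InMemoryBuilder.Build(new[] { "tea", "ten", "to", "a" }) yields two first-level nodes, 't' and 'a'. The terms "tea" and "ten" share the nodes for 't' and 'e', which is exactly the sharing that makes the tree useful for lookup.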
