Huffman tree (optimal binary tree) and Huffman algorithm

Source: Internet
Author: User
Tags: data structures
Huffman Tree

Most data structures textbooks introduce the Huffman tree and Huffman coding in the chapter on trees. Huffman coding is an application of the Huffman tree, and it is widely used; for example, the JPEG image format uses Huffman coding.

First, what is a Huffman tree? A Huffman tree, also known as an optimal binary tree, is the binary tree with the smallest weighted path length. The path length from a leaf node to the root is the level of that leaf (the root is at level 0). Given n weights wi (i = 1, 2, ..., n) assigned to the n leaf nodes of a binary tree, with Li (i = 1, 2, ..., n) the path length of the leaf carrying weight wi, the tree's weighted path length is WPL = W1*L1 + W2*L2 + W3*L3 + ... + Wn*Ln. It can be proved that the Huffman tree minimizes WPL.

In the early 1950s, Huffman proposed this code, which constructs the encoding with the shortest average length from the probabilities with which the characters occur. It is a variable-length encoding: if the code word lengths are strictly in reverse order of the probabilities of their corresponding symbols, the average encoding length is minimal. (Note: a code word is the code assigned to a symbol by Huffman encoding; because its length varies with the symbol's probability of occurrence, Huffman coding is a variable-length encoding.)

But how is a Huffman tree constructed? The most common construction method is the Huffman algorithm, which most data structures textbooks describe as follows:

1. For the given n weights {w1, w2, w3, ..., wi, ..., wn}, form the initial set of n binary trees F = {T1, T2, T3, ..., Ti, ..., Tn}, where each binary tree Ti consists of a single root node with weight wi and empty left and right subtrees. (To simplify the implementation on a computer, the trees Ti are usually kept sorted in ascending order of wi.)
2. In F, select the two trees whose root nodes have the smallest weights as the left and right subtrees of a new binary tree; the weight of the new tree's root node is the sum of the weights of its left and right subtrees' root nodes.
3. Remove those two trees from F and add the new binary tree to F in ascending order.
4. Repeat steps 2 and 3 until only one binary tree remains in F.

This algorithm can be implemented in C with either a static or a dynamic binary tree. For a dynamic binary tree, the following data structures can be used:

struct tree {
    float weight;          /* weight */
    union {
        char leaf;         /* leaf node: the character it stores */
        struct tree *left; /* internal node: left child */
    };
    struct tree *right;    /* right child */
};

struct forest {            /* the set F, represented as a linked list */
    struct tree *ti;       /* a tree in F */
    struct forest *next;   /* next node in the list */
};
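The merging loop of the Huffman algorithm can be sketched against these structures. The following is a minimal, illustrative sketch (the function names `insert_sorted` and `build_huffman` are my own, not from the original text); it assumes the forest list is kept sorted in ascending order of weight, as step 1 requires, and that a leaf is marked by a NULL `right` pointer:

```c
#include <stdlib.h>

struct tree {
    float weight;          /* weight */
    union {
        char leaf;         /* leaf node: the character it stores */
        struct tree *left; /* internal node: left child */
    };
    struct tree *right;    /* right child (NULL for a leaf) */
};

struct forest {            /* the set F, represented as a linked list */
    struct tree *ti;       /* a tree in F */
    struct forest *next;   /* next node in the list */
};

/* Insert tree t into list f, keeping f sorted by ascending weight. */
static struct forest *insert_sorted(struct forest *f, struct tree *t) {
    struct forest *node = malloc(sizeof *node);
    node->ti = t;
    if (!f || t->weight < f->ti->weight) {
        node->next = f;
        return node;
    }
    struct forest *p = f;
    while (p->next && p->next->ti->weight <= t->weight)
        p = p->next;
    node->next = p->next;
    p->next = node;
    return f;
}

/* Steps 2-4: repeatedly merge the two lightest trees until one remains. */
static struct tree *build_huffman(struct forest *f) {
    while (f && f->next) {
        struct forest *a = f, *b = f->next; /* two smallest: list is sorted */
        f = b->next;
        struct tree *t = malloc(sizeof *t);
        t->weight = a->ti->weight + b->ti->weight;
        t->left = a->ti;
        t->right = b->ti;
        f = insert_sorted(f, t);            /* add merged tree back to F */
        free(a);
        free(b);
    }
    return f ? f->ti : NULL;                /* root of the Huffman tree */
}
```

Because the list stays sorted, each merge takes the first two nodes directly; the textbook variant that rescans F for the two minima each round works equally well without the sorted invariant.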
Huffman Code

Huffman coding is a coding method based on the Huffman tree, i.e. the optimal binary tree, the binary tree with the smallest weighted path length, and it is often applied to data compression. In computer information processing, Huffman coding is an entropy coding method used for lossless data compression. The term refers to encoding the source symbols (for example, the characters in a file) with a special coding table. What makes this table special is that it is built from the estimated probability of each source symbol: high-probability symbols get shorter codes and low-probability symbols get longer codes, which reduces the expected length of the encoded string and thereby achieves lossless compression. The method was developed by David A. Huffman.

For example, in English, E occurs very frequently while Z is the rarest letter. When Huffman coding is used to compress a piece of English text, E is very likely to be represented by a single bit, while Z may take 25 bits (not 26). In an ordinary fixed-length representation, every English letter occupies one byte, i.e. 8 bits. By comparison, E uses 1/8 of the length of the ordinary encoding, while Z uses more than three times as much. If the probability of each letter in English can be estimated more accurately, the lossless compression ratio can be improved considerably.
