Huffman trees and Huffman coding
Today I saw a question about Huffman coding: given the string abcdabaa, what is the total length of the binary string after Huffman encoding? The answer is 14.
I didn't know much about Huffman trees, so after looking them up I summarized the following key points to share with you. Part of the content is taken from Baidu.
The Huffman tree, also known as the optimal binary tree, is the binary tree with the shortest weighted path length. It is an application of binary trees that is commonly used in information retrieval.
Some related concepts:
1. Path length between nodes: the number of branches on the path from one node to another.
2. Path length of a tree: the sum of the path lengths from the root node to every node in the tree.
3. Weighted path length of a node: the product of the node's weight and the path length from the root node to that node.
4. Weighted path length of a tree (WPL): the sum of the weighted path lengths of all leaf nodes in the tree.
Among all binary trees with the same set of leaf weights, the one with the smallest weighted path length is called the Huffman tree, or the optimal binary tree.
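As a small illustration of definitions 3 and 4, the sketch below computes the weighted path length of a tree from its leaves alone. The function name and the two sets of (weight, depth) pairs are illustrative assumptions, not taken from the original text; they describe two different tree shapes over the same four leaf weights.

```python
def weighted_path_length(leaves):
    """Sum of weight * depth over all leaves.

    `leaves` is an iterable of (weight, depth) pairs, where depth is the
    path length from the root to that leaf.
    """
    return sum(weight * depth for weight, depth in leaves)


# Same leaf weights (4, 3, 2, 1), two hypothetical tree shapes.
skewed = [(4, 1), (3, 2), (2, 3), (1, 3)]    # heavier leaves closer to the root
balanced = [(4, 2), (3, 2), (2, 2), (1, 2)]  # every leaf at depth 2

print(weighted_path_length(skewed))    # 19
print(weighted_path_length(balanced))  # 20
```

Placing the heavier leaves nearer the root gives the smaller total, which is the property the Huffman tree optimizes.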
There is a very important theorem about Huffman trees: a Huffman tree with n leaf nodes has 2n - 1 nodes in total.
The theorem can be explained as follows. A binary tree has three kinds of nodes: nodes of degree 2, nodes of degree 1, and leaf nodes of degree 0 (degree here counts children only). Every non-leaf node of a Huffman tree is created by merging two nodes, so no node can have degree 1. Each merge reduces the number of trees by one, so going from n single-node trees to one tree takes n - 1 merges, which create n - 1 non-leaf nodes, one fewer than the number of leaves. The total is therefore n + (n - 1) = 2n - 1, which proves the theorem.
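The counting argument can be checked with a short simulation. This is only a sketch: the function name is hypothetical, and it tracks nothing but the number of trees and nodes, not the weights.

```python
def count_huffman_nodes(n_leaves):
    """Count the nodes created when n single-node trees are merged into one."""
    trees = n_leaves   # the forest starts with one single-node tree per leaf
    nodes = n_leaves
    while trees > 1:
        trees -= 1     # two trees are merged into one
        nodes += 1     # the merge creates exactly one new non-leaf node
    return nodes


for n in (1, 2, 4, 10):
    print(n, count_huffman_nodes(n), 2 * n - 1)  # the last two columns agree
```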
The following algorithm constructs a Huffman tree:
1) Given n weights {W1, W2, W3, ..., Wi, ..., Wn}, form an initial set F = {T1, T2, T3, ..., Ti, ..., Tn} of n binary trees, where each binary tree Ti consists of only a root node with weight Wi, and its left and right subtrees are empty.
2) In F, select the two trees whose root nodes have the smallest weights and use them as the left and right subtrees of a newly constructed binary tree. The weight of the root node of the new binary tree is the sum of the root weights of its left and right subtrees.
3) Delete these two trees from F and add the new binary tree to F, keeping F in ascending order of root weight.
4) Repeat 2) and 3) until only one binary tree remains in F. That tree is the Huffman tree.
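The four steps above can be sketched in Python as follows. This is a minimal sketch, not a definitive implementation: it assumes a min-heap (the standard `heapq` module) in place of a sorted set F, and the tuple layout `(weight, left, right)`, the function names, and the sample weights are all choices made for this example.

```python
import heapq
from itertools import count


def build_huffman_tree(weights):
    """Return the root of a Huffman tree as nested (weight, left, right) tuples.

    Leaves are (weight, None, None).
    """
    if not weights:
        raise ValueError("at least one weight is required")
    tie = count()  # unique tie-breaker so heapq never compares the tree tuples
    # Step 1: one single-node tree per weight.
    heap = [(w, next(tie), (w, None, None)) for w in weights]
    heapq.heapify(heap)
    # Step 4: repeat until a single tree remains.
    while len(heap) > 1:
        # Step 2: take the two trees with the smallest root weights.
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        # Step 3: replace them with the merged tree.
        heapq.heappush(heap, (w1 + w2, next(tie), (w1 + w2, left, right)))
    return heap[0][2]


def wpl(tree, depth=0):
    """Weighted path length: sum of weight * depth over the leaves."""
    weight, left, right = tree
    if left is None:
        return weight * depth
    return wpl(left, depth + 1) + wpl(right, depth + 1)


def count_nodes(tree):
    """Total number of nodes (leaves and non-leaves) in the tree."""
    _, left, right = tree
    if left is None:
        return 1
    return 1 + count_nodes(left) + count_nodes(right)


root = build_huffman_tree([4, 3, 2, 1])  # sample weights chosen for illustration
print(root[0])            # 10, the sum of all weights
print(wpl(root))          # 19
print(count_nodes(root))  # 7, i.e. 2 * 4 - 1
```

The heap keeps the smallest root weights at the front, which plays the role of "F in ascending order" without re-sorting after every merge.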
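For example, for a Huffman tree whose leaves have weights 1 and 3 at depth 3, weight 5 at depth 2, and weight 7 at depth 1, the weighted path length is WPL = (1 + 3) x 3 + 5 x 2 + 7 x 1 = 29.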
Huffman coding:
The Huffman tree can be used to solve the message encoding problem. Suppose you need to encode the string "abcdabcaba" as a uniquely decodable binary code, and the resulting binary string must be as short as possible.
Let the frequency of each character in the string be W, its code length be L, and the number of distinct characters be n. The total length of the encoded binary string is W1L1 + W2L2 + ... + WnLn, which is exactly the weighted path length that a Huffman tree minimizes. Binary encoding can therefore be based on the Huffman tree construction to minimize the message length.
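If only the total length is needed, it can be computed without building the tree explicitly. The sketch below relies on the fact that each leaf's weight is added once for every merge above it, that is, once per bit of its code, so the sum of all merged weights equals W1L1 + W2L2 + ... + WnLn. The function name is hypothetical.

```python
import heapq
from collections import Counter


def huffman_total_bits(text):
    """Total length in bits of the Huffman encoding of `text`.

    Note: a text with a single distinct character yields 0 here, because
    no merge takes place; that degenerate case needs separate handling.
    """
    heap = list(Counter(text).values())  # character frequencies as weights
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        merged = heapq.heappop(heap) + heapq.heappop(heap)
        total += merged  # every leaf below this merge gains one bit
        heapq.heappush(heap, merged)
    return total


print(huffman_total_bits("abcdabaa"))    # 14
print(huffman_total_bits("abcdabcaba"))  # 19
```

The first call reproduces the answer to the question quoted at the start of this post.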
"abcdabcaba" contains four distinct characters, a, b, c, and d, which occur 4, 3, 2, and 1 times respectively. Taking the number of occurrences as the weights, construct a Huffman tree with a, b, c, and d as leaves; the result is shown in the figure at the lower left.
Starting from the root of the Huffman tree, assign the code "0" to each left branch and "1" to each right branch until the leaf nodes are reached. Concatenating the codes along the path from the root to each leaf gives the Huffman code of that leaf, as shown in the figure at the lower right.
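The labeling rule can be sketched as a recursive traversal. The figure is not reproduced here, so the nested-tuple tree below is an assumption: it is one Huffman tree for the frequencies 4, 3, 2, and 1, with a leaf written as a character and a non-leaf node as a (left, right) pair. The function name is also hypothetical.

```python
def assign_codes(tree, prefix=""):
    """Map each leaf character to its code: "0" for left, "1" for right."""
    if isinstance(tree, str):
        # A leaf: the accumulated path is its code. A tree consisting of
        # a single leaf would have an empty path, so give it "0".
        return {tree: prefix or "0"}
    left, right = tree
    codes = assign_codes(left, prefix + "0")
    codes.update(assign_codes(right, prefix + "1"))
    return codes


# Assumed shape: a directly under the root, then b, then c and d as siblings.
tree = ("a", ("b", ("c", "d")))
print(assign_codes(tree))  # {'a': '0', 'b': '10', 'c': '110', 'd': '111'}
```

Swapping the left and right child at any node gives a different but equally short set of codes, since only the code lengths affect the total.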
The codes of a, b, c, and d are 0, 10, 110, and 111 respectively, so the string "abcdabcaba" is converted to the binary code 0101101110101100100, whose length is only 19. No prefix code for this string is shorter; this encoding is called the Huffman encoding.
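The result can be checked with the code table given above. The sketch below also decodes the bit string again, which works because no code is a prefix of another. The names `CODES`, `encode`, and `decode` are chosen for this example.

```python
CODES = {"a": "0", "b": "10", "c": "110", "d": "111"}


def encode(text, codes):
    """Concatenate the code of every character in `text`."""
    return "".join(codes[ch] for ch in text)


def decode(bits, codes):
    """Read bits left to right, emitting a character whenever a code matches."""
    inverse = {code: ch for ch, code in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:  # unambiguous because the code is prefix-free
            out.append(inverse[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a complete code")
    return "".join(out)


bits = encode("abcdabcaba", CODES)
print(bits)                 # 0101101110101100100
print(len(bits))            # 19
print(decode(bits, CODES))  # abcdabcaba
```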
From the rules described above, a pattern emerges in Huffman encoding: if N leaf nodes need to be encoded, the resulting Huffman tree has at most N levels, so the longest binary code produced by Huffman encoding has at most N - 1 bits.
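The sketch below computes the code length of every leaf for two sets of weights, to show when the longest code reaches N - 1 bits and when it stays shorter. The function name and both weight lists are assumptions made for this example.

```python
import heapq
from itertools import count


def code_lengths(weights):
    """Huffman code length of each weight, in the order given."""
    tie = count()  # unique tie-breaker so heapq never compares the lists
    heap = [(w, next(tie), [i]) for i, w in enumerate(weights)]
    heapq.heapify(heap)
    lengths = [0] * len(weights)
    while len(heap) > 1:
        w1, _, leaves1 = heapq.heappop(heap)
        w2, _, leaves2 = heapq.heappop(heap)
        for i in leaves1 + leaves2:
            lengths[i] += 1  # every leaf under the new node gets one bit deeper
        heapq.heappush(heap, (w1 + w2, next(tie), leaves1 + leaves2))
    return lengths


print(code_lengths([1, 1, 2, 4]))  # [3, 3, 2, 1]: longest code is N - 1 = 3 bits
print(code_lengths([1, 1, 1, 1]))  # [2, 2, 2, 2]: longest code is only 2 bits
```

With strongly skewed weights the tree degenerates into a chain and the longest code reaches N - 1 bits; with equal weights the tree is balanced and the codes are shorter.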