Entropy
Time Limit: 2000/1000 MS (Java/Others) Memory Limit: 65536/32768 K (Java/Others)
Total Submission(s): 4609 Accepted Submission(s): 1900
Problem Description
An entropy encoder is a data encoding method that achieves lossless data compression by encoding a message with "wasted" or "extra" information removed. In other words, entropy encoding removes information that was not necessary in the first place to accurately encode the message. A high degree of entropy implies a message with a great deal of wasted information; English text encoded in ASCII is an example of a message type that has very high entropy. Already compressed messages, such as JPEG graphics or ZIP archives, have very little entropy and do not benefit from further attempts at entropy encoding.
English text encoded in ASCII has a high degree of entropy because all characters are encoded using the same number of bits, eight. It is a known fact that the letters E, L, N, R, S and T occur at a considerably higher frequency than do most other letters in English text. If a way could be found to encode just these letters with four bits, then the new encoding would be smaller, would contain all the original information, and would have less entropy. ASCII uses a fixed number of bits for a reason, however: it's easy, since one is always dealing with a fixed number of bits to represent each possible glyph or character. How would an encoding scheme that used four bits for the above letters be able to distinguish between the four-bit codes and eight-bit codes? This seemingly difficult problem is solved using what is known as a "prefix-free variable-length" encoding.
In such an encoding, any number of bits can be used to represent any glyph, and glyphs not present in the message are simply not encoded. However, in order to be able to recover the information, no bit pattern that encodes a glyph is allowed to be the prefix of any other encoding bit pattern. This allows the encoded bitstream to be read bit by bit, and whenever a set of bits is encountered that represents a glyph, that glyph can be decoded. If the prefix-free constraint were not enforced, then such a decoding would be impossible.
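The bit-by-bit reading described above can be sketched in a few lines of C++. This is only an illustration, not part of the problem statement: the function name `decodePrefixFree` and the representation of the code table as a `std::map` from bit pattern to glyph are assumptions made for the example.

```cpp
#include <map>
#include <string>

// Decode a string of '0'/'1' characters with a prefix-free code table
// (bit pattern -> glyph). Hypothetical helper, for illustration only.
// Trailing bits that do not complete a code word are silently dropped.
std::string decodePrefixFree(const std::map<std::string, char>& table,
                             const std::string& bits) {
    std::string text;
    std::string pending;                 // bits read since the last decoded glyph
    for (char bit : bits) {
        pending += bit;
        auto it = table.find(pending);
        if (it != table.end()) {         // pending is now a complete code word
            text += it->second;
            pending.clear();
        }
    }
    return text;
}
```

Because no code word is a prefix of another, the first match found while extending `pending` is the only possible one, so the decoder never needs to look ahead or backtrack. With a hypothetical table `{"0" -> 'X', "10" -> 'Y', "11" -> 'Z'}`, the bits `010110` decode to `XYZX`.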
Consider the text "AAAAABCD". Using ASCII, encoding this would require 64 bits. If, instead, we encode "A" with the bit pattern "00", "B" with "01", "C" with "10", and "D" with "11" then we can encode this text in only 16 bits; the resulting bit pattern would be "0000000000011011". This is still a fixed-length encoding, however; we're using two bits per glyph instead of eight. Since the glyph "A" occurs with greater frequency, could we do better by encoding it with fewer bits? In fact we can, but in order to maintain a prefix-free encoding, some of the other bit patterns will become longer than two bits. An optimal encoding is to encode "A" with "0", "B" with "10", "C" with "110", and "D" with "111". (This is clearly not the only optimal encoding, as it is obvious that the encodings for B, C and D could be interchanged freely for any given encoding without increasing the size of the final encoded message.) Using this encoding, the message encodes in only 13 bits to "0000010110111", a compression ratio of 4.9 to 1 (that is, each bit in the final encoded message represents as much information as did 4.9 bits in the original encoding). Read through this bit pattern from left to right and you'll see that the prefix-free encoding makes it simple to decode this into the original text even though the codes have varying bit lengths.
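The two encodings of "AAAAABCD" can be reproduced with a small sketch. The helper names `encodeWithTable` and `compressionRatio` are made up for this illustration, and the code assumes every glyph of the text has an entry in the table.

```cpp
#include <map>
#include <string>

// Concatenate the code word of every glyph in text.
// Assumes each glyph has an entry in table (std::map::at throws otherwise).
std::string encodeWithTable(const std::map<char, std::string>& table,
                            const std::string& text) {
    std::string bits;
    for (char glyph : text) {
        bits += table.at(glyph);
    }
    return bits;
}

// Ratio of the 8-bit ASCII size of text to the size of its encoded form.
double compressionRatio(const std::string& text, const std::string& bits) {
    return (8.0 * text.size()) / bits.size();
}
```

With the fixed-length table (A=00, B=01, C=10, D=11) the result is the 16-bit pattern 0000000000011011; with the variable-length table (A=0, B=10, C=110, D=111) it is the 13-bit pattern 0000010110111, and 64 / 13 rounds to 4.9.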
As a second example, consider the text "THE CAT IN THE HAT". In this text, the letter "T" and the space character both occur with the highest frequency, so they will clearly have the shortest encoding bit patterns in an optimal encoding. The letters "C", "I" and "N" only occur once, however, so they will have the longest codes.
There are many possible sets of prefix-free variable-length bit patterns that would yield the optimal encoding, that is, that would allow the text to be encoded in the fewest number of bits. One such optimal encoding is to encode spaces with "00", "A" with "100", "C" with "1110", "E" with "1111", "H" with "110", "I" with "1010", "N" with "1011" and "T" with "01". The optimal encoding therefore requires only 51 bits compared to the 144 that would be necessary to encode the message with 8-bit ASCII encoding, a compression ratio of 2.8 to 1.
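Both claims about this table, that it is prefix-free and that it encodes the text in 51 bits, can be checked mechanically. The following is a sketch with hypothetical helper names (`isPrefixFree`, `encodedBits`); it relies on the fact that, once the code words are sorted, a code word that is a prefix of any other is also a prefix of its immediate successor.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// True if no code word in table is a prefix of (or equal to) another one.
bool isPrefixFree(const std::map<char, std::string>& table) {
    std::vector<std::string> codes;
    for (const auto& entry : table) codes.push_back(entry.second);
    std::sort(codes.begin(), codes.end());
    // In sorted order it is enough to compare each code with its successor.
    for (std::size_t i = 0; i + 1 < codes.size(); ++i) {
        if (codes[i + 1].compare(0, codes[i].size(), codes[i]) == 0) return false;
    }
    return true;
}

// Total length in bits of text under table.
// Assumes every glyph of text has an entry in table.
std::size_t encodedBits(const std::map<char, std::string>& table,
                        const std::string& text) {
    std::size_t total = 0;
    for (char glyph : text) total += table.at(glyph).size();
    return total;
}
```

Counting by hand gives the same total: T and space occur 4 times each with 2-bit codes (16 bits), H occurs 3 times with a 3-bit code (9), A twice with a 3-bit code (6), E twice with a 4-bit code (8), and C, I, N once each with 4-bit codes (12), for 16 + 9 + 6 + 8 + 12 = 51 bits.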
Input
The input file will contain a list of text strings, one per line. The text strings will consist only of uppercase alphanumeric characters and underscores (which are used in place of spaces). The end of the input will be signalled by a line containing only the word "END" as the text string. This line should not be processed.
Output
For each text string in the input, output the length in bits of the 8-bit ASCII encoding, the length in bits of an optimal prefix-free variable-length encoding, and the compression ratio accurate to one decimal point.
Sample Input
AAAAABCD
THE_CAT_IN_THE_HAT
END
Sample Output
64 13 4.9
144 51 2.8
Problem summary: compute the Huffman encoding length of each string and output its ratio to the ordinary storage size at eight bits per character. Strategy: sort the string, count how many times each character occurs, and push the counts into a priority queue in which smaller values come out first. Then repeatedly remove the two smallest values, add their sum to the answer, and push the sum back, until only one element is left in the priority queue. Note: if the string contains only one distinct letter, the Huffman encoding length is the length of the string.
Code:
#include <cstdio>
#include <cmath>
#include <iostream>
#include <algorithm>
#include <string>
#include <queue>
const int M = 5000;
using namespace std;

struct node {
    int num;
    // Reversed comparison so that priority_queue pops the smallest num first.
    bool operator < (const node &a) const {
        return num > a.num;
    }
};

bool cmp(char a, char b) {
    return a > b;
}

int main() {
    string s;
    while (cin >> s, s != "END") {
        priority_queue<node> q;
        sort(s.begin(), s.end(), cmp);  // bring equal characters together
        int i, cou = 1;
        node qq;
        // Push the length of each run of equal characters.
        for (i = 1; i < s.size(); ++i) {
            if (s[i] == s[i-1]) {
                ++cou;
            } else {
                qq.num = cou;
                q.push(qq);
                cou = 1;
            }
        }
        qq.num = cou;
        q.push(qq);
        int sum = 0;
        if (q.size() == 1) {
            sum = q.top().num;  // only one distinct character: one bit each
        } else {
            // Merge the two smallest counts until one node remains.
            while (q.size() > 1) {
                node temp1 = q.top(); q.pop();
                node temp2 = q.top(); q.pop();
                node temp;
                temp.num = temp1.num + temp2.num;
                sum += temp.num;
                q.push(temp);
            }
        }
        double temp = (s.size() * 8 + 0.0) / (sum + 0.0);
        printf("%d %d %.1lf\n", (int)s.size() * 8, sum, temp);
    }
    return 0;
}
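For checking the sample data without going through standard input, the same greedy merge can be packaged as a function. This is a variant sketch rather than the program above: the name `huffmanBits` is hypothetical, and it counts characters with a 256-entry frequency table and uses `std::greater<int>` for the min-heap instead of sorting the string and defining a custom `node`. Summing the merged weights gives the encoded length because each merge adds one bit to the code of every character underneath the merged node.

```cpp
#include <functional>
#include <queue>
#include <string>
#include <vector>

// Length in bits of an optimal prefix-free encoding of text.
// Hypothetical helper; returns 0 for an empty string.
int huffmanBits(const std::string& text) {
    int freq[256] = {0};
    for (unsigned char c : text) ++freq[c];

    // Min-heap of occurrence counts.
    std::priority_queue<int, std::vector<int>, std::greater<int>> heap;
    for (int count : freq) {
        if (count > 0) heap.push(count);
    }

    if (heap.empty()) return 0;
    if (heap.size() == 1) return heap.top();  // one distinct glyph: one bit each

    int total = 0;
    while (heap.size() > 1) {
        int a = heap.top(); heap.pop();
        int b = heap.top(); heap.pop();
        total += a + b;       // every glyph under this merge gains one bit
        heap.push(a + b);
    }
    return total;
}
```

For "AAAAABCD" the counts are 5, 1, 1, 1; the merges produce 2, then 3, then 8, and 2 + 3 + 8 = 13, matching the sample output.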
HDOJ 1053 Entropy "STL"