What does information entropy measure?


  


--A discussion with Philip Zhang

Simin

    

In rebutting Barrington, Mr. Philip Zhang made the following point: "The overall efficiency of a language and its script is measured not by nationalism but by information entropy."

Mr. Zhang said:

The basic formula for calculating the efficiency of a script is:

H = -log2(P)

where H is the entropy (the amount of information), measured in bits.

On this basis, he cites the following figures:

The average information entropy of English is 4.03 bits,

of French, 3.98 bits,

of Spanish, 4.01 bits,

of German, 4.10 bits,

of Russian, 4.8 bits,

and of Chinese, 9.65 bits.

From this he readily draws the conclusion that "Chinese characters are backward, whether simplified or traditional."

In fact, his conclusion is not hard to refute. It is even easy, once one knows what the "average information entropy" of a script actually is.

Unfortunately, Mr. Zhang has got the direction wrong by a full 180 degrees.

The formula does exist, and the quantity is called the average information entropy. But it is not a basic formula for the efficiency of a script; it concerns the efficiency of code length in communication coding. It was proposed by Shannon and is used to study how information is encoded. In plain terms, the sender (the information source) encodes its message in a standard form (for example, in 0s and 1s), allowing for possible noise, and sends it out; the receiver receives it, decodes it, and restores the original message. The study focuses on how long a codeword should be: if it is too short, the message cannot be restored correctly; if it is too long, there is redundancy.

Before going further, one point must be stressed: what is saved or wasted here is code length, not the information itself. By way of analogy, when you take money to buy something, how many cents the coin in your hand is worth and whether it is enough to pay are two different things.

Take English as an example. The source alphabet is roughly 26 letters plus a space; this is the basic set. To transmit any one of these symbols to the other party (for example, by Morse code), how many 0-1 digits are needed? The answer is five bits (2^5 = 32, which is enough for 27 symbols).

If you study this process from the standpoint of average information, you find that some letters appear often and others rarely, so the source has a particular property: its information content is not "full." In layman's terms, because only part of the alphabet is used frequently and the rest only rarely, clever coding can shrink the average code length to a little over 4 bits. In practice, because communication capacity is no longer the bottleneck it was half a century ago, the standard coding schemes used in computers are all redundant ones; no one actually adopts the compressed scheme, and it is not even considered worth the trouble.
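The gap between a fixed five-bit code and a "clever" code can be sketched in a few lines of Python. This is only an illustration under stated assumptions: Huffman coding is used as one example of such a scheme, the `sample` sentence is made up, and the function names are hypothetical. Figures from so short a sample will not match the 4.03 bits quoted for English as a whole.

```python
import heapq
import math
from collections import Counter


def entropy_bits(text):
    """Average information per symbol: H = -sum(p * log2(p)). Assumes non-empty text."""
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def huffman_code_lengths(text):
    """Return {symbol: code length in bits} for a Huffman code built from text."""
    counts = Counter(text)
    if len(counts) == 1:
        return {next(iter(counts)): 1}
    # Heap entries are (weight, unique tie-breaker, {symbol: depth so far}).
    heap = [(c, i, {s: 0}) for i, (s, c) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]


def average_code_length(text):
    """Average Huffman code length in bits per symbol of text."""
    counts = Counter(text)
    lengths = huffman_code_lengths(text)
    return sum(counts[s] * lengths[s] for s in counts) / len(text)


# Hypothetical sample; any English text can be substituted.
sample = "the quick brown fox jumps over the lazy dog and then rests in the shade"
fixed_bits = math.ceil(math.log2(27))  # 26 letters plus a space
print("fixed-length code:", fixed_bits, "bits per symbol")
print("entropy of sample:", round(entropy_bits(sample), 3), "bits per symbol")
print("Huffman average:  ", round(average_code_length(sample), 3), "bits per symbol")
```

For any source, the Huffman average lies between the entropy and the entropy plus one bit, which is the sense in which the entropy sets the target for code length rather than rating the script itself.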

So how do you calculate the amount of information?

Take the computer's 0-1 coding as an example. If "0" and "1" occur with equal probability, P is 1/2, its base-2 logarithm is -1, and H is 1. So the information content is one bit. Now suppose they occur unevenly (say "0" appears almost all the time and "1" only occasionally). Then the P of "0" is close to 1 and its logarithm is close to 0, while the P of "1" is close to 0 and its logarithm tends to negative infinity. Taking the weighted average (the product of a quantity tending to zero and one tending to infinity has a limit that can be evaluated by standard mathematical methods), the information content comes out smaller than 1 bit.
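The two-symbol case can be checked numerically. The sketch below is a minimal illustration; the function name `binary_entropy` and the probabilities tried are assumptions chosen for the example.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a source that emits "1" with probability p and "0" otherwise."""
    if p in (0.0, 1.0):
        # p * log2(p) tends to 0 as p tends to 0, so a certain outcome carries no information.
        return 0.0
    q = 1.0 - p
    return -(p * math.log2(p) + q * math.log2(q))


# Equal chances give exactly one bit; the more lopsided the source, the less it carries.
for p in (0.5, 0.1, 0.01):
    print(f"P(1) = {p}: H = {binary_entropy(p):.4f} bits")
```

The value peaks at one bit when both symbols are equally likely and falls toward zero as one symbol comes to dominate.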

Therefore, whatever the elements of a code set (for example, the English alphabet), the maximum amount of information each symbol can carry, when the set is used as efficiently as possible, equals log2(n), where n is the number of symbols in the source. For English this full-load value is 4.75; for Russian it is 5.08; for Chinese it depends on the number of characters: more than 12 for a small character set, more than 14 for a large one, and so on.
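These full-load values follow directly from log2(n). In the sketch below the symbol counts are assumptions: 27 for English (26 letters plus a space), 34 for Russian (33 letters plus a space, which reproduces the quoted 5.08 when truncated), and 4,096 and 16,384 as stand-ins for small and large Chinese character sets.

```python
import math


def max_entropy_bits(n):
    """Largest possible entropy per symbol for an alphabet of n symbols (all equally likely)."""
    return math.log2(n)


# Symbol counts are illustrative assumptions, not figures taken from the article.
alphabets = [
    ("English (26 letters + space)", 27),
    ("Russian (33 letters + space)", 34),
    ("Chinese, small character set", 4096),
    ("Chinese, large character set", 16384),
]
for name, n in alphabets:
    print(f"{name}: n = {n}, log2(n) = {max_entropy_bits(n):.3f} bits")
```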

As noted, the average entropy of an English letter is 4.03 bits, which means some capacity is "wasted": since 2 to the 4th power is 16, this is equivalent to using only about 16 letters evenly. If the "average amount of information" in English fell to 1 or 2 bits, it would be equivalent to having just two or four letters. So Mr. Zhang's praise of English on this score is quite meaningless.
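The "equivalent number of evenly used symbols" is simply 2 raised to the entropy. A minimal sketch, using the entropy figures Mr. Zhang cites; the function name `effective_symbols` is hypothetical.

```python
def effective_symbols(entropy_bits):
    """Number of equally likely symbols that would carry the same entropy per symbol."""
    return 2.0 ** entropy_bits


# Entropy values as quoted in the article, in bits per symbol.
for label, h in [("English", 4.03), ("Chinese", 9.65), ("1 bit", 1.0), ("2 bits", 2.0)]:
    print(f"{label}: H = {h} -> about {effective_symbols(h):.1f} evenly used symbols")
```

Read this way, a lower entropy per symbol just means a smaller effective alphabet, not a more efficient script.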

By the same token, if our ancestors had created only a handful of Chinese characters, the average information entropy would be very small. For example, if only "yes" and "no" were used and no other characters at all, a single bit would be enough.

In thinking that the smaller the average entropy the better, Mr. Zhang has gone in exactly the wrong direction. Evidently his grasp of information science is fragile and imprecise. To use this kind of thing as "evidence" that reforming Chinese characters, which are thousands of years old, is urgently necessary is thoroughly irresponsible!

Mr. Zhang also quoted the following:

"In the 1940s, scientists such as Shannon and Huffman put forward the theory and methods of information entropy. The basic theorem is: for a non-extended memoryless source, the code length of a character cannot be less than the entropy of the source. This theorem applies to all languages and scripts, and it is the scientific and technical foundation of computing and network communication and the basis of engineering design."

The statement itself is correct. I do not know where Mr. Zhang took it from, but he obviously does not understand what it means.

What does it say? Simply that because the average information entropy of English is a little over 4 bits, the effective code length for the English character set in communication must be at least that long. German and Russian have a few more letters than English, so it is normal for each letter to carry slightly more information. The Germans do not reform their alphabet, and certainly not because the amount of information is larger or smaller. More is no bad thing. In fact, as you know, English, German, and Russian letters all use 8 bits in a computer. Eight bits hold 256 characters at full load; the languages share that space, and no one worries about who uses more of it and who uses less. Germans read English, Russians use German, and no one uses these figures to compare the "merits of languages."

Chinese started out with double-byte codes (that is, 16 bits), whose full capacity is more than 60,000 characters; Chinese currently uses about a third of that (other scripts will of course use some of it as well). This has no direct bearing on the efficiency of Chinese. Only if the amount of "meaning" expressed by one Chinese character were, on average, the same as that of one English letter would Chinese characters truly be hopelessly backward!

Is that really so? Are our Chinese characters really so backward? For example, the Chinese character for "I" (我) takes two bytes, while the English "I" takes one. This is the only example in which Chinese does worse than English. But can the hundreds or thousands of characters such as "man, yes, up, and, day, month, with, no, ..." (strictly speaking, all Chinese characters) each be written in English with a single letter? No. English has only 26 letters, so at best only 26 words could beat Chinese. Unfortunately, the only single-letter English words are "I" and "a" (whose meaning is too slight to stand on its own). The others (such as of, on, to, we, me, go, ...) do well just to draw level with Chinese characters. Note too that of the 676 two-letter combinations of the 26 letters, very few mean anything (aa, ab, ac, ad, ae, ..., for instance, are almost all meaningless). So if one compares the byte count of a Chinese character with that of the English word of the same meaning, nine times out of ten the Chinese character is far more economical!
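A rough byte count makes the comparison concrete. The sketch below is an illustration only: the character-to-word pairings are assumptions made for the example, GB2312 stands in for the double-byte Chinese encoding, and ASCII for the one-byte-per-letter English encoding.

```python
def byte_len(text, encoding):
    """Number of bytes text occupies in the given encoding."""
    return len(text.encode(encoding))


# Hypothetical pairings of a Chinese character with an English word of similar meaning.
pairs = [
    ("我", "I"), ("人", "person"), ("是", "is"), ("上", "up"), ("和", "and"),
    ("日", "day"), ("月", "month"), ("有", "have"), ("无", "none"),
]

chinese_total = sum(byte_len(zh, "gb2312") for zh, _ in pairs)
english_total = sum(byte_len(en, "ascii") for _, en in pairs)

for zh, en in pairs:
    print(f"{zh}: {byte_len(zh, 'gb2312')} bytes   {en}: {byte_len(en, 'ascii')} bytes")
print("totals:", chinese_total, "bytes (Chinese) vs", english_total, "bytes (English)")
```

Only the first pair favors English; a different choice of words would shift the totals, but the two-byte ceiling per character stays fixed.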

Of course, English gets around many of these problems by coining abbreviations (UN, USA, WTO), so one should also be cautious about claiming absolute superiority for Chinese characters.

The most ridiculous thing is that if Chinese were converted to pinyin, as the romanization proposal demands (even leaving aside tone marks and the like), the number of bytes would increase greatly, although the "average entropy per letter" might indeed decrease (it can never exceed 5 in any case). Having switched to pinyin, Mr. Zhang could tell people, "My average amount of information has dropped to just over 4" (which is like saying, "I now pay with nothing but one-cent coins, and I don't care that my yearly spending has tripled"). In pinyin, apart from a and e, no single letter may stand as a word, and even a and e must be followed by a space. So if pinyin were used as a script, it would rank first among "scripts" for wasted bytes, which is no small feat! In this light, consider the complaint: "Since 1989, People's Daily and other newspapers have used the same tactics to criticize Chinese script reform, continually publishing articles advocating the 'superiority of Chinese characters' and saying that script reform is blind westernization that will lead to the demise of Chinese cultural tradition, and so on." What a fine thing those newspapers did!
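The cost of switching to pinyin can be sketched the same way. The sentence below and its toneless pinyin rendering are assumptions chosen for illustration, as is the use of GB2312 for the characters and ASCII for the pinyin.

```python
# Hypothetical example sentence ("I am Chinese") and its pinyin without tone marks.
hanzi = "我是中国人"
pinyin = "wo shi zhongguo ren"

hanzi_bytes = len(hanzi.encode("gb2312"))  # two bytes per character
pinyin_bytes = len(pinyin.encode("ascii"))  # one byte per letter or space

print("characters:", hanzi_bytes, "bytes")
print("pinyin:    ", pinyin_bytes, "bytes")
print("ratio:     ", round(pinyin_bytes / hanzi_bytes, 2))
```

Each pinyin letter may carry fewer bits, but many more letters are needed, so the total grows; this is the one-cent-coin situation described above.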

Mr. Zhang also said:

"The average information entropy of Chinese is 9.65 bits, and every Chinese character requires two bytes of space in computer information processing, so the overall efficiency of processing and transmitting information in Chinese is much lower than in English and other phonetic scripts."

This runs completely against basic common sense. To borrow his vehicle analogy, it is like saying, "A wheelbarrow is undoubtedly better than a twelve-wheeled truck: it saves ten times as much and delivers only one tenth"; or like saying, "It is better to go shopping with five cents than with a one-dollar bill."

Although we have shown that Chinese characters are in fact no more redundant than English and other phonetic scripts (in terms of bytes occupied), linguistic questions remain quite complex, and it seems difficult to pass an absolute verdict on the merits of a language. Constructed languages such as Esperanto, mathematical notation, and computer languages are obviously simple and regular, yet they clearly cannot replace the natural languages of everyday life. We shall not discuss this question for the moment.

Mr. Zhang's article has many other problems. For example, he says:

"No matter who uses them or where, and regardless of the user's national feelings, the information entropy of these scripts remains their information entropy."

He has no idea that, apart from the average information entropy for a whole "nation," each person's language has its own particular information entropy. Someone who, whenever he meets with something unpleasant, always says the same "cha" has a very small average information entropy in his language. With the same character set, a smaller entropy is certainly not more advanced; it is poorer.

Incidentally, the mistake Mr. Zhang makes was already committed more than ten years ago by a school of "famous linguists" in China, and it was sharply criticized at the time. They could not understand the criticism (presumably being insulated from mathematics) and kept silent, so that ten years later their followers are still repeating the mistake. How sad it would be if work on language and script were left to people who are "neither expert nor enthusiastic"!

[Chinese research/zgyj1999/xiamian.htm]
