Text, numbers, language, information

Numbers, words, and natural languages are all carriers of information; they were created to **record and disseminate information**.

Yet for a long time the connection between mathematics and linguistics seemed slight: mathematics was applied mainly to astronomy and mechanics.

In this chapter, we review the development of the information age and see how linguistics gradually became linked with mathematics.

Information

At the very beginning, humans used **sound** to disseminate information.

In principle, the generation, transmission, reception, and feedback of that information are no different from today's most advanced communication systems.

Because early humans had little information to convey, they did not need language or writing.

But once humans progressed to a certain point, they needed language.

So our ancestors abstracted the common factors of what they described, such as objects, quantities, and actions, and these formed today's vocabulary.

The creation of writing and numbers

As humans developed and language and vocabulary grew, the brain could no longer remember everything. At that point, **writing** was needed to record information.

The advantage of writing is that the transmission of information can span time and space: two people no longer need to meet at the same time and place to exchange information.

So how was writing created? The most straightforward way is to imitate the shape of the object being described; writing of this kind is called hieroglyphic (pictographic) writing.

Clustering of text

Early on, the number of hieroglyphs was correlated with the amount of information a civilization recorded: more hieroglyphs meant the civilization carried more information.

But as the amount of information grew, no one could learn and memorize so many characters, so they had to be **summarized and categorized**: a single character came to express several identical or similar meanings.

For example, "Day" is meant to be the sun, but it can also be the day we speak.

This clustering of concepts is similar to **clustering** in today's natural language processing and machine learning, except that in ancient times it may have taken thousands of years, whereas now it takes only hours.

But when characters are clustered by meaning, **ambiguity** always remains: it is unclear which meaning a character takes in a given environment.

The solution **relies on context**. Most disambiguation succeeds this way, though individual cases always fail. **A probabilistic model of the context does better, but it too fails at times.**

Translation

Because of geography, different civilizations generally developed different scripts and languages; when two civilizations came into contact, the need for translation arose.

**Translation is possible for one reason: different writing systems are equivalent in their capacity to record information.** Writing is only the carrier of information, not the information itself; the same information can even be carried by numbers.

Today we know more about ancient Egypt than about the Mayan civilization, thanks to the Egyptians recording the most important information of their lives. The guiding significance for us:

Keep three copies of the same information; as long as one copy survives, the original information is not lost. This is instructive for channel coding.

`语料` (corpora), the data of a language, are crucial.

The generation of numbers

Writing appeared only when the mind could no longer hold all the information; numbers appeared when people's property grew to the point that it had to be counted to be known.

The earliest numbers were not written down; people simply counted on their fingers, which is why we use the **decimal** system.

Gradually, our ancestors found that ten fingers were not enough. The simplest remedy, counting toes as well, does not solve the underlying problem. So they invented the **carry system**: every ten units carry over as one.

So why do most existing civilizations use base 10 rather than base 20?

Compared with base 10, base 20 is far less convenient: decimal requires memorizing only the 9×9 multiplication table, whereas base 20 would require memorizing a table as large as a 19×19 Go board.

To represent numbers of different magnitudes, both the Chinese and the Romans used explicit words for different orders of magnitude.

The Chinese used characters for ten, hundred, thousand, and so on up to the trillions; the Romans used I for 1, V for 5, and so on.

Both representations unconsciously introduced the concept of naive coding: different symbols represent different numerical concepts.

Rules for **decoding**: the Chinese decoding rule is multiplication, e.g. the notation for 2 million means 2 × 100 × 10000, while the Roman rule is addition and subtraction: a smaller numeral to the left of a larger one is subtracted, and to the right it is added, e.g. IV means 5 − 1 = 4 and VII means 5 + 2 = 7. For large numbers this rule becomes quite complex and hard to describe.

In terms of coding effectiveness, the Chinese system is the more skillful.

The most effective description of numbers came from the ancient Indians, who invented the ten Arabic numerals. These are more abstract than the Chinese and Roman systems, and they mark the separation of numbers from words. Objectively, this let natural language and mathematics develop along separate tracks for thousands of years.

The mathematical shortest-coding principle behind writing and language

The move from hieroglyphic to phonetic writing was a leap: in describing objects, humans went from imitating appearance to abstract concepts, and in doing so unconsciously adopted information encoding.

Not only that: in Roman-alphabet writing systems, common words are short and uncommon words are long; in ideographic scripts such as Chinese, common characters have few strokes and uncommon ones have many. Both conform to the **shortest-coding principle of information theory.**
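The shortest-coding principle (frequent symbols get short codes, rare symbols get long ones) is exactly what Huffman coding later formalized. A minimal sketch, not from the book, of building such a code in Python:

```python
import heapq
from collections import Counter

def huffman_code(text: str) -> dict:
    """Build a Huffman code: frequent symbols get short codewords,
    rare symbols get long ones -- the shortest-coding principle."""
    freq = Counter(text)
    # Each heap entry: (frequency, tiebreak id, {symbol: code-so-far}).
    heap = [(f, i, {ch: ""}) for i, (ch, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, lo = heapq.heappop(heap)   # two least frequent subtrees
        f2, _, hi = heapq.heappop(heap)
        # Prefix '0' to one subtree's codes and '1' to the other's.
        merged = {ch: "0" + c for ch, c in lo.items()}
        merged.update({ch: "1" + c for ch, c in hi.items()})
        heapq.heappush(heap, (f1 + f2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

code = huffman_code("aaaabbc")
# 'a', the most frequent symbol, gets the shortest codeword.
print(code)
```

Common characters having few strokes and common words being short are the natural-language analogues of the short codewords this algorithm assigns.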

The lineage of the Romance language system:

```
st=>start: Cuneiform
op1=>operation: Syriac
op2=>operation: Ancient Greek
op3=>operation: Romans and Macedonians
en=>end: Romance languages
st->op1->op2->op3->en
```

Before the invention of paper, writing was laborious, so brevity was essential. That is why ancient classical Chinese is extremely concise but hard to understand, while the spoken language of the time differed little from today's. This parallels some principles of modern information science.

In communication, if the channel is wide, the information can be transmitted directly without compressing it;

If the channel is narrow, the information needs to be compressed as much as possible before it is delivered, and then decompressed at the receiving end.

This is exactly how today's Internet and mobile-Internet pages are designed.

On broadband, pages are designed to be relatively rich; on mobile terminals, because the wireless channel's bandwidth is low and transmission slow, pages are smaller and lower-resolution.

Checksums

The Bible records the story of the Jewish people's ancestors from Genesis onward. Its writing spanned many centuries and many scribes, so copying errors were inevitable.

To avoid errors, Jewish scribes invented a method similar to a **parity code**: each Hebrew letter corresponds to a number, and the letters of each line add up to a special number, the line's **checksum**.

After copying a page, the scribe added up the letters of each line to see whether each checksum matched the original.
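A minimal sketch of the idea. The letter values here are illustrative (a..z mapped to 1..26), not the actual Hebrew numeral values:

```python
def line_checksum(line: str) -> int:
    """Assign each letter a number and sum the line, as the scribes did.
    (Values here are a..z = 1..26, illustrative, not real Hebrew numerals.)"""
    return sum(ord(ch) - ord("a") + 1 for ch in line.lower() if ch.isalpha())

original = ["in the beginning", "god created"]
copy = ["in the beginning", "gad created"]  # 'o' miscopied as 'a'

# Compare each line's checksum against the original to locate the error.
for line_orig, line_copy in zip(original, copy):
    ok = line_checksum(line_orig) == line_checksum(line_copy)
    print(f"{line_copy!r}: {'ok' if ok else 'copying error!'}")
```

Like a parity bit, a line sum detects most single-character errors but not all (e.g. two letters transposed within a line leave the sum unchanged).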

Grammar

If word formation from letters is the encoding rule for words, then grammar is the encoding and decoding rule of language.

By comparison, **the set of words is finite and closed, while language is infinite and open.** Mathematically speaking, the former has complete encoding and decoding rules, while language does not: there are places no grammar rules cover, and these give rise to "ungrammatical sentences."

So which is right, the language or the grammar? Some insist on starting from the actual corpus, others from the rules.

Summary

This chapter described the history of words, numbers, and languages to help readers feel the inner connection between language and mathematics. The following concepts were mentioned:

- **The principle of communication and the model of information dissemination**
- (Source) encoding and shortest coding: classical Chinese
- Decoding rules: grammar
- Clustering: one word with multiple meanings
- Check digits: each Hebrew letter corresponds to a number
- Bilingual texts, corpora, and machine translation: different carriers of the same information
- Ambiguity and contextual disambiguation: probabilities

Natural language processing: from rules to statistics

In the previous chapter, we said that the purpose of language is human communication, while letters, words, and numbers are really different units of **information encoding**.

Any language is an encoding scheme, and the grammar rules of a language are the decoding algorithm. We organize what we want to express through language, which encodes it once; if the other person knows the language, they can decode it using the language's decoding method.

So can a machine read and understand natural language? Of course it can.

Machine Intelligence

The development of natural language processing can be divided into two stages:

From the 1950s to the 1970s, scientists' understanding was confined to the way humans learn language, i.e. having computers simulate the human brain. The results were essentially nil.

In the second phase, entered in the 1970s, methods **based on mathematical models and statistics** made substantial breakthroughs.

In the 1950s, academia's view of AI and natural language understanding was this: for machines to do speech recognition, the computer must understand natural language, because that is how humans do it. This methodology is nicknamed "the birds school of flying": build airplanes by watching how birds fly. In fact, people invented the airplane through aerodynamics, not bionics.

So how can you understand natural language?

In general, understanding requires:

Parsing the sentence, i.e. analyzing its syntax. Grammar rules of this kind are relatively easy to describe to a computer.

Obtaining the semantics. Semantics are much harder than syntax to express in a form a computer can use.

Consider a simple sentence:

Xu Zhimo likes Lin Huiyin (徐志摩喜欢林徽因).

This sentence can be divided into three parts: subject, predicate (verb phrase), and period. Each part can be analyzed further, yielding the following syntactic parse tree.

The grammar rules employed in parsing are called **rewrite rules**.

But this method soon ran into trouble. As can be seen, a **short sentence already yields a rather complex two-dimensional tree structure**; handling a real-world sentence this way is very cumbersome.
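A toy illustration of rewrite rules and the tree they produce for the example sentence. The rule set is a hypothetical miniature, not a real grammar:

```python
# Toy rewrite rules for "徐志摩 喜欢 林徽因" (Xu Zhimo likes Lin Huiyin).
# Rule names and coverage are illustrative only.
RULES = {
    "S":  [["NP", "VP"]],            # sentence -> subject + predicate
    "VP": [["V", "NP"]],             # predicate -> verb + object
    "NP": [["徐志摩"], ["林徽因"]],   # noun phrases (here: proper names)
    "V":  [["喜欢"]],                # verb
}

def parse(symbol, words, start):
    """Naive top-down parse: return (tree, next_position) or None."""
    if symbol not in RULES:  # terminal: must match the next word
        if start < len(words) and words[start] == symbol:
            return symbol, start + 1
        return None
    for expansion in RULES[symbol]:
        children, pos = [], start
        for sub in expansion:
            result = parse(sub, words, pos)
            if result is None:
                break
            tree, pos = result
            children.append(tree)
        else:  # every sub-symbol matched
            return (symbol, children), pos
    return None

tree, end = parse("S", "徐志摩 喜欢 林徽因".split(), 0)
print(tree)
# ('S', [('NP', ['徐志摩']), ('VP', [('V', ['喜欢']), ('NP', ['林徽因'])])])
```

Even for this three-word sentence, four rules and a backtracking search are needed; the text's point is that scaling this approach to real sentences blows up the rule count and the search.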

There are two main snags:

- To cover even 20% of real sentences with grammar rules, at least tens of thousands of rules are needed.
- These rules even contradict one another, so each needs conditions describing when it applies. To cover more than 50% of sentences, the rule set grows so fast that each newly added sentence requires adding new grammar rules.

This is actually easy to appreciate: no matter how good one's English results in middle school or university, one may still not do well on the GRE, because ten years of learned English grammar cannot cover all of English.

Even if all the grammar could be covered, it is still hard for a computer to parse a complex sentence. Moreover, the meaning of natural language depends on context, i.e. natural language is a context-dependent grammar, so the computation required for parsing is enormous.

In short, analyzing sentences by the road of grammar rules is not reliable.

From rules to statistics

As discussed above, rule-based parsing is unwieldy for semantic processing, because the ambiguity of words in natural language is hard to describe with rules; it depends instead on **context**.

For example "The box is in the pen." Because here pen is the meaning of the fence. The whole sentence translated into Chinese is "box in the fence". This pen refers to pen or fence, through the context has not been solved, need common sense

Since 1970, statistical linguistics gave natural language processing new life; the key figures were Frederick Jelinek and his group at IBM's Watson lab. At the very beginning, using statistical methods, they raised speech-recognition accuracy from 70% to 90%, while the scale of speech recognition rose from a few hundred words to tens of thousands of words.

Summary

Statistically based natural language processing shares its mathematical model with communication. So, in the mathematical sense, natural language processing is linked back to language's original purpose: **communication**.

Statistical language model

In the previous chapters, we have emphasized that from its inception, natural language evolved into a context-dependent way of expressing and transmitting information.

So for machines to handle natural language, **the key is** a mathematical model for the **context-dependent nature** of language. That model is the **statistical language model**.

This model is widely used in machine translation, speech recognition, print recognition, spelling correction, Chinese character input, and literature search.

Using mathematical methods to describe the laws of language

An important problem speech recognition must solve is whether a word sequence produced by the computer is one a human can understand. Before the 1970s, people tried to solve this with semantic analysis.

Jelinek looked at the problem from another angle, and a simple statistical model took care of it.

Namely: to judge whether a sentence is reasonable, just look at the size of its **likelihood**.

For example, if a fluent sentence appears with probability $10^{-20}$ while a garbled one appears with probability $10^{-70}$, the fluent sentence is overwhelmingly more likely.

Suppose $S$ represents a meaningful sentence, consisting of a sequence of words ${w_1},{w_2}, \cdots ,{w_n}$, where $n$ is the length of the sentence. We want the probability that this sentence appears:

$$P\left( S \right) = P\left( {{w_1},{w_2}, \cdots ,{w_n}} \right)$$

Using the conditional probability formula, the probability of the sequence $S$ equals the product of each word's conditional probability:

$$P\left( {{w_1},{w_2}, \cdots ,{w_n}} \right) = P\left( {{w_1}} \right)P\left( {{w_2}|{w_1}} \right) \cdots P\left( {{w_n}|{{w_1},{w_2}, \cdots ,{w_{n - 1}}}} \right)$$

$P\left( {{w_n}|{{w_1},{w_2}, \cdots ,{w_{n - 1}}}} \right)$ means that the probability of the word $w_n$ appearing depends on all the words before it.

The question is, how is this conditional probability calculated?

In the early 20th century, the Russian mathematician Markov gave an effective workaround for this kind of situation: assume that the probability of any word $w_i$ appearing depends only on the previous word $w_{i-1}$ and is unrelated to the other words. This is called the **Markov assumption**.

The formula then becomes:

$$P\left( {{w_1},{w_2}, \cdots ,{w_n}} \right) = P\left( {{w_1}} \right)P\left( {{w_2}|{w_1}} \right) \cdots P\left( {{w_n}|{w_{n - 1}}} \right)$$

This is called the **bigram model**.

If a word is assumed to be determined by the preceding $n-1$ words, the corresponding model is called an $n$-gram model; it is more complex.

Similarly, how do we estimate the conditional probability $P\left( {{w_i}|{w_{i - 1}}} \right)$? Start from its definition:

$$P\left( {{w_i}|{w_{i - 1}}} \right) = \frac{{P\left( {{w_{i - 1}},{w_i}} \right)}}{{P\left( {{w_{i - 1}}} \right)}}$$

What needs to be estimated:

the **joint probability** ${P\left( {{w_{i - 1}},{w_i}} \right)}$: the probability that the two words appear consecutively;

and the **marginal probability** ${P\left( {{w_{i - 1}}} \right)}$.

**So how are these two probabilities obtained?**

With a large corpus, just count how many times ${{w_{i-1}},{w_i}}$ appear adjacently in the text, ${\# \left( {{w_{i - 1}},{w_i}} \right)}$, then divide by the corpus size $\#$. The **relative frequency** can then be used to estimate the probability.

According to the **law of large numbers**, as long as there are enough observations, the relative frequency approaches the probability.

$$P\left( {{w_i}|{w_{i - 1}}} \right) = \frac{{\# \left( {{w_{i - 1}},{w_i}} \right)}}{{\# \left( {{w_{i - 1}}} \right)}}$$
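A minimal sketch of estimating bigram probabilities by counting adjacent word pairs, as described above. The toy corpus is illustrative only:

```python
from collections import Counter

def train_bigram(corpus):
    """Estimate P(w_i | w_{i-1}) = #(w_{i-1}, w_i) / #(w_{i-1})
    by counting adjacent word pairs in a corpus."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        words = sentence.split()
        unigrams.update(words)                 # #(w) for each word
        bigrams.update(zip(words, words[1:]))  # #(w_{i-1}, w_i) for each pair
    return lambda prev, w: bigrams[(prev, w)] / unigrams[prev]

corpus = ["the cat sat", "the cat ran", "the dog sat"]
p = train_bigram(corpus)
print(p("the", "cat"))  # "the cat" appears 2 times, "the" appears 3 times -> 2/3
print(p("cat", "sat"))  # 1/2
```

A real system would additionally handle sentence boundaries and, as discussed in the smoothing section below, the pairs that never occur in the corpus.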

A model this simple can solve the complex problems of speech recognition and machine translation.

Engineering know-how of statistical language models: higher-order language models

The most important feature of the bigram model is that each word depends only on the previous word. This is oversimplified; more generally, a word is related to several preceding words.

So an $n$-gram model assumes that the current word $w_i$ depends only on the preceding $n-1$ words; this is the `N-1阶马尔科夫假设` ((N-1)-order Markov assumption).

In practice, trigram models ($n = 3$) are used far more than others; higher orders are rarely used, because:

as the order of the model grows, its complexity grows rapidly.

The size (space complexity) of an $n$-gram model is nearly exponential in $n$, roughly $O(|V|^n)$ for vocabulary size $|V|$, so $n$ cannot be too large. Going from $n = 1$ to 2, and then from 2 to 3, the model's effectiveness rises significantly; going from 3 to 4, the improvement is slight while the resource cost grows a great deal. So few people use models beyond 4-grams.

Even raising the order cannot cover all linguistic phenomena. For example, dependencies that reach from one paragraph to another are out of reach no matter how high the order; resolving them requires other methods for **long-distance dependency** (Long Distance Dependency).

Model training, the zero-probability problem, and smoothing methods

Using a language model requires knowing all the **conditional probabilities** in the model; these are called `模型的参数` (the model's parameters).

The process of obtaining these parameters by counting over a corpus is called `模型的训练` (training the model).

As said before, we need only count the co-occurrences of the adjacent pair and the occurrences of ${w_{i-1}}$ alone, then take the ratio.

But there is a case we have not considered: what if the two adjacent words never co-occur, i.e. $\# \left( {{w_{i - 1}},{w_i}} \right) = 0$? Is the probability simply 0?

Of course not; this involves the **reliability** of the statistics.

In mathematical statistics, we dare to use data to estimate probabilities because of the **law of large numbers**, which requires sufficiently many observations. If the sample is too small, using frequencies to predict probabilities is certainly unreliable.

So how do you train a language model correctly?

The direct approach is to **increase the amount of data**. But zero-probability problems still remain; this is called `“不平滑”` (unsmoothed).

For unseen events, we cannot assume that the probability of occurrence is zero; instead, a small share of the total probability mass is allocated to them.

In this way, the probabilities of the seen events must sum to less than 1, so the probabilities of all seen events are discounted downward. As for how much, the principle is: **the less credible the statistic, the more it is discounted.**

Below, the statistics of word probabilities in a dictionary make this concrete.

Assume that $n_r$ distinct words each appear exactly $r$ times in the corpus, and let $N$ denote the corpus size. Then:

$$N = \sum\limits_{r = 1}^\infty {r{n_r}}$$

In other words, the corpus size equals the sum, over all $r$, of $r$ times the number of words that appear $r$ times.

When $r$ is small, the word has not been seen often enough, so a smaller effective count, $d_r$, is used to compute its probability:

$${d_r} = \left( {r + 1} \right)\frac{{{n_{r + 1}}}}{{{n_r}}}$$

and these discounted counts satisfy

$$\sum\limits_r {{d_r}{n_r}} = N$$

Generally speaking, **more words appear exactly once than appear twice, and more appear twice than three times**, and so on.

That is, the larger $r$ is, the smaller $n_r$ is, so ${n_{r + 1}} < {n_r}$. From this it follows that ${d_r} < r$, which is exactly the smaller count we are looking for; and for words that never occur, ${d_0} > 0$.
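A minimal sketch of computing the Good-Turing discounted counts $d_r$ on a Zipf-like toy corpus. It is illustrative only; real implementations also smooth the $n_r$ counts themselves before applying the formula:

```python
from collections import Counter

def good_turing(words):
    """Good-Turing discounting sketch: d_r = (r + 1) * n_{r+1} / n_r,
    where n_r is the number of distinct words seen exactly r times."""
    freq = Counter(words)           # word -> r
    n = Counter(freq.values())      # r -> n_r
    big_n = len(words)              # corpus size N = sum over r of r * n_r
    d = {}
    for r in sorted(n):
        if n[r + 1] > 0:
            d[r] = (r + 1) * n[r + 1] / n[r]
        else:
            d[r] = float(r)         # no n_{r+1} observed: leave count as-is
    unseen_mass = n[1] / big_n      # probability mass reserved for unseen words
    return d, unseen_mass

# Zipf-like toy corpus: many words occur once, few occur often
# (n_1 = 5, n_2 = 2, n_3 = 1, N = 12).
words = ("the " * 3 + "of " * 2 + "cat " * 2 + "a b c d e").split()
d, unseen = good_turing(words)
print(d)       # d_1 = 0.8 < 1 and d_2 = 1.5 < 2: low counts are discounted
print(unseen)  # > 0: some probability is set aside for unseen words
```

Because the toy counts follow the Zipf-like pattern described in the text ($n_{r+1} < n_r$), each discounted count $d_r$ comes out below the raw count $r$, and the leftover mass goes to unseen words.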

Thus:

the probability estimate of a word whose frequency exceeds a certain threshold is its relative frequency in the corpus;

for a word whose frequency falls below the threshold, the probability estimate is smaller than its relative frequency.

For the bigram model:

$$P\left( {{w_i}|{w_{i - 1}}} \right) = \begin{cases}
f\left( {{w_i}|{w_{i - 1}}} \right) & \text{if } \# \left( {{w_{i - 1}},{w_i}} \right) \ge T\\
{f_{GT}}\left( {{w_i}|{w_{i - 1}}} \right) & \text{if } 0 < \# \left( {{w_{i - 1}},{w_i}} \right) < T\\
Q\left( {{w_{i - 1}}} \right) \cdot f\left( {{w_i}} \right) & \text{otherwise}
\end{cases}$$

where:

$T$ is a threshold, generally around 8~10;

$f_{GT}\left( \cdot \right)$ denotes the relative frequency after Good-Turing smoothing;

and $Q\left( {w_{i-1}} \right)$ is a normalization factor that guarantees all the conditional probabilities add up to 1.

This smoothing method was first proposed by Katz of IBM, and is therefore called the **Katz backoff method**.

Another method is **deleted interpolation**: smooth by linearly interpolating the low-order and high-order models. Because its effect is slightly worse than Katz backoff, it is rarely used.

On the selection of corpora

Another important issue in model training is the **training data**, i.e. the selection of the corpus. If the training corpus is divorced from the domain where the model will be applied, the model's effectiveness is greatly compromised.

For example, when building a language model whose application is web search, the training data should be **messy web-page data** and user search queries, rather than polished, normative press releases, even though the former is mixed with noise and errors. Because the training data and the application are consistent, the search quality turns out better.

Training data is usually the more the better. Because higher-order models have many more parameters, they need much more training data. Unfortunately, not every application has enough, e.g. bilingual corpora for machine translation; in such cases, pursuing a high-order large model is pointless.

Even when the training data and the application data are consistent and the training volume is large enough, noise in the training data still affects the model. So preprocessing is needed before training: noise that is regular and easy to find, such as tabs in web text, should be removed.

The evolution of Chinese word segmentation

Western alphabetic scripts have explicit delimiters (spaces) between words, but Chinese has none. So a Chinese sentence must first be segmented into words.

The easiest method to think of is **dictionary** lookup: scan the sentence from left to right and mark off the words found in the dictionary.

But this approach runs into a complication: **ambiguous** segmentations. Take 发展中国家 ("developing countries"): the correct segmentation is 发展 / 中 / 国家 ("developing / countries"), but a left-to-right dictionary match splits it as 发展 / 中国 / 家 ("develop / China / home").

Likewise, we can use the **statistical language model** to resolve the ambiguity of word segmentation.

Suppose a sentence $S$ has several possible segmentations:

$$\begin{array}{l}
{A_1},{A_2},{A_3}, \cdots ,{A_k}\\
{B_1},{B_2},{B_3}, \cdots ,{B_m}\\
{C_1},{C_2},{C_3}, \cdots ,{C_n}
\end{array}$$

**The best segmentation** is the one under which the sentence has the highest probability of appearing.

Of course, exhaustively enumerating all segmentations and computing the sentence's probability under each would require an enormous amount of computation.

Instead, this can be viewed as a **dynamic programming** problem and solved with the **Viterbi algorithm**, which finds the best segmentation quickly.
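A minimal dynamic-programming sketch of segmenting by maximum sentence probability, using the 发展中国家 example. The unigram probabilities are made-up, and a real system would run Viterbi over the bigram model rather than independent word probabilities:

```python
import math

# Toy unigram probabilities (illustrative values, not real statistics).
P = {"发展": 0.01, "中": 0.005, "国家": 0.01,
     "中国": 0.01, "家": 0.001, "发": 0.001, "展": 0.001}

def segment(sentence):
    """Dynamic programming over split points: best[i] is the highest
    log-probability of any segmentation of sentence[:i]. Assumes every
    prefix is segmentable with words in P."""
    n = len(sentence)
    best = [0.0] + [-math.inf] * n
    back = [0] * (n + 1)  # back[i]: start index of the last word in best split
    for i in range(1, n + 1):
        for j in range(max(0, i - 4), i):  # maximum word length 4
            word = sentence[j:i]
            if word in P and best[j] + math.log(P[word]) > best[i]:
                best[i] = best[j] + math.log(P[word])
                back[i] = j
    # Recover the segmentation by walking the backpointers.
    words, i = [], n
    while i > 0:
        words.append(sentence[back[i]:i])
        i = back[i]
    return words[::-1]

print(segment("发展中国家"))  # ['发展', '中', '国家']
```

With these probabilities, P(发展)·P(中)·P(国家) = 5×10⁻⁷ beats P(发展)·P(中国)·P(家) = 1×10⁻⁷, so the statistically preferred split overrides the greedy left-to-right dictionary match.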

Linguists do not define words identically: take 北京大学 (Peking University), which some consider one word and others two. The compromise is to treat it first as a four-character word, and then find the nested words 北京 (Beijing) and 大学 (university) inside it.

Engineering details

The main cause of inconsistency in human segmentation lies in people's understanding of word **granularity**.

For example, "Tsinghua University", some people think it is a whole, some people think that "Tsinghua" is to modify the "university". There is no need to emphasize who is right, but to know that in different applications there is a better particle size than the other.

For example, in machine translation, large granularity translates better: if 联想公司 (Lenovo) is split apart, "Lenovo" becomes hard to translate. In web search, small granularity works better than large: a user querying 清华 (Tsinghua) rather than 清华大学 (Tsinghua University) should still find Tsinghua University's homepage.

Building a different segmenter for each application would be wasteful. Better to let **one segmenter support different granularities of segmentation at the same time.**

First, a basic glossary and a compound-word table are needed:

The basic glossary contains words that cannot be divided further, such as 清华 (Tsinghua), 大学 (university), and 贾里尼克 (Jelinek).

The compound-word table contains compound words and the basic words they consist of, e.g. 清华大学: 清华-大学.

The next step is to build a language model over each: $L_1$ from the **basic glossary** and $L_2$ from the **compound-word table**.

Using the basic glossary and $L_1$, we obtain the fine-grained segmentation result. Generally speaking, basic words are relatively stable; new ones are added only occasionally.

Finally, use the compound words and $L_2$ for a second segmentation pass: the input is the **basic-word string**, and the output is a string of compound words.

In other words, first segment the sentence into basic words, then segment the basic-word string with the compound-word model.
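A minimal sketch of the two-pass scheme. The glossary, compound table, and greedy matching here are illustrative stand-ins for the dictionary-plus-language-model segmentation described above:

```python
# Two-level segmentation sketch: segment by basic words first, then merge
# basic-word strings into compounds. Tables are illustrative.
BASIC = {"清华", "大学", "北京", "喜欢"}
COMPOUND = {("清华", "大学"): "清华大学", ("北京", "大学"): "北京大学"}

def segment_basic(sentence):
    """Greedy longest-match against the basic glossary (a stand-in for
    the model-L1 segmentation described above)."""
    words, i = [], 0
    while i < len(sentence):
        for length in (2, 1):  # try the longest match first
            if sentence[i:i + length] in BASIC:
                words.append(sentence[i:i + length])
                i += length
                break
        else:
            words.append(sentence[i])  # unknown character kept as a word
            i += 1
    return words

def merge_compounds(words):
    """Second pass: replace adjacent basic words with their compound."""
    out, i = [], 0
    while i < len(words):
        if i + 1 < len(words) and (words[i], words[i + 1]) in COMPOUND:
            out.append(COMPOUND[(words[i], words[i + 1])])
            i += 2
        else:
            out.append(words[i])
            i += 1
    return out

basic = segment_basic("清华大学")  # fine granularity: ['清华', '大学']
print(basic)
print(merge_compounds(basic))     # coarse granularity: ['清华大学']
```

Both granularities come from one pass over one pair of tables: an application wanting fine-grained words (web search) uses the first output, while one wanting coarse-grained words (machine translation) uses the second.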

"The beauty of Mathematics notes" Natural Language Processing Section (i). MD