The distinction between statistics and machine learning has always been fuzzy.

In both industry and academia, many have long believed that machine learning is just statistics dressed up in a glamorous coat.

Artificial intelligence built on machine learning has likewise been called an "extension of statistics."

For example, Nobel laureate Thomas Sargent, speaking at the World Science and Technology Innovation Forum, once said that artificial intelligence is really just statistics dressed up in very gorgeous rhetoric.

Of course, there are dissenting voices. However, the debate between the two camps is filled with seemingly profound but vague arguments, which is genuinely confusing.

Matthew Stewart, a PhD student at Harvard, argues from two perspectives: statistics differs from machine learning, and statistical models differ from machine learning models. Together, these two perspectives show that machine learning and statistics are not synonymous.

The main difference between machine learning and statistics is their purpose

Contrary to what most people think, machine learning has existed for decades. It was gradually set aside only because the computing power of the time could not meet its heavy computational demands. In recent years, however, thanks to the data and computing power brought by the information explosion, machine learning has made a rapid comeback.

Back to the point: if machine learning and statistics were synonymous, why haven't we seen every university's statistics department close down and rebrand itself as a "machine learning" department? Because they are not the same!

I often hear vague statements on this topic, the most common being:

"The main difference between machine learning and statistics is their purpose. Machine learning models are designed to make the most accurate predictions possible. Statistical models are designed to infer the relationships between variables."

Although this is technically correct, it does not give a particularly clear or satisfying answer. One major difference between machine learning and statistics is indeed their purpose.

However, saying that machine learning is about accurate prediction while statistical models are designed for inference is almost meaningless unless you are truly fluent in these concepts.

First, we must understand that statistics and statistical modeling are not the same. Statistics is the mathematical study of data; without data, there is no statistics to be done. A statistical model is a model of the data, used mainly to infer relationships among variables in the data, or to create a model that can predict future values. Usually, the two go hand in hand.

Therefore, we actually need to discuss two questions: first, how does statistics differ from machine learning; and second, how do statistical models differ from machine learning?

To put it bluntly, many statistical models can make predictions, but their predictive accuracy is often unsatisfactory.

Machine learning, by contrast, usually sacrifices interpretability to obtain powerful predictive capability. Moving from linear regression to neural networks, for example, interpretability gets worse while predictive power improves dramatically.

From a macro perspective, this is a good answer, at least good enough for most people. In some cases, however, this statement can lead to misunderstandings about the difference between machine learning and statistical modeling. Let us look at the example of linear regression.

The difference between statistical models and machine learning in linear regression

Perhaps because of the similarity of the methods used in statistical modeling and machine learning, people assume they are the same thing. I can understand that, but it is simply not the case.

The most obvious example is linear regression, which is probably the main source of this misunderstanding. Linear regression is a statistical method: with it, we can train a linear regressor, and we can also fit a statistical regression model by the least-squares method.

As you can see, in the first case, what we do is called "training" the model: only a subset of the data is used, and how well the trained model performs can only be known by testing it on another subset of the data, the test set. In this case, the ultimate goal of machine learning is to achieve the best performance on the test set.
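The machine-learning workflow described above can be sketched in a few lines of NumPy. The data here are synthetic and the numbers are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 2x + 1 plus Gaussian noise
x = rng.uniform(0, 10, 200)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, 200)

# Machine-learning workflow: train on one subset of the data only
train_idx = rng.permutation(200)[:150]
test_mask = np.ones(200, dtype=bool)
test_mask[train_idx] = False

X_train = np.column_stack([x[train_idx], np.ones(len(train_idx))])
coef, *_ = np.linalg.lstsq(X_train, y[train_idx], rcond=None)

# The model is judged purely by its error on the held-out test subset
X_test = np.column_stack([x[test_mask], np.ones(test_mask.sum())])
test_mse = np.mean((X_test @ coef - y[test_mask]) ** 2)
print(f"slope={coef[0]:.2f}, intercept={coef[1]:.2f}, test MSE={test_mse:.3f}")
```

The fitted line itself is the same least-squares line a statistician would compute; what makes this "machine learning" is the train/test protocol and the fact that only the held-out error counts.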

For the latter, we assume in advance that the data follow a linear model with Gaussian noise, and then try to find the line that minimizes the mean squared error over all of the data. No training or test set is required. In many cases, especially in research (such as the sensor example below), the purpose of modeling is to describe the relationship between the data and the output variable, not to predict future data. We call this process statistical inference rather than prediction. Although the model can still be used to make predictions, which may be what you have in mind, the way the model is evaluated is no longer a test set but the significance and robustness of the model parameters.
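The statistical-modeling workflow can be sketched the same way: fit on all of the data, then judge the model by the significance of its parameters rather than by held-out error. Again the data are synthetic and purely illustrative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Assume in advance that the data follow a linear model with Gaussian noise
x = rng.uniform(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, 50)

# Fit by least squares on ALL the data -- no train/test split
n = len(x)
X = np.column_stack([x, np.ones(n)])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Evaluate the model via the significance of its parameters,
# not via prediction error on held-out data
resid = y - X @ beta
sigma2 = resid @ resid / (n - 2)          # residual variance estimate
cov = sigma2 * np.linalg.inv(X.T @ X)     # covariance of the estimates
se_slope = np.sqrt(cov[0, 0])             # standard error of the slope
t_stat = beta[0] / se_slope
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)
print(f"slope={beta[0]:.3f} +/- {se_slope:.3f}, t={t_stat:.1f}, p={p_value:.2g}")
```

A tiny p-value here is evidence that the slope is statistically significant, i.e., that the relationship between x and y is real rather than noise; that, not test-set accuracy, is what validates the model.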

The purpose of machine learning (here, specifically supervised learning) is to obtain a model that predicts well. We usually do not care whether the model is interpretable; machine learning only cares about results. In a company, your value is measured only by your performance. Statistical modeling, by contrast, is about finding relationships between variables and determining the significance of those relationships, though it happens to lend itself to prediction as well.

Let me give an example of my own to illustrate the difference. I am an environmental scientist whose main work involves sensor data. If I try to prove that a sensor responds to a certain stimulus (such as a gas concentration), I will use a statistical model to determine whether the signal response is statistically significant. I will try to understand this relationship and test its repeatability so that I can accurately describe the sensor's response and draw inferences from the data. I might also test: Is the response linear? Is the response attributable to the gas concentration rather than to random noise in the sensor? And so on.

At the same time, I can also take data from 20 different sensors and try to predict the response of another sensor characterized by them. If you don't know much about sensors, this may seem a little strange, but it is currently an important research area in environmental science.

Using a model with 20 different variables to characterize the output of a sensor is plainly prediction, and I do not expect the model to be interpretable. Owing to the nonlinearity of chemical kinetics and the relationships among physical variables and gas concentrations, this model may be quite opaque, as hard to interpret as a neural network. I would like the model to be intelligible, but as long as it predicts accurately, I am quite happy.
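A sketch of this kind of prediction task, with synthetic readings standing in for the 20 sensors (all names and numbers here are hypothetical, not real sensor data):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical setup: readings from 20 sensors are used to predict
# the response of a 21st sensor
n_samples, n_sensors = 500, 20
readings = rng.normal(size=(n_samples, n_sensors))
true_weights = rng.normal(size=n_sensors)   # unknown physical relationship
target = readings @ true_weights + rng.normal(0, 0.1, n_samples)

# Pure prediction: fit on a training split, score on held-out data.
# The 20 fitted weights are not expected to be individually interpretable.
X_train, X_test = readings[:400], readings[400:]
y_train, y_test = target[:400], target[400:]
w, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
test_mse = np.mean((X_test @ w - y_test) ** 2)
print(f"held-out MSE: {test_mse:.4f}")
```

Only the held-out error matters here; whether any individual weight has a physical meaning is beside the point.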

If I try to prove that the relationship between data variables is statistically significant so that I can publish it in a scientific paper, I will use a statistical model rather than machine learning. This is because I care more about the relationships between variables than about making predictions. Prediction may still matter, but the lack of interpretability in most machine learning algorithms makes it difficult to prove relationships in the data.

Obviously, the two approaches differ in their goals, even though they use similar methods to get there. A machine learning algorithm is evaluated by its accuracy on a test set, whereas a statistical model is assessed by analyzing its regression parameters through confidence intervals, significance tests, and other checks of the model's validity. Because these methods can produce the same result, it is easy to see why people assume they are the same.

Machine learning is built on statistics

However, it is unreasonable to conflate these two terms simply because they both make use of the same basic concepts of probability. If we treat machine learning as merely statistics in a glamorous coat, by the same logic we could also say:

Physics is just a better term for mathematics.

Zoology is just a better term for stamp collection.

Architecture is just a better term for sand castle construction.

These statements (especially the last one) are absurd; each conflates two terms that merely rest on similar ideas.

In fact, physics is built on mathematics: it is an application of mathematics to understand physical phenomena in reality. Physics also includes aspects of statistics, and modern statistics is typically built from Zermelo-Fraenkel set theory combined with measure theory to produce probability spaces. The two have much in common because they come from similar origins and apply similar ideas to reach logical conclusions. Similarly, architecture and sand-castle building probably have a lot in common, and even though I am not an architect and cannot articulate the distinction precisely, they are clearly not the same.

Before going further, we need to briefly clarify two other common misconceptions related to machine learning and statistics: artificial intelligence is different from machine learning, and data science is different from statistics. These are relatively uncontroversial points, so they can be dealt with quickly.

Data science is essentially the application of computational and statistical methods to data, whether small or large data sets. It also includes exploratory data analysis: examining and visualizing data to help scientists better understand it and draw inferences from it. Data science also covers things like data wrangling and preprocessing, so it involves a degree of computer science, since it entails coding, building databases, and setting up connections and pipelines between web servers, and so on.

You do not necessarily need a computer to do statistics, but data science is impossible to practice without one. This again shows that although data science uses statistics, the two are not the same concept.

Similarly, machine learning is not artificial intelligence; rather, it is a branch of artificial intelligence. This makes sense, since we "teach" (train) machines to make generalizable predictions about a specific type of data based on past data.