Refer to "Introduction to Machine Learning".
Suppose we have a dataset containing N points. These N points can be labeled as positive and negative in 2^N different ways, so N points define 2^N different learning problems. If for every one of these problems we can find a hypothesis h ∈ H that separates the positive examples from the negative ones, we say that H shatters the N points. That is, any learning problem definable with those N points can be learned without error by a hypothesis drawn from H. The maximum number of points that can be shattered by H is called the Vapnik-Chervonenkis (VC) dimension of H, denoted VC(H), and it measures the learning capacity of the hypothesis class H.
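The definition of shattering can be made concrete with a short brute-force check. The sketch below is my own illustration, not from the book: it enumerates all 2^N labelings of a point set and asks whether some hypothesis in a finite candidate set reproduces each one. The hypothesis class used here, 1-D threshold classifiers h_t(x) = 1 if x ≥ t, and the helper names `shatters` and `thresholds` are assumptions made for the example.

```python
# A minimal sketch (my own, not from the book): brute-force check of the
# shattering definition. The hypothesis class here -- 1-D thresholds
# h_t(x) = 1 if x >= t else 0 -- and the helper names are illustrative.
from itertools import product

def shatters(points, hypotheses):
    """True iff every one of the 2^N labelings of `points` is reproduced
    exactly by at least one hypothesis in `hypotheses`."""
    for labeling in product([0, 1], repeat=len(points)):
        if not any(all(h(x) == y for x, y in zip(points, labeling))
                   for h in hypotheses):
            return False            # this labeling cannot be realized by H
    return True

def thresholds(points):
    # One candidate threshold below all points, plus one just above each point;
    # on a finite sample these candidates cover every distinct threshold behavior.
    cuts = sorted(points)
    candidates = [cuts[0] - 1.0] + [c + 0.5 for c in cuts]
    return [lambda x, t=t: int(x >= t) for t in candidates]

print(shatters([0.0], thresholds([0.0])))            # True:  one point is shattered
print(shatters([0.0, 1.0], thresholds([0.0, 1.0])))  # False: labeling (1, 0) is unrealizable
```

Since no threshold can label the left point positive and the right point negative, two points cannot be shattered, and the VC dimension of threshold classifiers on the line is 1.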
Usually I prefer to use the degrees of freedom (the number of free parameters) as a rough approximation of the learning capacity of a hypothesis class.
In real life, however, the world is smooth: nearby instances usually share the same label, so we do not need to worry about every possible labeling. A dataset with many more than four points can therefore still be learned by a hypothesis class with VC(H) = 4 (for example, axis-aligned rectangles in the plane). Hypothesis classes with small VC dimension thus remain of practical value and are often preferable to those with larger VC dimension.
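As a sanity check on the VC(H) = 4 example, the sketch below (again my own, with hypothetical helper names and an assumed point configuration) relies on the fact that a labeling is realizable by an axis-aligned rectangle exactly when the bounding box of the positive points contains no negative point. It confirms that four suitably placed points are shattered while a five-point set is not.

```python
# A small numerical check (my own sketch) of the VC(H) = 4 claim for
# axis-aligned rectangles: a labeling is realizable by a rectangle exactly
# when the bounding box of the positive points contains no negative point.
from itertools import product

def rectangle_realizes(points, labeling):
    pos = [p for p, y in zip(points, labeling) if y == 1]
    neg = [p for p, y in zip(points, labeling) if y == 0]
    if not pos:                         # an empty rectangle labels everything negative
        return True
    xmin, xmax = min(x for x, _ in pos), max(x for x, _ in pos)
    ymin, ymax = min(y for _, y in pos), max(y for _, y in pos)
    return not any(xmin <= x <= xmax and ymin <= y <= ymax for x, y in neg)

def shattered_by_rectangles(points):
    return all(rectangle_realizes(points, lab)
               for lab in product([0, 1], repeat=len(points)))

four = [(0, 1), (0, -1), (1, 0), (-1, 0)]   # four points arranged in a diamond
five = four + [(0, 0)]                      # add a fifth point (no 5 points can be shattered)
print(shattered_by_rectangles(four))        # True:  VC dimension is at least 4
print(shattered_by_rectangles(five))        # False: this 5-point set is not shattered
```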
What is the VC dimension?