What makes a good feature in machine learning?

Source: Internet
Author: User

Lecture video: What Makes a Good Feature? - Machine Learning Recipes #3
https://www.youtube.com/watch?v=N9fDIAflCMY

A classifier can only perform well if it is given good features. Providing or identifying good features is one of the most important tasks in machine learning.

Suppose you want to classify dogs by breed, distinguishing Greyhounds from Labradors.


We consider two features: height (in inches) and eye color.


Here we assume that dogs of both breeds have only two eye colors: blue and brown.

Let's analyze the height feature first.

Greyhounds are usually taller than Labradors, but the real world is more complicated: the heights of both breeds vary over a range.

We use Python to generate random height data, with an average height of 28 inches for Greyhounds and 24 inches for Labradors, and draw a histogram. Red is Greyhound, blue is Labrador.
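The original plotting code is not reproduced in this article, so the following is only a minimal sketch of how such data could be generated. The two means (28 and 24 inches) come from the text; the sample size, the standard deviation of 4 inches, and the use of a normal distribution are assumptions made for illustration. To stay dependency-free it prints a text histogram instead of drawing a colored one.

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the run is repeatable

N_DOGS = 500   # assumption: number of simulated dogs per breed
SPREAD = 4.0   # assumption: standard deviation of height, in inches

# Mean heights (28 and 24 inches) are taken from the text.
greyhound_heights = [random.gauss(28, SPREAD) for _ in range(N_DOGS)]
labrador_heights = [random.gauss(24, SPREAD) for _ in range(N_DOGS)]


def histogram(heights, bin_width=2):
    """Count heights per bin; each bin is keyed by its lower edge in inches."""
    return Counter(int(h // bin_width) * bin_width for h in heights)


grey_counts = histogram(greyhound_heights)
lab_counts = histogram(labrador_heights)

# Text histogram: G = Greyhounds, L = Labradors, one symbol per 5 dogs.
for edge in sorted(set(grey_counts) | set(lab_counts)):
    g, l = grey_counts[edge], lab_counts[edge]
    print(f"{edge:>3} in | {'G' * (g // 5)}{'L' * (l // 5)}")
```

With a plotting library, the same two lists could be passed to a stacked histogram to get the red and blue picture described above.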


Let's analyze this histogram. First look at the left side, at the short heights. If you had to guess the breed of a dog of that height, you should guess Labrador: at that height there is about an 80% probability that the dog is a Labrador and only about a 20% probability that it is a Greyhound. Now look at the right side, at the tall heights: there the probability of a Greyhound is about 95%, so you should guess Greyhound.

However, in the middle of the histogram the two breeds are about equally likely, so for those heights it is hard to tell them apart.

Therefore, height is a useful feature, but not perfect.
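The reasoning above can be sketched in code by counting, within a height range, what share of simulated dogs belongs to each breed. This is an illustration only: the normal distribution, the standard deviation of 4 inches, the sample size, and the three height ranges are assumptions, so the printed percentages will not match the figures quoted from the video exactly.

```python
import random

random.seed(1)  # fixed seed so the run is repeatable

SPREAD = 4.0    # assumption: standard deviation of height, in inches
N_DOGS = 5000   # assumption: large sample so the estimates are stable

# Each dog is a (breed, height) pair; the means 28 and 24 come from the text.
dogs = [("greyhound", random.gauss(28, SPREAD)) for _ in range(N_DOGS)]
dogs += [("labrador", random.gauss(24, SPREAD)) for _ in range(N_DOGS)]


def share_of_greyhounds(low, high):
    """Fraction of simulated dogs with low <= height < high that are Greyhounds."""
    in_range = [breed for breed, height in dogs if low <= height < high]
    if not in_range:
        return None  # no simulated dog fell into this range
    return in_range.count("greyhound") / len(in_range)


# Hypothetical ranges: one short, one in the middle, one tall.
for low, high in [(18, 20), (25, 27), (33, 35)]:
    share = share_of_greyhounds(low, high)
    print(f"{low}-{high} in: {share:.0%} Greyhound, {1 - share:.0%} Labrador")
```

The short and tall ranges should lean clearly toward one breed, while the middle range should come out close to an even split, which is why height alone cannot settle those cases.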

To work out which features to use, try a thought experiment. Pretend you are the classifier, trying to decide whether a dog is a Greyhound or a Labrador. What else would you want to know? You might ask: How much hair does it have? How fast can it run? How much does it weigh?

How many features to use is more an art than a science. As a rule of thumb, think about how many features you would need to solve the problem yourself; the classifier will probably need about as many.

Now look at another feature, eye color. We assume that both breeds have only two eye colors, blue and brown, and that a dog's eye color is unrelated to its breed.


Its histogram might look like this. The picture tells us nothing, because both breeds are almost equally likely to have either eye color, so eye color is a useless feature here. Adding a useless feature like this can hurt the accuracy of a classifier. Such a feature may appear useful purely by chance in the data, and this is especially likely when you have very little training data.
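A small simulation can illustrate why little training data makes a useless feature look useful. The sketch below assumes eye color is a fair coin flip for every dog, independent of breed, and measures how far apart the blue-eye shares of the two breeds end up by chance. The sample sizes and the number of trials are arbitrary choices made for illustration.

```python
import random

random.seed(2)  # fixed seed so the run is repeatable


def blue_eye_gap(dogs_per_breed):
    """Absolute difference in blue-eye share between two breeds when eye
    color is a fair coin flip, i.e. independent of breed (assumption)."""
    grey_blue = sum(random.random() < 0.5 for _ in range(dogs_per_breed))
    lab_blue = sum(random.random() < 0.5 for _ in range(dogs_per_breed))
    return abs(grey_blue - lab_blue) / dogs_per_breed


def average_gap(dogs_per_breed, trials=200):
    """Mean gap over many simulated datasets of the same size."""
    return sum(blue_eye_gap(dogs_per_breed) for _ in range(trials)) / trials


small = average_gap(10)
large = average_gap(1000)
print(f"average gap with   10 dogs per breed: {small:.3f}")
print(f"average gap with 1000 dogs per breed: {large:.3f}")
```

Even though the feature carries no information, small datasets typically show a noticeable gap between the breeds, and a classifier can latch onto it. The gap shrinks as the dataset grows.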

Moreover, features should be independent of one another, because independent features each give you a different kind of information. For example, if your data already contains height in inches, adding height in centimeters makes no sense, because it provides no new information. You should remove such redundant features: many classifiers cannot tell that two highly correlated features describe the same thing, so they may count that information twice and treat it as more important than it is, which is obviously not what we want.
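One simple way to spot this kind of redundancy is to check the correlation between features. The sketch below is an illustration under assumptions: the heights are simulated, the `unrelated` feature is a made-up column of independent random numbers, and the Pearson correlation is computed by hand to avoid dependencies.

```python
import math
import random

random.seed(3)  # fixed seed so the run is repeatable


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Simulated heights in inches (mean and spread are assumptions).
height_in = [random.gauss(26, 4) for _ in range(200)]
# The same heights in centimeters: a pure unit conversion, no new information.
height_cm = [h * 2.54 for h in height_in]
# Hypothetical feature drawn independently of height.
unrelated = [random.gauss(0, 1) for _ in range(200)]

print(f"inches vs cm:        {pearson(height_in, height_cm):.3f}")
print(f"inches vs unrelated: {pearson(height_in, unrelated):.3f}")
```

A correlation at or very near 1 between two features is a strong hint that one of them can be dropped.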

In addition, we should use features that are easy to understand. For example, suppose we want to predict how many days it takes to send a letter from one city to another. Obviously, the farther apart the two cities are, the more days it takes.


Here, the distance in miles between the cities is a very good feature. A poorer choice is to use the coordinates (latitude and longitude) of the two cities.


From a person's point of view, it is easy to estimate the number of days from the distance in miles, but hard to estimate it from the coordinates alone. And if you use hard-to-understand features such as coordinates, the model will need much more training data than it would with the easy-to-understand feature.
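If only coordinates are available, the simpler feature can be computed from them before training. The article does not say how the distance is obtained; the sketch below uses the haversine great-circle formula as one possible choice, and the function name and example cities with approximate coordinates are chosen here purely for illustration.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # approximate mean radius of the Earth


def miles_between(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees,
    computed with the haversine formula."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


# Approximate (latitude, longitude) of two example cities.
new_york = (40.7128, -74.0060)
los_angeles = (34.0522, -118.2437)

# Four raw coordinate values collapse into one easy-to-understand feature.
distance = miles_between(*new_york, *los_angeles)
print(f"distance feature: {distance:.0f} miles")
```

The model then sees a single number that relates directly to delivery time, instead of having to learn the geometry from four coordinate values.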

To summarize, an ideal feature should be:

1) Informative: it carries information that helps tell the classes apart;

2) Independent: it does not duplicate other features;

3) Simple: it is easy to understand.

