Very Brief Introduction to Machine Learning for AI

Source: Internet
Author: User

I recently started learning about machine learning and found that this comprehensive article has been cited and recommended many times. As the original poster, I am not fluent in English and find it more comfortable to read a translation into something familiar. The translation is rough and has not been carefully proofread. In general it should be okay, but I am still not familiar with much of the specialized vocabulary, and Google did not give good results, so the probability of errors is fairly high...

I am not very clear about copyright issues. If you have any questions, please let me know.

Follow the original link to open the source article.

==========================================================================================

A Brief Introduction to Machine Learning for AI

All the points summarized here are included in these slides.

 

Intelligence

The concept of intelligence can be defined in many ways. Here we define intelligence as the ability to make correct decisions according to some criterion (for example, survival and reproduction for animals). Making better decisions requires knowledge in an operational form, that is, knowledge that can be used to interpret sensory data and to make decisions from that information.

 

Artificial Intelligence

Computers already possess some intelligence, thanks to all the carefully crafted programs that allow them to do things we consider useful (which is essentially what we mean by making the right decisions).

However, at the beginning of the 21st century there are still tasks that are very simple for humans and animals but remain beyond the capabilities of computers.

Many of these problems are classified as artificial intelligence, including many cognitive and control tasks.

Why can't we solve such problems by programming? I think this is mainly because, even though our brains can perform these tasks, we do not yet have an explicit understanding of how they do so.

Performing such tasks involves knowledge that is currently implicit, but we do have data and examples for those tasks (for example, observations of a human responding to a specific request or input).

How can we make machines acquire that kind of intelligence? Learning is the process of using data and examples to build operational knowledge.

 

Machine Learning

Machine learning has a long history, and many textbooks cover its main principles well.

Among recent textbooks, I suggest:

  • Chris Bishop, "Pattern Recognition and Machine Learning", 2007
  • Simon Haykin, "Neural Networks: A Comprehensive Foundation", 2009 (3rd edition)
  • Richard O. Duda, Peter E. Hart and David G. Stork, "Pattern Classification", 2001 (2nd edition)

 

Here we focus on the concepts that are most relevant to this course.

 

Formalization

First, let us formalize the most common mathematical framework for learning. We are given a set of training examples D = {z_1, z_2, ..., z_n}, where each z_i is an example sampled from an unknown process P(z). We are also given a loss function L, which takes as arguments a decision function f and an example z and returns a real-valued scalar. We want to minimize the expected value of L(f, Z) under the unknown generating process P(Z).
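
To make this framework concrete, here is a minimal sketch (mine, not from the article) that approximates the expected loss E[L(f, Z)] by the average loss over the training examples D, i.e. the empirical risk; the loss and the candidate decision function are purely illustrative.

```python
import numpy as np

def empirical_risk(f, D, loss):
    """Average of loss(f, z) over the training examples in D,
    used as a stand-in for the intractable expectation E[L(f, Z)]."""
    return float(np.mean([loss(f, z) for z in D]))

# Illustrative loss and decision function (both hypothetical):
def squared_error(f, z):
    x, y = z
    return (f(x) - y) ** 2

D = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]    # toy (input, output) examples
f = lambda x: 2.0 * x                        # a candidate decision function
print(empirical_risk(f, D, squared_error))   # estimate of E[L(f, Z)]
```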

Supervised Learning

In supervised learning, each example is an (input, target) pair: z = (x, y), and f takes x as its argument. The most common cases are:

Regression: y is a real-valued vector, and the output of f lies in the same space as y. We often take as loss function the squared error:

L(f, (x, y)) = || f(x) - y ||^2

Classification: y is a finite integer (a symbol or class label) corresponding to the index of a class. We often take as loss function the negative conditional log-likelihood, with the interpretation that f_i(x) estimates P(Y = i | x):

L(f, (x, y)) = -log f_y(x)

subject to the constraints f_i(x) >= 0 and sum_i f_i(x) = 1.
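
As an illustration of these two losses, here is a hedged sketch; the decision functions f_reg and f_clf are hypothetical stand-ins, and the classification function is assumed to return a probability vector satisfying the constraints above.

```python
import numpy as np

def squared_error_loss(f, x, y):
    """Regression loss: L(f, (x, y)) = ||f(x) - y||^2."""
    diff = np.asarray(f(x), dtype=float) - np.asarray(y, dtype=float)
    return float(np.dot(diff, diff))

def neg_log_likelihood_loss(f, x, y):
    """Classification loss: L(f, (x, y)) = -log f_y(x), where f(x) is a
    probability vector (non-negative entries summing to 1)."""
    probs = np.asarray(f(x), dtype=float)
    assert np.all(probs >= 0) and np.isclose(probs.sum(), 1.0)
    return float(-np.log(probs[y]))

# Toy usage with hypothetical decision functions:
f_reg = lambda x: np.array([2.0 * x[0], x[1] + 1.0])
print(squared_error_loss(f_reg, [1.0, 2.0], [2.5, 2.8]))   # 0.29

f_clf = lambda x: np.array([0.1, 0.7, 0.2])   # estimates of P(Y = i | x)
print(neg_log_likelihood_loss(f_clf, [1.0, 2.0], 1))       # -log(0.7)
```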

Unsupervised Learning

In unsupervised learning, we learn a function f that helps characterize an unknown distribution P(z). Sometimes f is directly an estimate of P(z) itself (this is called density estimation). In many other cases, f attempts to characterize where the density concentrates. Clustering algorithms divide the input space into regions (often centered around a prototype example or centroid). Some clustering algorithms create a hard partition (for example, the k-means algorithm), while others construct a soft partition (for example, a Gaussian mixture model) that assigns to each cluster a probability of containing a given example. Another kind of unsupervised learning algorithm constructs a new representation. Many deep learning algorithms fall into this category, and so does principal component analysis.
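
For a concrete contrast between hard and soft partitions, here is a small sketch that assumes scikit-learn is available; the data and hyperparameters are illustrative only.

```python
import numpy as np
from sklearn.cluster import KMeans             # hard partition
from sklearn.mixture import GaussianMixture    # soft partition

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
               rng.normal(5.0, 1.0, size=(50, 2))])   # two toy clusters

# k-means assigns each example to exactly one region (hard partition).
hard_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# A Gaussian mixture assigns each example a probability per cluster (soft partition).
gm = GaussianMixture(n_components=2, random_state=0).fit(X)
soft_probs = gm.predict_proba(X)

print(hard_labels[:5])    # e.g. [1 1 1 1 1]
print(soft_probs[:2])     # each row sums to 1
```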

Local Generalization

Most learning algorithms exploit a single principle to achieve generalization: local generalization. This principle assumes that if an input example x_i is close to another example x_j, then the corresponding outputs f(x_i) and f(x_j) should also be close. This is basically the principle used to perform local interpolation. The principle is very powerful, but it has limitations: what if we have to extrapolate? Or, equivalently, what if the unknown target function has many more variations than the number of training examples? In that case, local generalization cannot work, because we need at least as many examples as there are ups and downs of the target function in order to cover those variations and generalize by this principle. This issue is deeply connected to the so-called curse of dimensionality, for the following reason. When the input space is high-dimensional, the number of variations of interest can grow exponentially with the number of input dimensions. For example, suppose we want to distinguish between 10 values of each input variable (each element of the input vector), and that we care about all 10^n configurations of these n variables. Using only local generalization, we need to see at least one example of each of these 10^n configurations in order to generalize to all of them.
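
A tiny worked illustration of this counting argument (not from the article): the number of configurations that purely local generalization would need to see grows exponentially with the number of input variables n.

```python
# 10 distinguishable values per input variable, n variables:
values_per_variable = 10
for n in (1, 2, 5, 10, 20):
    print(f"n = {n:2d}: {values_per_variable ** n} configurations to cover")
# Already at n = 20 this is 10**20, far beyond any realistic training set.
```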

Distributed versus Local Representation and Non-Local Generalization

A simple binary local representation of an integer N is a sequence of B bits that are all 0 except the N-th one. A simple binary distributed representation of the integer N is a sequence of log_2(B) bits using the usual binary encoding. In this example, the distributed representation is exponentially more efficient than the local one. In general, for learning algorithms, distributed representations have the potential to capture exponentially more variations than local ones for the same number of free parameters. They therefore offer the potential for better generalization, because learning theory shows that the number of examples needed to tune O(B) effective degrees of freedom is O(B).
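
The following sketch (mine, for illustration) contrasts the two encodings: the local (one-hot) representation of an integer N requires B bits, while the distributed (binary) representation requires only about log_2(B) bits.

```python
import math

def local_representation(n, B):
    """One-hot code: B bits, all zero except the n-th (requires n < B)."""
    return [1 if i == n else 0 for i in range(B)]

def distributed_representation(n, B):
    """Usual binary encoding: only ceil(log2(B)) bits are needed."""
    width = max(1, math.ceil(math.log2(B)))
    return [(n >> i) & 1 for i in reversed(range(width))]

B = 16
print(local_representation(5, B))        # 16 bits, a single one set
print(distributed_representation(5, B))  # 4 bits: [0, 1, 0, 1]
```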

Another illustration of the difference between distributed and local representations (and the corresponding non-local and local generalization) is traditional clustering versus PCA or RBMs: the former is local, while the latter are distributed. With the k-means clustering algorithm, we maintain a parameter vector for each prototype, that is, one for each region distinguishable by the learner. With PCA, we keep track of where the density concentrates by tracking the major directions of variation. Now imagine a simplified interpretation of PCA in which, for each direction of variation, we mostly care about whether the projection of the data in that direction is above or below a threshold. With d directions, we can thus distinguish 2^d regions. RBMs are similar in that they define d hyperplanes and associate a bit with being on one side or the other of each hyperplane. An RBM therefore associates one input region with each configuration of the representation bits (in neural network parlance, these bits are called hidden units). The number of parameters of an RBM is roughly the number of these bits times the input dimension. Once again, we see that the number of regions representable by an RBM or PCA (distributed representation) grows exponentially in the number of parameters, whereas the number of regions representable by traditional clustering (for example, k-means or a Gaussian mixture; local representation) grows only linearly in the number of parameters. Another way to look at this is to realize that an RBM can generalize to a new region corresponding to a configuration of its hidden-unit bits for which no example was observed, something that is not possible for clustering algorithms (except for regions near those where examples have been observed).
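
To make this counting argument tangible, here is an illustrative sketch (random hyperplanes, not the RBM algorithm itself): d hyperplanes produce a d-bit distributed code that can in principle distinguish up to 2^d regions, while the number of parameters grows only linearly in d.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, input_dim = 8, 1000, 3
X = rng.normal(size=(n, input_dim))      # toy inputs
W = rng.normal(size=(input_dim, d))      # d random hyperplane normals
b = rng.normal(size=d)                   # d offsets

bits = (X @ W + b > 0).astype(int)       # one bit per hyperplane side
distinct_codes = {tuple(row) for row in bits}

print("parameters:", W.size + b.size)                         # linear in d
print("regions observed:", len(distinct_codes), "of up to", 2 ** d)
```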

 
