Objective:
When looking for a job in the IT industry, machine learning positions are an option alongside ordinary software development. Many computer science graduate students come into contact with this field; if your research direction is machine learning, data mining, or something similar, and you are genuinely interested in it, the role is worth considering. After all, machine learning will remain an important tool until human-level intelligence is reached, and as the technology develops, demand for this kind of talent is expected to keep growing.

Supervised Classification

1. kNN (k nearest neighbors) algorithm:
Key formula: d = \left(\sum_i (x_i - x_{test})^2\right)^{\frac{1}{2}}
Pseudo code:
Compute the distance from every point in the training set to the query point
Select the k points with the smallest distances
Return the most frequent class among those k points as the prediction for the query point
import operator
from numpy import tile

def knn(in_x, data_set, labels, k):
    data_set_size = data_set.shape[0]
    # Euclidean distance from the query point to every training point
    diff_mat = tile(in_x, (data_set_size, 1)) - data_set
    sq_diff_mat = diff_mat ** 2
    distances = sq_diff_mat.sum(axis=1) ** 0.5
    sorted_dist_indices = distances.argsort()
    class_count = {}
    for i in range(k):
        vote_label = labels[sorted_dist_indices[i]]
        class_count[vote_label] = class_count.get(vote_label, 0) + 1
    sorted_class_count = sorted(class_count.items(), key=operator.itemgetter(1), reverse=True)
    return sorted_class_count[0][0]
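A quick check of the function above on a toy dataset; the four sample points and their 'A'/'B' labels are made up here purely for illustration:

from numpy import array

group = array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']
print(knn([0.0, 0.0], group, labels, 3))   # the three nearest neighbours are B, B, A, so this prints 'B'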
2. Decision Tree:
Key formula: H(x) = -\sum_i p(x_i) \log_2 p(x_i)
The key point of a decision tree is choosing which attribute to branch on based on information entropy, so pay attention to the entropy formula and make sure you understand it thoroughly.
Pseudo code:
Check whether every item in the dataset belongs to the same class:
    If so, return the class label
    Else
        Find the best feature to split the dataset
        Create a branch node
        For each subset produced by the split
            Call createBranch recursively and attach the result to the branch node
        Return the branch node
There is really only one principle: make the labels at each node as pure as possible. Note the sentence in the pseudocode above: "find the best feature to split the dataset." How do we find the best feature? The general rule is to make the classes at the branch nodes as pure as possible, that is, to split as cleanly as possible. As shown in Figure 1, five animals are fished out of the ocean; we have to decide whether each is a fish, and which feature to test first.
(Figure 1)
To improve recognition accuracy, which feature should we test first: "can it survive out of the water" or "does it have flippers"? We need a yardstick; commonly used criteria are information entropy (from information theory) and Gini impurity, and here we use the former. Our goal is to select the feature that yields the largest information gain on the labels after splitting the dataset. The information gain is the entropy of the labels in the original dataset minus the entropy of the labels after the split; in other words, a large information gain means the entropy becomes smaller, making the dataset more ordered. The entropy (denoted by a capital H) is computed as shown in Formula 1:
H = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)
(Formula 1)
where n is the number of classes (for example, suppose it is a two-class problem, so n = 2). Computing the probabilities p_1 and p_2 of the two classes among all samples gives the information entropy of the dataset before splitting on any attribute.
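A minimal Python sketch of the pseudocode above together with Formula 1. It assumes the common list-of-lists dataset layout where the last column is the class label; the function names (calc_entropy, split_data_set, choose_best_feature, create_branch) and the tiny five-animal dataset with its labels are illustrative assumptions, not code or data from the original article:

import math
from collections import Counter

def calc_entropy(data_set):
    # Formula 1: H = -sum_i p(x_i) * log2 p(x_i), computed over the class labels
    counts = Counter(row[-1] for row in data_set)
    total = len(data_set)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def split_data_set(data_set, axis, value):
    # keep rows whose feature `axis` equals `value`, and drop that feature column
    return [row[:axis] + row[axis + 1:] for row in data_set if row[axis] == value]

def choose_best_feature(data_set):
    base_entropy = calc_entropy(data_set)
    best_gain, best_feature = 0.0, -1
    for axis in range(len(data_set[0]) - 1):
        new_entropy = 0.0
        for value in {row[axis] for row in data_set}:
            subset = split_data_set(data_set, axis, value)
            new_entropy += len(subset) / len(data_set) * calc_entropy(subset)
        gain = base_entropy - new_entropy      # information gain of splitting on this feature
        if gain > best_gain:
            best_gain, best_feature = gain, axis
    return best_feature

def create_branch(data_set, feature_names):
    labels = [row[-1] for row in data_set]
    if labels.count(labels[0]) == len(labels):      # all samples share one class
        return labels[0]
    if len(data_set[0]) == 1:                       # no features left: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = choose_best_feature(data_set)
    name = feature_names[best]
    tree = {name: {}}
    rest = feature_names[:best] + feature_names[best + 1:]
    for value in {row[best] for row in data_set}:
        tree[name][value] = create_branch(split_data_set(data_set, best, value), rest)
    return tree

# Hypothetical version of the five-animal example (Figure 1 is not shown, so the labels are assumed)
fish_data = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
print(calc_entropy(fish_data))   # base entropy before any split, about 0.971 bits
print(create_branch(fish_data, ['survives out of water', 'has flippers']))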
3. Naive Bayes:
Key formula: P(A|B) = \dfrac{P(A \cap B)}{P(B)}
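A tiny worked check of the formula with made-up counts (the numbers are purely illustrative and are not from the original article):

# Hypothetical counts illustrating P(A|B) = P(A ∩ B) / P(B):
# out of 200 e-mails, 50 contain the word "free" (event B),
# and 30 of those 50 are also spam (event A ∩ B).
total = 200
p_b = 50 / total          # P(contains "free")
p_a_and_b = 30 / total    # P(spam and contains "free")
print(p_a_and_b / p_b)    # P(spam | "free") = 0.6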
Why?
An important application of machine learning is automatic document classification. In document classification, an entire document, such as an e-mail message, is an instance, while certain elements of the e-mail form the features. We can look at the words appearing in the document and treat the presence or absence of each word as a feature, which gives us a number of features equal to the vocabulary size. Suppose the vocabulary contains 1000 words. To get a good probability distribution we need enough data samples; say each single feature requires N samples. Then, if the features interact, 10 features would require N^{10} samples, and a vocabulary of 1000 features would require N^{1000} samples. As you can see, the number of samples required grows explosively with the number of features.
If the features are independent of each other, then the number of samples needed drops from N^{1000} to 1000·N, since each feature can be estimated on its own. This independence assumption is exactly what makes the classifier "naive".
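As a sketch of the "each word present or absent is a feature" idea from the paragraph above, here is one way to turn a document into a 0/1 feature vector over a vocabulary; the function names and the tiny example documents are assumptions for illustration only:

def create_vocab(documents):
    # union of all words seen in the training documents
    vocab = set()
    for doc in documents:
        vocab |= set(doc)
    return sorted(vocab)

def doc_to_vec(vocab, doc):
    # 1 if the vocabulary word appears in the document, else 0 (set-of-words model)
    words = set(doc)
    return [1 if w in words else 0 for w in vocab]

docs = [['free', 'money', 'now'], ['meeting', 'at', 'noon']]
vocab = create_vocab(docs)
print(vocab)                                        # ['at', 'free', 'meeting', 'money', 'noon', 'now']
print(doc_to_vec(vocab, ['free', 'coffee', 'at', 'noon']))   # [1, 1, 0, 0, 1, 0]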