Objective:
When looking for a job in the IT industry, machine learning positions are an option alongside ordinary software development. Many computer science graduate students come into contact with this field; if your research direction is machine learning, data mining, or something similar, and you are genuinely interested in it, the role is worth considering. After all, machine learning will remain an important tool until human-level intelligence is reached, and as the technology develops, demand for this kind of talent is expected to keep growing.

Supervised Classification

1. kNN (k nearest neighbors) algorithm:
Key formula: d = \left(\sum_i (x_i - x_{test})^2\right)^{\frac{1}{2}}
Pseudo code:
Compute the distance from every point in the training set to the query point
Select the k points with the smallest distances
Return the most frequent class among those k points as the prediction for the query point
import operator
from numpy import tile

def knn(in_x, data_set, labels, k):
    data_set_size = data_set.shape[0]
    # Euclidean distance from the query point to every training point
    diff_mat = tile(in_x, (data_set_size, 1)) - data_set
    sq_diff_mat = diff_mat ** 2
    distances = sq_diff_mat.sum(axis=1) ** 0.5
    sorted_dist_indices = distances.argsort()
    class_count = {}
    for i in range(k):
        vote_label = labels[sorted_dist_indices[i]]
        class_count[vote_label] = class_count.get(vote_label, 0) + 1
    sorted_class_count = sorted(class_count.items(), key=operator.itemgetter(1), reverse=True)
    return sorted_class_count[0][0]
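A quick check of the function above on a toy dataset; the four sample points and their 'A'/'B' labels are made up here purely for illustration:

from numpy import array

group = array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']
print(knn([0.0, 0.0], group, labels, 3))   # the three nearest neighbours are B, B, A, so this prints 'B'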
2. Decision Tree:
Key formula: H(x) = -\sum_i p(x_i) \log_2 p(x_i)
The key point of a decision tree is choosing which attribute to branch on based on information entropy, so pay attention to the entropy formula and make sure you understand it thoroughly.
Pseudo code:
Check whether every item in the dataset belongs to the same class:
    If so, return the class label
    Else
        Find the best feature to split the dataset
        Create a branch node
        For each subset produced by the split
            Call createBranch recursively and attach the result to the branch node
        Return the branch node
There is really only one principle: make the labels at each node as pure as possible. Note the sentence in the pseudocode above: "find the best feature to split the dataset." How do we find the best feature? The general rule is to make the classes at the branch nodes as pure as possible, that is, to split as cleanly as possible. As shown in Figure 1, five animals are fished out of the ocean; we have to decide whether each is a fish, and which feature to test first.
(Figure 1)
To improve recognition accuracy, which feature should we test first: "can it survive out of the water" or "does it have flippers"? We need a yardstick; commonly used criteria are information entropy (from information theory) and Gini impurity, and here we use the former. Our goal is to select the feature that yields the largest information gain on the labels after splitting the dataset. The information gain is the entropy of the labels in the original dataset minus the entropy of the labels after the split; in other words, a large information gain means the entropy becomes smaller, making the dataset more ordered. The entropy (denoted by a capital H) is computed as shown in Formula 1:
H = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)
(Formula 1)
where n is the number of classes (for example, suppose it is a two-class problem, so n = 2). Computing the probabilities p_1 and p_2 of the two classes among all samples gives the information entropy of the dataset before splitting on any attribute.
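A minimal Python sketch of the pseudocode above together with Formula 1. It assumes the common list-of-lists dataset layout where the last column is the class label; the function names (calc_entropy, split_data_set, choose_best_feature, create_branch) and the tiny five-animal dataset with its labels are illustrative assumptions, not code or data from the original article:

import math
from collections import Counter

def calc_entropy(data_set):
    # Formula 1: H = -sum_i p(x_i) * log2 p(x_i), computed over the class labels
    counts = Counter(row[-1] for row in data_set)
    total = len(data_set)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def split_data_set(data_set, axis, value):
    # keep rows whose feature `axis` equals `value`, and drop that feature column
    return [row[:axis] + row[axis + 1:] for row in data_set if row[axis] == value]

def choose_best_feature(data_set):
    base_entropy = calc_entropy(data_set)
    best_gain, best_feature = 0.0, -1
    for axis in range(len(data_set[0]) - 1):
        new_entropy = 0.0
        for value in {row[axis] for row in data_set}:
            subset = split_data_set(data_set, axis, value)
            new_entropy += len(subset) / len(data_set) * calc_entropy(subset)
        gain = base_entropy - new_entropy      # information gain of splitting on this feature
        if gain > best_gain:
            best_gain, best_feature = gain, axis
    return best_feature

def create_branch(data_set, feature_names):
    labels = [row[-1] for row in data_set]
    if labels.count(labels[0]) == len(labels):      # all samples share one class
        return labels[0]
    if len(data_set[0]) == 1:                       # no features left: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = choose_best_feature(data_set)
    name = feature_names[best]
    tree = {name: {}}
    rest = feature_names[:best] + feature_names[best + 1:]
    for value in {row[best] for row in data_set}:
        tree[name][value] = create_branch(split_data_set(data_set, best, value), rest)
    return tree

# Hypothetical version of the five-animal example (Figure 1 is not shown, so the labels are assumed)
fish_data = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
print(calc_entropy(fish_data))   # base entropy before any split, about 0.971 bits
print(create_branch(fish_data, ['survives out of water', 'has flippers']))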
3. Naive Bayes:
Key formula: P(A|B) = \dfrac{P(A \cap B)}{P(B)}
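A tiny worked check of the formula with made-up counts (the numbers are purely illustrative and are not from the original article):

# Hypothetical counts illustrating P(A|B) = P(A ∩ B) / P(B):
# out of 200 e-mails, 50 contain the word "free" (event B),
# and 30 of those 50 are also spam (event A ∩ B).
total = 200
p_b = 50 / total          # P(contains "free")
p_a_and_b = 30 / total    # P(spam and contains "free")
print(p_a_and_b / p_b)    # P(spam | "free") = 0.6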
Why?
An important application of machine learning is automatic document classification. In document classification, an entire document, such as an e-mail message, is an instance, while certain elements of the e-mail form the features. We can look at the words appearing in the document and treat the presence or absence of each word as a feature, which gives us a number of features equal to the vocabulary size. Suppose the vocabulary contains 1000 words. To get a good probability distribution we need enough data samples; say each single feature requires N samples. Then, if the features interact, 10 features would require N^{10} samples, and a vocabulary of 1000 features would require N^{1000} samples. As you can see, the number of samples required grows explosively with the number of features.
If the features are independent of each other, then the number of samples needed drops from N^{1000} to 1000·N, since each feature can be estimated on its own. This independence assumption is exactly what makes the classifier "naive".
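As a sketch of the "each word present or absent is a feature" idea from the paragraph above, here is one way to turn a document into a 0/1 feature vector over a vocabulary; the function names and the tiny example documents are assumptions for illustration only:

def create_vocab(documents):
    # union of all words seen in the training documents
    vocab = set()
    for doc in documents:
        vocab |= set(doc)
    return sorted(vocab)

def doc_to_vec(vocab, doc):
    # 1 if the vocabulary word appears in the document, else 0 (set-of-words model)
    words = set(doc)
    return [1 if w in words else 0 for w in vocab]

docs = [['free', 'money', 'now'], ['meeting', 'at', 'noon']]
vocab = create_vocab(docs)
print(vocab)                                        # ['at', 'free', 'meeting', 'money', 'noon', 'now']
print(doc_to_vec(vocab, ['free', 'coffee', 'at', 'noon']))   # [1, 1, 0, 0, 1, 0]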