Basic machine learning algorithms: ideas and programming implementations

Source: Internet
Author: User
Tags: id3

Overview

This post summarizes the ideas behind commonly used machine learning algorithms: the \(k\)-nearest neighbor algorithm, decision trees, naive Bayes, logistic regression and \(k\)-means clustering, together with Python implementations. The goal is to know not only how they work but also why. Reference: "Machine Learning in Action".

\(k\)-Nearest Neighbor Algorithm


Basic principle

The \(k\)-nearest neighbor algorithm is the simplest and most effective method for classifying data. In a nutshell, it classifies by measuring the distance between feature vectors: it looks up the class labels of the most similar data points in the sample set, and in general we only consider the \(k\) most similar samples.

Code implementation

The key to the code is to compute the distance between the input point and every point in the dataset and sort the distances in ascending order. Keep in mind that distances.argsort() returns the indices that would sort the array distances from smallest to largest; NumPy's built-in functions make this very concise.

import operator
import numpy as np

def classify0(inX, dataSet, labels, k):
    """
    inX: input vector to classify
    dataSet: training sample set, one sample per row
    labels: label vector, one label per row of dataSet
    k: number of nearest neighbors to use
    """
    dataSetSize = dataSet.shape[0]                      # number of samples
    diffMat = np.tile(inX, (dataSetSize, 1)) - dataSet  # np.tile repeats inX to match dataSet's shape
    sqDiffMat = diffMat ** 2                            # squared differences
    sqDistances = sqDiffMat.sum(axis=1)                 # sum over columns, returns an array
    distances = sqDistances ** 0.5                      # Euclidean distances
    sortedDistIndicies = distances.argsort()            # indices of distances from smallest to largest
    classCount = {}                                     # dict.get(key, 0) returns 0 when key is missing
    for i in range(k):                                  # vote among the k nearest neighbors
        voteILabel = labels[sortedDistIndicies[i]]
        classCount[voteILabel] = classCount.get(voteILabel, 0) + 1
    # items() returns the dictionary entries as (key, value) pairs; operator.itemgetter(1)
    # sorts them by vote count, and reverse=True puts the largest count first
    sortedClassCount = sorted(classCount.items(),
                              key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
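As a quick sanity check, a toy call might look like this (the group array and its labels are made-up example data, not part of the original text):

group = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']
print(classify0(np.array([0.2, 0.1]), group, labels, 3))   # the three nearest neighbors vote 'B'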


Advantages and Disadvantages

Advantages: high accuracy, insensitive to outliers, no assumptions about the input data

Cons: high computational complexity, high space complexity, and it cannot reveal the underlying structure of the data


Decision Tree


Basic principle

A decision tree classifies data through a series of rules.

It draws on probability theory and uses a tree diagram as its analysis tool. The basic idea is to use a decision node to represent the decision problem, branches to represent the candidate options, and probability branches to represent the possible outcomes of each option; by computing and comparing the profit and loss of each option under each outcome, it provides a basis for the decision maker.

Implementation of the Decision Tree

The implementation of the decision tree is divided into three main steps:

    • Feature selection: choose one feature from the many features in the training data as the split criterion of the current node. There are many different quantitative criteria for choosing features, which give rise to different decision tree algorithms.
    • Decision tree generation: child nodes are generated recursively from top to bottom according to the chosen feature evaluation criterion, until the dataset can no longer be divided.
    • Pruning: decision trees overfit easily, so pruning is generally needed to reduce the size of the tree and alleviate overfitting. There are two kinds of pruning techniques: pre-pruning and post-pruning.

The biggest principle of partitioning a dataset is to make unordered data more ordered. If there are 20 features in the training data, which one do we split on? This requires a quantitative criterion, and one family of criteria comes from information theory. Decision tree algorithms based on information theory include ID3, CART and C4.5, of which C4.5 and CART are derived from ID3. The ID3 algorithm is built on "Occam's Razor": a smaller decision tree is preferable to a larger one, all else being equal. ID3 evaluates and selects features by information gain, choosing at each step the feature with the largest information gain as the splitting node.
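As an illustration, here is a minimal sketch of how ID3-style feature selection by information gain could be coded; the function names and the assumption that each sample is a list whose last element is the class label are mine, not from the original text.

import math
from collections import Counter

def calc_shannon_entropy(data_set):
    # H = -sum(p * log2 p) over the class labels (last element of each sample)
    label_counts = Counter(sample[-1] for sample in data_set)
    n = len(data_set)
    entropy = 0.0
    for count in label_counts.values():
        p = count / n
        entropy -= p * math.log(p, 2)
    return entropy

def split_data_set(data_set, axis, value):
    # keep samples whose feature `axis` equals `value`, with that feature removed
    return [sample[:axis] + sample[axis + 1:] for sample in data_set if sample[axis] == value]

def choose_best_feature(data_set):
    # pick the feature whose split yields the largest information gain
    base_entropy = calc_shannon_entropy(data_set)
    best_gain, best_feature = 0.0, -1
    num_features = len(data_set[0]) - 1
    for i in range(num_features):
        values = set(sample[i] for sample in data_set)
        new_entropy = 0.0
        for value in values:                 # weighted entropy after splitting on feature i
            subset = split_data_set(data_set, i, value)
            prob = len(subset) / float(len(data_set))
            new_entropy += prob * calc_shannon_entropy(subset)
        gain = base_entropy - new_entropy    # information gain of feature i
        if gain > best_gain:
            best_gain, best_feature = gain, i
    return best_feature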


Advantages and Disadvantages

Advantages: the computational complexity is not high, the output is easy to understand, it is insensitive to missing intermediate values, and it can handle irrelevant features

Cons: may cause over-matching (overfitting)

Naive Bayes


Basic principle

The core idea of Bayesian decision theory: choose the decision with the highest probability.

The core is Bayes' rule, which tells us how to swap the condition and the outcome in a conditional probability: if you know \(P(x|c)\) and want \(P(c|x)\), you can use the following formula:
\begin{align}
P(c|x) = \frac{P(x|c)P(c)}{P(x)} \notag
\end{align}
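As a concrete illustration with assumed numbers (not from the original text): if \(P(x|c) = 0.6\), \(P(c) = 0.3\) and \(P(x) = 0.36\), then \(P(c|x) = 0.6 \times 0.3 / 0.36 = 0.5\).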

The assumption that the features are conditionally independent of each other is what the word "naive" in naive Bayes refers to. Another assumption of the naive Bayes classifier is that every feature is equally important. Although both assumptions have minor flaws, naive Bayes works very well in practice.

Using conditional probabilities to classify

Bayesian decision theory requires computing two probabilities, \(P(c_1|x)\) and \(P(c_2|x)\) (for two classes). The point is: given a data point represented by \(x\), what is the probability that it came from class \(c_1\)? And from class \(c_2\)? Note that these probabilities are not the same as \(P(x|c_1)\), but Bayes' rule lets us swap the condition and the outcome. Using these definitions, we can state the Bayesian classification criterion:

    • If \(P(c_1|x) > P(c_2|x)\), then the point belongs to class \(c_1\).
    • If \(P(c_1|x) < P(c_2|x)\), then the point belongs to class \(c_2\).

For a practical problem, we need to do the following steps (a minimal code sketch follows the list):

    • Estimate \(P(c_i), \quad i=1,2\) from the training data.
    • Next compute \(P(x|c_i)\). Here the naive Bayes assumption is used: if \(x\) is expanded into individual features, then \(P(x|c_i) = P(x_0,x_1,\cdots,x_n | c_i) = P(x_0|c_i)P(x_1|c_i)\cdots P(x_n|c_i)\). (If the individual factors are very small, take logarithms to avoid underflow.)
    • For a vector \(w\) to classify, compute \(P(w|c_i)P(c_i), \quad i=1,2\); whichever value is larger determines the class.
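The sketch below shows one way these steps could be coded for binary (0/1) feature vectors and two classes; the function names, the Laplace-smoothed counts and the log-probability trick are assumptions made here in the spirit of the text, not the original code.

import numpy as np

def train_nb(train_matrix, train_labels):
    # train_matrix: (n_docs, n_features) array of 0/1 features; train_labels: 0/1 class labels
    n_docs, n_features = train_matrix.shape
    p_c1 = np.sum(train_labels) / float(n_docs)                  # P(c_1); P(c_0) = 1 - p_c1
    num_c0 = np.ones(n_features); num_c1 = np.ones(n_features)   # Laplace smoothing: counts start at 1
    denom_c0 = 2.0; denom_c1 = 2.0                               # and denominators at 2
    for i in range(n_docs):
        if train_labels[i] == 1:
            num_c1 += train_matrix[i]; denom_c1 += np.sum(train_matrix[i])
        else:
            num_c0 += train_matrix[i]; denom_c0 += np.sum(train_matrix[i])
    # logarithms, as suggested above, so products of many small probabilities do not underflow
    return np.log(num_c0 / denom_c0), np.log(num_c1 / denom_c1), p_c1

def classify_nb(vec, log_p0, log_p1, p_c1):
    # compare log P(w|c_i) + log P(c_i) for the two classes
    score1 = np.sum(vec * log_p1) + np.log(p_c1)
    score0 = np.sum(vec * log_p0) + np.log(1.0 - p_c1)
    return 1 if score1 > score0 else 0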


Advantages and Disadvantages

Pros: still effective with small amounts of data; can handle multi-class problems

Cons: sensitive to how the input data is prepared

Logistic regression


Basic principle

The goal of logistic regression is to find the best-fitting parameters of a nonlinear function, the sigmoid, and the fitting is done by an optimization algorithm. The most common choice is gradient ascent.

The sigmoid function is given by the following formula:
\begin{align}
\sigma (z) = \frac{1}{1+\mathrm{e}^{-z}} \notag
\end{align}
Clearly \(\sigma(0) = 0.5\). To implement the logistic regression classifier, we multiply each feature by a regression coefficient, add up all the results, and substitute the sum into the sigmoid function, which yields a value between \(0\) and \(1\). Anything greater than \(0.5\) is assigned to class \(1\), and anything less than \(0.5\) to class \(0\). So logistic regression can also be viewed as a probability estimate. The main question now is: how do we determine the optimal regression coefficients? Once the cost function is defined, we can solve for them with gradient ascent.

Code implementation

The main part of the algorithm is the gradient ascent update; a cleaned-up version is given below:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradAscent(dataMatIn, classLabels):
    # batch gradient ascent
    m, n = np.shape(dataMatIn)
    alpha = 0.001                                   # step size
    maxCycles = 500                                 # number of iterations
    weights = np.ones(n)                            # 1*n array of weights
    for k in range(maxCycles):
        h = sigmoid(np.dot(dataMatIn, weights))     # 1*m array of predictions
        error = classLabels - h                     # adjust in the direction of the error
        weights = weights + alpha * np.dot(error, dataMatIn)
    return weights

def stocGradAscent0(dataMatIn, classLabels, numIter=150):
    # stochastic gradient ascent: update with one randomly chosen sample at a time
    m, n = np.shape(dataMatIn)
    weights = np.ones(n)                            # initialize the weight array, 1*n
    for j in range(numIter):
        dataIndex = list(range(m))
        for i in range(m):
            alpha = 4 / (1.0 + j + i) + 0.01        # step size shrinks as iterations proceed
            randIndex = int(np.random.uniform(0, len(dataIndex)))  # the only difference from batch
                                                    # gradient ascent: a randomly selected update
            sample = dataIndex[randIndex]
            h = sigmoid(np.sum(dataMatIn[sample] * weights))        # a single number
            error = classLabels[sample] - h                         # a single number
            weights = weights + alpha * error * dataMatIn[sample]
            del dataIndex[randIndex]                # each sample is used once per pass
    return weights
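The original snippet does not include a classification helper; a possible one, using the 0.5 threshold described above (the name classifyVector and the usage line are assumptions):

def classifyVector(inX, weights):
    # probability from the sigmoid, thresholded at 0.5
    prob = sigmoid(np.sum(inX * weights))
    return 1.0 if prob > 0.5 else 0.0

# e.g. weights = gradAscent(dataArr, labelArr); classifyVector(dataArr[0], weights)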


Advantages and Disadvantages

Advantages: low computational cost, easy to understand and implement

Disadvantages: prone to underfitting; classification accuracy may not be high

\(k\)-Means Clustering


Basic principle

Clustering is a form of unsupervised learning that places similar objects into the same cluster. \(k\)-means clustering is called \(k\)-means because it finds \(k\) distinct clusters, and the center of each cluster is computed as the mean of the values it contains.

\(k\)-means clustering is an algorithm that discovers \(k\) clusters in a given dataset. The number of clusters \(k\) is supplied by the user, and each cluster is described by its centroid, a point at the center of the cluster. The algorithm flow:

    • Create \(k\) points as the starting centroids (often chosen at random)
    • While the cluster assignment of any point changes (i.e. the algorithm has not yet converged):
        • For each data point in the dataset:
            • For each centroid, compute the distance between the centroid and the data point
            • Assign the data point to the cluster whose centroid is closest
        • For each cluster, compute the mean of the points assigned to it and use that mean as the new centroid


Code implementation

Suppose we cluster a set of data points; the points come from "Machine Learning in Action". The code is as follows:

# coding: utf-8
import numpy as np
import matplotlib.pyplot as plt

def loadDataSet(fileName):
    # general function to parse tab-delimited floats; each line is one sample
    dataMat = []
    with open(fileName) as fr:
        for line in fr.readlines():
            curLine = line.strip().split('\t')
            fltLine = list(map(float, curLine))     # map all elements to float
            dataMat.append(fltLine)
    return np.array(dataMat)

def distEclud(vecA, vecB):
    # Euclidean distance between two vectors
    return np.sqrt(np.sum(np.power(vecA - vecB, 2)))

def randCent(dataSet, k):
    # build k random centroids; each must lie within the bounds of the dataset
    n = np.shape(dataSet)[1]
    centroids = np.zeros((k, n))
    for j in range(n):
        minJ = min(dataSet[:, j])
        rangeJ = float(max(dataSet[:, j]) - minJ)
        centroids[:, j] = minJ + rangeJ * np.random.rand(k, 1).flatten()
    return centroids

def kMeans(dataSet, k, distMeas=distEclud, createCent=randCent):
    m = np.shape(dataSet)[0]                # m samples
    clusterAssment = np.zeros((m, 2))       # stores each point's assignment:
                                            # column 0 is the cluster index, column 1 the squared error
    centroids = createCent(dataSet, k)      # create k centroids
    clusterChanged = True
    while clusterChanged:                   # repeat until no assignment changes
        clusterChanged = False
        for i in range(m):                  # assign each point to the nearest centroid
            minDist = np.inf; minIndex = -1
            for j in range(k):
                distJI = distMeas(centroids[j, :], dataSet[i, :])
                if distJI < minDist:
                    minDist = distJI; minIndex = j
            if clusterAssment[i, 0] != minIndex:
                clusterChanged = True
            clusterAssment[i, :] = minIndex, minDist ** 2
        for cent in range(k):               # recompute each centroid as the mean of its points
            ptsInClust = dataSet[np.nonzero(clusterAssment[:, 0] == cent)]
            centroids[cent, :] = np.mean(ptsInClust, axis=0)
    return centroids, clusterAssment        # return the centroids and the point assignments

if __name__ == '__main__':
    datMat = loadDataSet('testSet.txt')
    k = 4
    myCentroids, clustAssing = kMeans(datMat, k)
    fig = plt.figure()
    ax = fig.add_subplot(1, 1, 1)
    scatterMarkers = ['s', 'o', '^', '8', 'p', 'd', 'v', 'h', '>', '<']
    for i in range(k):
        ptsInCurrCluster = datMat[np.nonzero(clustAssing[:, 0] == i)]
        ax.scatter(ptsInCurrCluster[:, 0], ptsInCurrCluster[:, 1],
                   marker=scatterMarkers[i], s=90)
    ax.scatter(myCentroids[:, 0], myCentroids[:, 1], marker='+', s=300)
    plt.show()

The resulting visualization is a scatter plot of the four clusters, with the cluster centroids marked with '+' (figure omitted here).

Because the initial centroids are chosen at random, the result differs slightly from run to run.


Advantages and Disadvantages
    • Advantages: easy to implement
    • Disadvantages: may converge to a local minimum; converges slowly on large datasets


How to determine the parameter \(k\)?

If \(k\) is chosen too small, the algorithm may converge to a local minimum rather than the global minimum. One metric for measuring clustering quality is the SSE (sum of squared errors), which corresponds to the sum of the error column of the clusterAssment matrix in the code above. A smaller SSE means the data points are closer to their centroids and the clustering is better. One sure way to reduce the SSE is to increase the number of clusters, but that defeats the purpose of clustering; the goal is to improve cluster quality while keeping the number of clusters constant.
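For instance, with the assignment matrix clustAssing returned by the kMeans code above (its second column stores each point's squared error), the SSE could be computed as:

sse = np.sum(clustAssing[:, 1])    # sum of squared errors over all points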

So how can this be improved? One approach is to split the cluster with the largest SSE into two: in the implementation, the points of that cluster are filtered out and the \(k\)-means algorithm is run on them with \(k=2\). To keep the total number of clusters constant, two clusters are then merged; the pair to merge is usually chosen in one of two quantifiable ways: merge the two nearest centroids, or merge the two clusters whose merger increases the SSE the least.

Bisecting \(k\)-Means Algorithm

To overcome the problem that \(k\)-means can converge to a local minimum, another algorithm called bisecting \(k\)-means has been proposed. The algorithm first treats all points as one cluster and then splits that cluster in two. It then selects one cluster to split further, choosing the cluster whose split reduces the SSE the most. This SSE-based splitting is repeated until the user-specified number of clusters is reached. Another option is simply to split the cluster with the largest SSE each time until the number of clusters reaches the user-specified value.
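A possible sketch of bisecting \(k\)-means, reusing the kMeans and distEclud functions from the \(k\)-means section above (the function name biKmeans and the bookkeeping details are assumptions, not the original code):

def biKmeans(dataSet, k, distMeas=distEclud):
    m = np.shape(dataSet)[0]
    clusterAssment = np.zeros((m, 2))
    centroid0 = np.mean(dataSet, axis=0).tolist()        # start with a single cluster: the mean of all points
    centList = [centroid0]
    for j in range(m):                                   # initial squared error of every point
        clusterAssment[j, 1] = distMeas(np.array(centroid0), dataSet[j, :]) ** 2
    while len(centList) < k:
        lowestSSE = np.inf
        for i in range(len(centList)):                   # try splitting each existing cluster in two
            ptsInCurrCluster = dataSet[np.nonzero(clusterAssment[:, 0] == i)]
            centroidMat, splitClustAss = kMeans(ptsInCurrCluster, 2, distMeas)
            sseSplit = np.sum(splitClustAss[:, 1])       # SSE of the cluster after splitting
            sseNotSplit = np.sum(clusterAssment[np.nonzero(clusterAssment[:, 0] != i)][:, 1])
            if sseSplit + sseNotSplit < lowestSSE:       # keep the split that lowers the total SSE most
                bestCentToSplit = i
                bestNewCents = centroidMat
                bestClustAss = splitClustAss.copy()
                lowestSSE = sseSplit + sseNotSplit
        # relabel the two new clusters: one keeps the old index, the other gets a brand-new index
        bestClustAss[np.nonzero(bestClustAss[:, 0] == 1)[0], 0] = len(centList)
        bestClustAss[np.nonzero(bestClustAss[:, 0] == 0)[0], 0] = bestCentToSplit
        centList[bestCentToSplit] = bestNewCents[0, :].tolist()
        centList.append(bestNewCents[1, :].tolist())
        clusterAssment[np.nonzero(clusterAssment[:, 0] == bestCentToSplit)[0], :] = bestClustAss
    return np.array(centList), clusterAssment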
