http://www.csdn.net/article/2012-12-28/2813275-Support-Vector-Machine
Abstract: The support vector machine (SVM) has become a very popular algorithm. This article explains how SVM works and gives several examples that use the Python scikits library. As a machine learning algorithm, SVM can be used to solve classification and regression problems; it uses the kernel trick to transform the data and then, based on those transformations, finds an optimal boundary among the possible outputs.
"CSDN report" Support vector Machines (SVM) has become a very popular algorithm. In this article, Greg lamp simply explains how it works, and he gives a few examples of using the Python scikits library. All the code is available on GitHub, and Greg lamp will further elaborate on the details of using scikits and Sklearn. CSDN This technical article is compiled and organized:
What is SVM?
SVM is a machine learning algorithm that can be used to solve classification and regression problems. It uses a technique called the kernel trick to transform the data and then, based on those transformations, finds an optimal boundary among the possible outputs. In short, it performs some very complex data transformations and then works out how to separate the user's data according to predefined labels or outputs.
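To make that fit/predict workflow concrete, here is a minimal sketch (not from the original article) using scikit-learn's svm.SVC; the points, labels, and query point are invented purely for illustration.

from sklearn import svm

# four made-up 2-D points with made-up labels, purely for illustration
X = [[0, 0], [0, 1], [2, 0], [2, 1]]
y = [0, 0, 1, 1]

clf = svm.SVC()          # non-linear (RBF) kernel by default
clf.fit(X, y)            # learn a boundary between the two labels
print(clf.predict([[1.8, 0.5]]))   # should print [1], the label of the nearby training points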
What makes it so powerful?
Of course, SVM is fully capable of both classification and regression. In this article, Greg Lamp focuses on classification with SVM, in particular non-linear SVM, that is, SVM with a non-linear kernel. A non-linear SVM computes a boundary that is not necessarily a straight line; the advantage is that it can capture much more complex relationships among the data points without the user having to perform difficult transformations by hand. The disadvantage is that training takes much longer because it is far more computationally intensive.
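As a rough illustration of the difference (not part of Greg Lamp's code), the sketch below trains the same SVC with a linear kernel and with a non-linear RBF kernel on a synthetic data set that no straight line can separate; the choice of sklearn.datasets.make_circles and its parameters are assumptions made here for illustration only.

from sklearn import svm
from sklearn.datasets import make_circles

# a synthetic two-class data set that no straight line can separate
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05)

linear_clf = svm.SVC(kernel='linear').fit(X, y)   # straight-line boundary
rbf_clf = svm.SVC(kernel='rbf').fit(X, y)         # curved boundary

print("linear kernel accuracy: %.2f" % linear_clf.score(X, y))  # noticeably lower
print("rbf kernel accuracy:    %.2f" % rbf_clf.score(X, y))     # close to 1.0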
What is kernel trick?
The kernel trick takes the data you give it: you feed in some features that seem reasonably informative, it processes them, and out comes data you barely recognize, a bit like unravelling a strand of DNA. You start from a vector of data, pass it through the kernel trick, and it keeps being decomposed and recombined until it forms a much larger data set, one that is usually very hard to interpret directly. This is where the magic lies: the expanded data set has much more pronounced boundaries, so the SVM algorithm can compute a far better hyperplane.
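A small worked sketch (again, not from the original article) may make this more tangible: for a degree-2 polynomial kernel, evaluating the kernel on the raw points gives exactly the same value as explicitly mapping the points into a higher-dimensional space first and taking an ordinary dot product, so the SVM can work "in" the expanded space without ever constructing it. The feature map phi below is the standard textbook one and is used only for illustration.

import numpy as np

def phi(v):
    # explicit degree-2 feature map for a 2-D point [x1, x2]
    x1, x2 = v
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

a = np.array([1.0, 2.0])
b = np.array([3.0, 0.5])

explicit = np.dot(phi(a), phi(b))  # dot product after mapping into 3-D space
kernel = np.dot(a, b) ** 2         # polynomial kernel K(a, b) = (a . b)^2, no mapping needed

print("explicit dot product: %.1f" % explicit)  # 16.0
print("kernel value:         %.1f" % kernel)    # 16.0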
Here is another example: suppose you are a farmer and you have a problem. You need to build a fence to protect your herd from wolves, but where should the fence go? If you are a very data-driven farmer, one approach is to build a classifier based on the positions of the herd and the wolves in your pasture. Comparing several different classifiers, as shown in the figure, the SVM clearly produces a perfect separation. Greg Lamp sees this story as a nice illustration of the advantage of a non-linear classifier: it is obvious that both the logistic model and the decision tree model can only use straight-line boundaries.
The implementation code (farmer.py, Python) is as follows:
import numpy as np
import pylab as pl
import pandas as pd

from sklearn import svm
from sklearn import linear_model
from sklearn import tree


def plot_results_with_hyperplane(clf, clf_name, df, plt_nmbr):
    x_min, x_max = df.x.min() - .5, df.x.max() + .5
    y_min, y_max = df.y.min() - .5, df.y.max() + .5

    # step between points. i.e. [0, 0.02, 0.04, ...]
    step = .02
    # to plot the boundary, we're going to create a matrix of every possible point
    # then label each point as a wolf or cow using our classifier
    xx, yy = np.meshgrid(np.arange(x_min, x_max, step),
                         np.arange(y_min, y_max, step))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    # this gets our predictions back into a matrix
    Z = Z.reshape(xx.shape)

    # create a subplot (we're going to have more than 1 plot on a given image)
    pl.subplot(2, 2, plt_nmbr)
    # plot the boundaries
    pl.pcolormesh(xx, yy, Z, cmap=pl.cm.Paired)

    # plot the wolves and cows
    for animal in df.animal.unique():
        pl.scatter(df[df.animal == animal].x,
                   df[df.animal == animal].y,
                   marker=animal,
                   label="cows" if animal == "x" else "wolves",
                   color='black')
    pl.title(clf_name)
    pl.legend(loc="best")


data = open("cows_and_wolves.txt").read()
data = [row.split('\t') for row in data.strip().split('\n')]

animals = []
for y, row in enumerate(data):
    for x, item in enumerate(row):
        # x's are cows, o's are wolves
        if item in ['o', 'x']:
            animals.append([x, y, item])

df = pd.DataFrame(animals, columns=["x", "y", "animal"])
df['animal_type'] = df.animal.apply(lambda x: 0 if x == "x" else 1)

# train using the x and y position coordinates
train_cols = ["x", "y"]

clfs = {
    "SVM": svm.SVC(),
    "Logistic": linear_model.LogisticRegression(),
    "Decision Tree": tree.DecisionTreeClassifier(),
}

plt_nmbr = 1
for clf_name, clf in clfs.iteritems():
    clf.fit(df[train_cols], df.animal_type)
    plot_results_with_hyperplane(clf, clf_name, df, plt_nmbr)
    plt_nmbr += 1
pl.show()
Let the SVM do some more difficult work!
Admittedly, things are not always this easy. If the relationship between the independent and dependent variables is non-linear, it is hard for simpler models to approach the accuracy of an SVM. If that is still hard to picture, consider the following example: suppose we have a data set containing green and red points. When we plot their coordinates, they form a concrete shape: a red circle surrounded by green (it looks like the flag of Bangladesh). If, for some reason, we lose one third of the data set, then as we recover it we would like a way to reconstruct the contour of that missing third as well as possible.
So how do we guess what shape that missing third most closely resembles? One approach is to build a model using the roughly 80% of the data we still have as a training set. Greg Lamp chose three different models to try:
- Logistic model (GLM)
- Decision tree model (DT)
- SVM
Greg Lamp trains each model and then uses it to predict the missing third of the data set. We can see how the different models perform from the results:
The implementation code (svmflag.py, Python) is as follows:
import numpy as np
import pylab as pl
import pandas as pd

from sklearn import svm
from sklearn import linear_model
from sklearn import tree
from sklearn.metrics import confusion_matrix

x_min, x_max = 0, 15
y_min, y_max = 0, 10
step = .1

# to plot the boundary, we're going to create a matrix of every possible point
# then label each point using our classifier
xx, yy = np.meshgrid(np.arange(x_min, x_max, step), np.arange(y_min, y_max, step))

df = pd.DataFrame(data={'x': xx.ravel(), 'y': yy.ravel()})
df['color_gauge'] = (df.x - 7.5)**2 + (df.y - 5)**2
df['color'] = df.color_gauge.apply(lambda x: "red" if x <= 15 else "green")
df['color_as_int'] = df.color.apply(lambda x: 0 if x == "red" else 1)

print "Points on flag:"
print df.groupby('color').size()
print

figure = 1

# plot a figure for the entire dataset
for color in df.color.unique():
    idx = df.color == color
    pl.subplot(2, 2, figure)
    pl.scatter(df[idx].x, df[idx].y, color=color)
    pl.title('Actual')

train_idx = df.x < 10

train = df[train_idx]
test = df[~train_idx]

print "Training Set Size: %d" % len(train)
print "Test Set Size: %d" % len(test)

# train using the x and y position coordinates
cols = ["x", "y"]

clfs = {
    "SVM": svm.SVC(degree=0.5),
    "Logistic": linear_model.LogisticRegression(),
    "Decision Tree": tree.DecisionTreeClassifier()
}

# racehorse different classifiers and plot the results
for clf_name, clf in clfs.iteritems():
    figure += 1

    # train the classifier
    clf.fit(train[cols], train.color_as_int)

    # get the predicted values from the test set
    test['predicted_color_as_int'] = clf.predict(test[cols])
    test['pred_color'] = test.predicted_color_as_int.apply(
        lambda x: "red" if x == 0 else "green")

    # create a new subplot on the plot
    pl.subplot(2, 2, figure)
    # plot each predicted color
    for color in test.pred_color.unique():
        # plot only rows where pred_color is equal to color
        idx = test.pred_color == color
        pl.scatter(test[idx].x, test[idx].y, color=color)

    # plot the training set as well
    for color in train.color.unique():
        idx = train.color == color
        pl.scatter(train[idx].x, train[idx].y, color=color)

    # add a dotted line to show the boundary between the training and test set
    # (everything to the right of the line is in the test set)
    # this plots a vertical line
    train_line_y = np.linspace(y_min, y_max)         # evenly spaced array from 0 to 10
    train_line_x = np.repeat(10, len(train_line_y))  # repeat 10 (threshold for training set) n times
    # add a black, dotted line to the subplot
    pl.plot(train_line_x, train_line_y, 'k--', color="black")
    pl.title(clf_name)

    print "Confusion Matrix for %s:" % clf_name
    print confusion_matrix(test.color, test.pred_color)
pl.show()
Conclusion:
From these experimental results, SVM is without doubt the clear winner. To see why, look at the DT and GLM models: both are obviously using straight-line boundaries. Greg Lamp's inputs contained no transformations that would let the models capture the non-linear relationship among x, y, and color. If Greg Lamp had to define specific transformations himself just so that the GLM and DT models could produce better results, why waste the time? In fact, with no complex transformations or scaling at all, SVM misclassified only 117 of the 5,000 test points (98% accuracy, compared with 51% for the DT model and only 12% for the GLM!).
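To make that last point concrete, here is a small sketch (not in the original article): if we hand the logistic model the non-linear transformation ourselves, namely the color_gauge column (squared distance from the flag's centre) that svmflag.py already computes, then plain logistic regression separates red from green almost perfectly. The appeal of SVM is that it finds such a transformation on its own. The snippet assumes the train and test DataFrames from the script above are still in scope.

from sklearn import linear_model

# assumes the train/test DataFrames from svmflag.py above are in scope
glm = linear_model.LogisticRegression()
glm.fit(train[["color_gauge"]], train.color_as_int)
score = glm.score(test[["color_gauge"]], test.color_as_int)
print("GLM accuracy with a hand-crafted feature: %.3f" % score)  # should be close to 1.0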
Where is the limitation?
Many people wonder: if SVM is so powerful, why not use it for everything? Unfortunately, the most magical part of SVM is also its biggest weakness: the complex data transformations and the resulting boundaries are very hard to explain. That is why SVM is often called a black box, whereas the GLM and DT models are just the opposite: they are easy to understand. (Compiled by @CSDN Wang Peng, reviewed by Zhonghao)
This article was compiled by CSDN and may not be reproduced without permission. To reprint, please contact [email protected]