Machine Learning for Programmers: The SVM Algorithm

Source: Internet
Author: User
Tags: svm

http://www.csdn.net/article/2012-12-28/2813275-Support-Vector-Machine

Abstract: The support vector machine (SVM) has become a very popular algorithm. This article explains how SVM works and gives a few examples using the Python scikits library. As a machine learning algorithm, SVM can be used to solve classification and regression problems; it uses the kernel trick to transform the data and then, based on these transformations, finds an optimal boundary among the possible outputs.

"CSDN report" Support vector Machines (SVM) has become a very popular algorithm. In this article, Greg lamp simply explains how it works, and he gives a few examples of using the Python scikits library. All the code is available on GitHub, and Greg lamp will further elaborate on the details of using scikits and Sklearn. CSDN This technical article is compiled and organized:

What is SVM?

SVM is a machine learning algorithm that can be used to solve both classification and regression problems. It uses a technique called the kernel trick to transform the data and then, based on these transformations, finds an optimal boundary between the possible outputs. In short, it performs some very complex data transformations and then works out how to separate the user's data according to the predefined labels or outputs.
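As a minimal sketch of that idea with scikit-learn (the toy points and labels below are made up purely for illustration and are not from the article), a classifier is trained on labeled data and then asked to label a new point:

from sklearn import svm

# two 2-D points with predefined labels (0 and 1) -- purely illustrative data
X = [[0, 0], [1, 1]]
y = [0, 1]

# fit a support vector classifier, then ask it to label a new, unseen point
clf = svm.SVC()
clf.fit(X, y)
print(clf.predict([[2., 2.]]))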

What makes it so powerful?

Of course, SVM can handle both classification and regression. In this article, Greg Lamp focuses on how to classify with SVM, especially non-linear SVM, that is, SVM using a non-linear kernel. Non-linear SVM means that the boundary the algorithm computes does not have to be a straight line. The advantage is that it can capture much more complex relationships between the data points without the user having to work out the difficult transformations by hand. The disadvantage is that training takes much longer because far more computation is involved.
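To make that distinction concrete, here is a small sketch (not from the article) of how the kernel choice is expressed in scikit-learn; "linear" and "rbf" are real kernel options of svm.SVC:

from sklearn import svm

# a linear kernel can only produce a straight-line boundary
linear_clf = svm.SVC(kernel="linear")

# a non-linear kernel (here the RBF kernel, which is also the default) can
# curve the boundary around the data, at the cost of longer training time
rbf_clf = svm.SVC(kernel="rbf")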

What is the kernel trick?

The kernel trick takes the data you give it and transforms it: in go some features you think will make a good classifier, and out comes data you no longer recognize. It is a bit like unraveling a strand of DNA. You start with an innocent-looking vector of data, pass it through the kernel trick, and it is decomposed and recombined until it becomes a much larger data set that is very hard to understand by inspection. But here lies the magic: the expanded data set has much more obvious boundaries between the classes, and the SVM algorithm can compute a much more optimal hyperplane.
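A small worked sketch of the trick itself (not from the article): for 2-D points, the degree-2 polynomial kernel k(a, b) = (a·b)² returns exactly the inner product the points would have after being explicitly expanded into a larger feature space, without ever building that expansion:

import numpy as np

def explicit_map(p):
    # explicit degree-2 features of a 2-D point (x1, x2): (x1^2, sqrt(2)*x1*x2, x2^2)
    x1, x2 = p
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

a = np.array([1.0, 2.0])
b = np.array([3.0, 0.5])

# both numbers are 16 (up to floating-point rounding): the kernel gives the
# inner product in the expanded space without constructing the expanded vectors
print(np.dot(explicit_map(a), explicit_map(b)))
print(np.dot(a, b) ** 2)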

For example, suppose you are a farmer and you have a problem: you need to build a fence to protect your herd from wolves. But where should the fence go? If you are a data-driven farmer, one approach is to build a "classifier" over your ranch based on the positions of the cows and the wolves. Comparing several different classifiers, we can see that SVM separates the two perfectly. Greg Lamp sees this story as a nice illustration of the advantage of a non-linear classifier: it is obvious that both the logistic model and the decision tree model use straight-line boundaries.

The implementation code is as follows (farmer.py, Python):

import numpy as np
import pylab as pl
import pandas as pd

from sklearn import svm
from sklearn import linear_model
from sklearn import tree


def plot_results_with_hyperplane(clf, clf_name, df, plt_nmbr):
    x_min, x_max = df.x.min() - .5, df.x.max() + .5
    y_min, y_max = df.y.min() - .5, df.y.max() + .5

    # step between points, i.e. [0, 0.02, 0.04, ...]
    step = .02
    # to plot the boundary, we're going to create a matrix of every possible point,
    # then label each point as a wolf or a cow using our classifier
    xx, yy = np.meshgrid(np.arange(x_min, x_max, step),
                         np.arange(y_min, y_max, step))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    # this gets our predictions back into a matrix
    Z = Z.reshape(xx.shape)

    # create a subplot (we're going to have more than one plot on a given image)
    pl.subplot(2, 2, plt_nmbr)
    # plot the boundaries
    pl.pcolormesh(xx, yy, Z, cmap=pl.cm.Paired)

    # plot the wolves and cows
    for animal in df.animal.unique():
        pl.scatter(df[df.animal == animal].x,
                   df[df.animal == animal].y,
                   marker=animal,
                   label="cows" if animal == "x" else "wolves",
                   color="black")
    pl.title(clf_name)
    pl.legend(loc="best")


data = open("cows_and_wolves.txt").read()
data = [row.split('\t') for row in data.strip().split('\n')]

animals = []
for y, row in enumerate(data):
    for x, item in enumerate(row):
        # x's are cows, o's are wolves
        if item in ['o', 'x']:
            animals.append([x, y, item])

df = pd.DataFrame(animals, columns=["x", "y", "animal"])
df['animal_type'] = df.animal.apply(lambda x: 0 if x == "x" else 1)

# train using the x and y position coordinates
train_cols = ["x", "y"]

clfs = {
    "SVM": svm.SVC(),
    "Logistic": linear_model.LogisticRegression(),
    "Decision Tree": tree.DecisionTreeClassifier(),
}

plt_nmbr = 1
for clf_name, clf in clfs.items():
    clf.fit(df[train_cols], df.animal_type)
    plot_results_with_hyperplane(clf, clf_name, df, plt_nmbr)
    plt_nmbr += 1
pl.show()

Let the SVM do some more difficult work!

Admittedly, when the relationship between the independent and dependent variables is non-linear, other models have a hard time approaching the accuracy of SVM. If this is still hard to follow, look at the following example. Suppose we have a dataset containing green and red points. When we plot their coordinates they form a concrete shape: a red region surrounded by green (it looks like the flag of Bangladesh). If, for some reason, we lose 1/3 of the dataset, then as we recover it we want a way to approximate the contours of that missing third as closely as possible.

So how do we infer what shape the missing third most closely resembled? One approach is to build a model using the data we still have as a "training set." Greg Lamp chose three different models to try:

    • Logistic model (GLM)
    • Decision tree model (DT)
    • SVM

Greg Lamp trains each model and then uses it to predict the missing third of the dataset. We can look at what these different models predict:

The implementation code is as follows (svmflag.py, Python):

import numpy as np
import pylab as pl
import pandas as pd

from sklearn import svm
from sklearn import linear_model
from sklearn import tree
from sklearn.metrics import confusion_matrix

x_min, x_max = 0, 15
y_min, y_max = 0, 10
step = .1
# to plot the boundary, we're going to create a matrix of every possible point,
# then label each point as red or green using our classifier
xx, yy = np.meshgrid(np.arange(x_min, x_max, step), np.arange(y_min, y_max, step))

df = pd.DataFrame(data={'x': xx.ravel(), 'y': yy.ravel()})
df['color_gauge'] = (df.x - 7.5)**2 + (df.y - 5)**2
df['color'] = df.color_gauge.apply(lambda x: "red" if x <= 15 else "green")
df['color_as_int'] = df.color.apply(lambda x: 0 if x == "red" else 1)

print("Points on flag:")
print(df.groupby('color').size())
print()

figure = 1

# plot a figure for the entire dataset
for color in df.color.unique():
    idx = df.color == color
    pl.subplot(2, 2, figure)
    pl.scatter(df[idx].x, df[idx].y, color=color)
    pl.title('Actual')

train_idx = df.x < 10

train = df[train_idx].copy()
test = df[~train_idx].copy()

print("Training Set Size: %d" % len(train))
print("Test Set Size: %d" % len(test))

# train using the x and y position coordinates
cols = ["x", "y"]

clfs = {
    "SVM": svm.SVC(),
    "Logistic": linear_model.LogisticRegression(),
    "Decision Tree": tree.DecisionTreeClassifier()
}

# race the different classifiers and plot the results
for clf_name, clf in clfs.items():
    figure += 1

    # train the classifier
    clf.fit(train[cols], train.color_as_int)

    # get the predicted values from the test set
    test['predicted_color_as_int'] = clf.predict(test[cols])
    test['pred_color'] = test.predicted_color_as_int.apply(
        lambda x: "red" if x == 0 else "green")

    # create a new subplot on the plot
    pl.subplot(2, 2, figure)
    # plot each predicted color
    for color in test.pred_color.unique():
        # plot only rows where pred_color equals color
        idx = test.pred_color == color
        pl.scatter(test[idx].x, test[idx].y, color=color)

    # plot the training set as well
    for color in train.color.unique():
        idx = train.color == color
        pl.scatter(train[idx].x, train[idx].y, color=color)

    # add a dotted line to show the boundary between the training and test sets
    # (everything to the right of the line is in the test set)
    # this plots a vertical line
    train_line_y = np.linspace(y_min, y_max)          # evenly spaced array from 0 to 10
    train_line_x = np.repeat(10, len(train_line_y))   # repeat 10 (the training-set threshold) n times
    # add a black, dotted line to the subplot
    pl.plot(train_line_x, train_line_y, 'k--', color="black")

    pl.title(clf_name)

    print("Confusion Matrix for %s:" % clf_name)
    print(confusion_matrix(test.color, test.pred_color))
pl.show()

Conclusion:

From these experimental results there is no doubt that SVM is the clear winner. To see why, look at the DT and GLM models: both clearly use straight-line boundaries. The inputs Greg Lamp gave them contained no transformations that would capture the non-linear relationship between x, y, and color. Had he defined specific transformations up front, the GLM and DT models might have produced better results, but why waste time guessing transformations? With no complex transformation or compression at all, SVM misclassified only 117 of the 5,000 points (98% accuracy, compared with 51% for the DT model and just 12% for the GLM model!).
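Accuracy figures like these can be read straight off the confusion matrices that svmflag.py prints: the diagonal holds the correctly classified points. A small, self-contained sketch of that calculation (the labels below are made up; in the script they would be test.color and test.pred_color):

import numpy as np
from sklearn.metrics import confusion_matrix

# hypothetical true and predicted colors, standing in for test.color / test.pred_color
y_true = ["red", "red", "green", "green", "green"]
y_pred = ["red", "green", "green", "green", "green"]

cm = confusion_matrix(y_true, y_pred)
accuracy = np.trace(cm) / float(cm.sum())   # correctly classified points / all points
print(cm)
print("accuracy: %.0f%%" % (accuracy * 100))   # 80% for this toy data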

Where is the limitation?

Many people wonder: since SVM is so powerful, why not use it for everything? Unfortunately, the most magical part of SVM is also its biggest weakness: the complex data transformations and the resulting boundaries are very hard to explain. That is why SVM is often called a black box. The GLM and DT models are just the opposite: they are easy to understand. (Compiled by @CSDN Wang Peng, reviewed by Zhonghao)

This article was compiled by CSDN and may not be reproduced without permission. To reprint it, please contact [email protected]
