Getting Started with Python Data Mining and Machine Learning in Practice


Summary: What is data mining? What is machine learning? And how is data preprocessing done in Python? This article introduces data mining and machine learning, walks through data preprocessing in practice using Taobao product data, and introduces several classification algorithms through the iris case.

About the presenter:
Wei Chi is an entrepreneur, a senior IT specialist, lecturer, and writer, the author of a best-selling book on Python web crawlers, and a technical expert in the Aliyun community.

The following content is compiled from the presenter's video talk and slides.

This course covers five topics:
1. Introduction to data mining and machine learning
2. Python data preprocessing in practice
3. Introduction to common classification algorithms
4. The iris classification case
5. Ideas and tips for choosing a classification algorithm

First, an introduction to data mining and machine learning

What is data mining? Data mining means processing and analyzing existing data to uncover the deeper relationships within it. For example, when arranging goods in a supermarket, does milk sell better when it is placed next to bread, or next to some other product? Data mining can answer questions like this. More specifically, the supermarket product placement problem belongs to the association analysis scenario.

Data mining is widely used in everyday life. For example, a merchant often needs to classify its customers into tiers (SVIP, VIP, ordinary customers, and so on). Part of the customer data can be used as training data and another part as test data. The training data is fed into a model for training; once training is complete, the other part of the data is fed in for testing, and in the end customer tiers can be assigned automatically. Similar applications include CAPTCHA recognition, automatic fruit quality screening, and more.

So what is machine learning? In a word, any technique in which a machine learns the relationships or rules in data through the models and algorithms we build, and we then put what it has learned to use, is machine learning. Machine learning is in fact an interdisciplinary field. It can be roughly divided into two categories: traditional machine learning and deep learning, where deep learning covers neural-network-related techniques. This course focuses on traditional machine learning and its algorithms.

Because machine learning and data mining both aim to discover the patterns in data, the two are usually mentioned together. Both also have very broad applications in real life. The classic categories of application scenarios are as follows:

1, Classification: customer tiering, CAPTCHA recognition, automatic fruit quality screening, etc.

Machine learning and data mining can be used to solve classification problems such as customer tiering, CAPTCHA recognition, and automatic fruit quality screening.

Take CAPTCHA recognition as an example: we need to design a scheme that recognizes a verification code made up of the handwritten digits 0 to 9. One approach is to first set aside some handwritten samples of the digits 0 to 9 as a training set, and then label the training set by hand, mapping each handwritten sample to its digit category. Once this mapping exists, a classification algorithm can build the corresponding model. When a new handwritten digit appears, the model can predict which digit it represents, that is, which digit category it belongs to. For example, if the model predicts that a sample belongs to the category for the digit 1, the sample is automatically recognized as the digit 1. So CAPTCHA recognition is essentially a classification problem.

Automatic fruit quality screening is also a classification problem. Features such as the size and color of a fruit can be mapped to a sweetness category, for example category 1 for sweet and category 0 for not sweet. After some training data has been collected, a model can again be built with a classification algorithm. When a new fruit appears, its size, color, and other features are used to judge automatically whether it is sweet. In this way, automatic fruit quality screening is achieved.

2, Regression: prediction of continuous data, trend prediction, etc.

Besides classification, data mining and machine learning have another classic scenario: regression. In the classification scenarios above, the number of categories is limited. For example, digit CAPTCHA recognition involves the digit categories 0 to 9, and letter CAPTCHA recognition involves the limited categories A to Z. Whether the categories are digits or letters, their number is finite.

Now suppose that for some data the desired result does not fall on a discrete point such as 0, 1, or 2, but on values such as 1.2, 1.3, 1.4, and so on. A classification algorithm cannot solve this kind of problem, but a regression analysis algorithm can. In practice, regression analysis is used for continuous-value prediction and trend prediction.
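As a minimal sketch of the difference from classification, assuming made-up data points and a straight-line fit with NumPy's polyfit (the talk does not name a specific regression method), a regression model returns a continuous value rather than a category:

```python
import numpy as np

# Hypothetical data: x is a feature, y is a continuous target.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.1, 1.2, 1.3, 1.4, 1.5])

# Fit a straight line y = slope * x + intercept by least squares.
slope, intercept = np.polyfit(x, y, deg=1)

# Predict a continuous value for an input that was not in the data.
prediction = slope * 6.0 + intercept
print(round(prediction, 2))
```

The predicted value lies between the discrete points a classifier would be limited to, which is the situation described above.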

3, Clustering: customer value prediction, business district prediction, etc.

What is clustering? As mentioned above, solving a classification problem requires historical data (that is, correctly labeled training data prepared by people). If there is no historical data and objects must be assigned to categories directly from their features, classification and regression algorithms cannot help. The solution in this case is clustering. A clustering method groups objects into categories directly from their features and does not need labeled training data, so it is an unsupervised learning method.

When can clustering be used? Suppose a database holds feature data for a group of customers, and these customers need to be divided into tiers (such as SVIP customers and VIP customers) directly from their features. A clustering model can solve this. Clustering algorithms can also be used for business district prediction.
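The customer-tier idea can be sketched as follows. This is only an illustration under assumptions: the customer features are invented, and KMeans from scikit-learn is one possible clustering algorithm, not necessarily the one the presenter had in mind.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customer features: [yearly spend, number of orders].
customers = np.array([
    [100, 2], [120, 3], [90, 1],          # low-activity customers
    [5000, 40], [5200, 45], [4800, 38],   # high-activity customers
])

# No labels are supplied; KMeans groups the rows purely by their features.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(customers)
print(labels)
```

The cluster numbers themselves are arbitrary; what matters is which customers end up in the same group.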

4, Association analysis: supermarket product placement, personalized recommendation, etc.

Association analysis means analyzing the relationships between items. For example, a supermarket stocks a large number of products and needs to analyze how they relate to one another, such as how strongly bread and milk are associated. An association analysis algorithm can do this directly from information such as customers' purchase records. Once the associations are known, they can be applied to product placement: putting strongly associated products close to each other can effectively increase the supermarket's sales.
Association analysis can also be used for personalized recommendation. For example, users' browsing records can be analyzed to find the relationships between pages, and strongly associated pages can then be pushed to users as they browse. If the browsing data shows a strong association between page A and page C, then page C can be pushed to a user who is viewing page A, which achieves personalized recommendation.
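One common way to measure the strength of an association is with support and confidence. The sketch below uses invented purchase records and hypothetical helper names (support, confidence); the talk does not say which association algorithm it uses.

```python
# Hypothetical purchase records: each set is one customer's basket.
baskets = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "milk", "butter"},
]

def support(items):
    """Fraction of baskets that contain every item in `items`."""
    return sum(items <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent):
    """Of the baskets containing `antecedent`, the fraction that also contain `consequent`."""
    return support(antecedent | consequent) / support(antecedent)

bread_milk_support = support({"bread", "milk"})
bread_to_milk = confidence({"bread"}, {"milk"})
print(bread_milk_support, bread_to_milk)
```

A high confidence for "bread, then milk" is the kind of evidence that would justify placing the two products near each other.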

5, Natural language processing: text similarity, chatbots, etc.

Beyond the scenarios above, data mining and machine learning can also be used for natural language processing, speech processing, and more, for example computing text similarity and building chatbots.

Second, Python data preprocessing in practice

Before any data mining or machine learning, the first step is to preprocess the existing data. If the initial data is wrong, the final result cannot be trusted. Only by preprocessing the data and making sure it is accurate can the correctness of the final result be ensured.

Data preprocessing means doing initial processing on the data to deal with dirty data (that is, data that harms the accuracy of the results); otherwise the final results are easily affected. Common data preprocessing methods are as follows:

1, Missing value handling

A missing value is a feature value that is absent from a row in a dataset. There are two ways to handle missing values: delete the row that contains the missing value, or fill in the missing value with an appropriate value.

2, Outlier handling

Outliers are often caused by errors during data collection, for example the number 68 being mistakenly recorded as 680. To deal with outliers you first need to find them, which is often done by plotting the data. Once the outliers have been handled, the data is closer to correct, which helps ensure the accuracy of the final result.

3, Data integration

Compared with missing value handling and outlier handling, data integration is a simple preprocessing method. So what is data integration? Suppose there are two structured datasets, A and B, both already loaded into memory. If the user wants to combine them into a single dataset, this can be done directly with pandas, and that merging process is data integration.
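A minimal sketch of combining datasets with pandas, assuming made-up frames named a, b, and c: concat stacks rows of datasets with the same columns, and merge joins columns on a shared key.

```python
import pandas as pd

# Hypothetical structured datasets already in memory.
a = pd.DataFrame({"id": [1, 2], "price": [10.0, 20.0]})
b = pd.DataFrame({"id": [3, 4], "price": [30.0, 40.0]})
c = pd.DataFrame({"id": [1, 2, 3, 4], "comment": [5, 0, 12, 7]})

# Stack the rows of two datasets that share the same columns.
stacked = pd.concat([a, b], ignore_index=True)

# Join columns from another dataset on the shared "id" key.
merged = stacked.merge(c, on="id", how="left")
print(merged)
```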

Next, Taobao product data is used as an example to put the preprocessing methods above into practice.

Before preprocessing, the Taobao product data first needs to be imported from the MySQL database. After opening the MySQL database, querying the taob table gives the following output:

As you can see, the taob table has four fields. The title field stores the name of the Taobao product; the link field stores the link to the product; price stores the product's price; and comment stores the number of comments on the product (which to some extent reflects its sales).

So how is this data imported? First connect to the database with pymysql (if garbled characters appear, the pymysql source needs to be modified). After connecting successfully, retrieve all the data in taob and load it into memory with the pandas read_sql() method. The first parameter of read_sql() is a SQL statement, and the second is the connection to the MySQL database. The specific code is as follows:
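The original code is not reproduced here, so the following is only a sketch of the read_sql() call pattern. To stay runnable without a MySQL server it uses an in-memory sqlite3 database as a stand-in; the table contents and URLs are invented. With pymysql, the connection object would come from pymysql.connect(...) instead.

```python
import sqlite3
import pandas as pd

# Stand-in for the MySQL connection. With pymysql this would be something like
# pymysql.connect(host=..., user=..., password=..., database=...).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE taob (title TEXT, link TEXT, price REAL, comment INTEGER)")
conn.executemany(
    "INSERT INTO taob VALUES (?, ?, ?, ?)",
    [("snack", "http://example.com/1", 9.9, 120),
     ("bag", "http://example.com/2", 899.0, 3)],
)

# First argument: the SQL statement. Second argument: the connection.
data = pd.read_sql("SELECT * FROM taob", conn)
print(data)
conn.close()
```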

1, Missing value handling in practice

Missing values can be handled through data cleaning. Taking the Taobao product data above as an example, the number of comments on a product may be 0, but its price cannot be 0. In fact, the database does contain rows whose price is 0, which happens because the price attribute of those rows was not crawled successfully.

So how can you tell whether the data has missing values? One way is to call the data.describe() method on the data loaded from the taob table, which gives the results shown in the following figure:

How should these statistics be read? First look at the count values for the price and comment fields. If they are not equal, some values must be missing; if they are equal, this check alone gives no sign of missing values. For example, the count for price is 9616.0000 and the count for comment is 9615.0000, which indicates that at least one comment value is missing.

The other rows mean the following: mean is the average, std is the standard deviation, min is the minimum value, and max is the maximum value.
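The count check can be reproduced on a small made-up frame (the values below are assumptions, not the real taob data):

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing comment value.
data = pd.DataFrame({
    "price": [9.9, 0.0, 59.0, 120.0],
    "comment": [120, 3, np.nan, 45],
})

stats = data.describe()
print(stats.loc[["count", "mean", "std", "min", "max"]])

# A lower count in one column means that column has missing values.
missing = len(data) - stats.loc["count"]
print(missing)
```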

How should the missing data be handled? One way is to delete the rows; another is to fill in a new value where the value is missing. The fill value in the second approach can be the mean or the median, and which to use depends on the data. For data such as age (1 to 100 years old), which is stable and varies within a small range, the mean is generally used; for data that varies over a large range, the median is generally used.

The specific steps for handling missing price values are as follows:
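A minimal sketch of both options with pandas, assuming invented data and assuming that a price of 0 marks a value that was not crawled:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"price": [9.9, 0.0, 59.0, 120.0, 0.0],
                     "comment": [120, 3, 18, 45, 0]})

# Treat a price of 0 as missing (an assumption based on the crawl failure described above).
data["price"] = data["price"].replace(0, np.nan)

# Option 1: drop the rows whose price is missing.
dropped = data.dropna(subset=["price"])

# Option 2: fill the missing prices with the median of the known prices.
filled = data.fillna({"price": data["price"].median()})
print(filled)
```

The median is used here because prices vary over a wide range; for a stable, narrow-range column the mean could be substituted.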

2, Outlier handling in practice

As with missing values, handling outliers first requires finding them. Outliers are usually found by drawing a scatter plot: similar data clusters together in one region of the plot, while abnormal data lies far from that region. This property makes it easy to spot the outliers in the data, as shown in the following figure:

First, the price and comment data need to be extracted. This could be done with a loop, but that is cumbersome. A simpler way is to transpose the data frame, so that the original columns become rows and the price and comment data are easy to pick out. Next, the scatter plot is drawn with the plot() method: its first parameter is the horizontal axis, the second is the vertical axis, and the third is the type of graph, where "o" means a scatter plot. Finally, the show() method displays the figure so that the outliers can be seen directly. These outliers do not help the analysis; in practice the rows they represent are usually removed, or their values are converted to normal values. The following illustration is a scatter plot:
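The steps just described can be sketched with matplotlib as follows. The data is invented, and the non-interactive Agg backend is chosen only so the sketch runs without a display; in an interactive session plt.show() would be called to view the figure.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

data = pd.DataFrame({"price": [9.9, 59.0, 120.0, 35.0, 88000.0],
                     "comment": [120, 18, 45, 260000, 7]})

# Transpose so that each original column becomes a row that can be picked out directly.
transposed = data.T
price = transposed.loc["price"]
comment = transposed.loc["comment"]

# Horizontal axis, vertical axis, and "o" to draw unconnected points (a scatter plot).
lines = plt.plot(price, comment, "o")
plt.xlabel("price")
plt.ylabel("comment")
plt.close()  # in an interactive session, call plt.show() here instead
```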

Based on the figure above, the outliers can be handled by processing the data points whose comment value is greater than 100000 or whose price is greater than 1000. There are two specific ways to do this:

The first is to replace the value with the median, the mean, or some other value. The following figure shows the exact operation:

The second is deletion, that is, removing the abnormal rows directly. This is also the recommended method. The following figure shows the exact operation:
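Both methods can be sketched with pandas. The thresholds come from the text; the data and the choice of the median as the replacement value are assumptions for illustration.

```python
import pandas as pd

data = pd.DataFrame({"price": [9.9, 59.0, 120.0, 35.0, 88000.0],
                     "comment": [120, 18, 45, 260000, 7]})

# Thresholds from the text: comment > 100000 or price > 1000 counts as an outlier.
is_outlier = (data["comment"] > 100000) | (data["price"] > 1000)
normal = data[~is_outlier]

# Method 1: replace the outlying values with the median of the normal rows.
replaced = data.copy()
replaced.loc[data["price"] > 1000, "price"] = normal["price"].median()
# int() keeps the comment counts as whole numbers.
replaced.loc[data["comment"] > 100000, "comment"] = int(normal["comment"].median())

# Method 2: delete the outlying rows.
deleted = normal.reset_index(drop=True)
print(replaced)
print(deleted)
```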

3, Distribution analysis

Distribution analysis means analyzing how the data is distributed, for example whether it follows a linear distribution or a normal distribution. A histogram is generally used for this. Drawing the histogram involves the following steps: compute the range (maximum minus minimum), compute the bin width, and draw the histogram. The specific actions are shown in the following illustration:

Here the arange() method is used to set the bin edges: its first parameter is the minimum, the second is the maximum, and the third is the bin width. The hist() method is then used to draw the histogram.
The price histogram for the Taobao products in the taob table is shown in the following figure and roughly follows a normal distribution:

The comment histogram for the Taobao products in the taob table is shown in the following figure and is roughly a decreasing curve:
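The three histogram steps can be sketched as follows, assuming invented prices and an arbitrary choice of nine bins. Note that NumPy's arange excludes its stop value, so the stop is extended by one bin width to make the maximum an edge.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical prices.
prices = np.array([10, 35, 42, 48, 50, 51, 55, 58, 63, 70, 100], dtype=float)

# Step 1: the range. Step 2: the bin width for a chosen number of bins.
low, high = prices.min(), prices.max()
value_range = high - low
bin_width = value_range / 9

# Bin edges from the minimum to the maximum in steps of bin_width.
edges = np.arange(low, high + bin_width, bin_width)

# Step 3: draw the histogram using those edges.
counts, _, _ = plt.hist(prices, bins=edges)
plt.close()  # in an interactive session, call plt.show() here instead
print(counts)
```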

4, Drawing a word cloud

Sometimes a word cloud needs to be drawn from a piece of text. The specific operations are as follows:

The general process is: first use cut() to segment the document into words; once segmentation is done, arrange the words into a fixed format; then read in an image that matches the shape the word cloud should take (the word cloud in the image below is in the shape of a cat); then use wc.WordCloud() to generate the word cloud; and finally display it with imshow(). For example, the word cloud drawn from the document old nine.txt is shown in the following illustration:

Third, an introduction to common classification algorithms

There are many common classification algorithms, as shown in the following illustration:

The KNN algorithm and the Bayesian algorithm are both important. Other algorithms include the decision tree algorithm, the logistic regression algorithm, and the SVM algorithm. The AdaBoost algorithm is mainly used to turn weak classifiers into a strong classifier.

Fourth, the iris classification case in practice

Suppose we have some iris data that contains features of the flowers: petal length, petal width, sepal length, and sepal width. With this historical data a classification model can be trained. Once training is complete, when a new iris of unknown type appears, the trained model can be used to determine its type. This case can be implemented in different ways, so which classification algorithm is the better choice?

1, KNN algorithm

(1) Introduction to the KNN algorithm

First consider this problem. Among the Taobao products above there are three categories: snacks, brand-name bags, and electrical appliances. Each has two features: price and comment. Ordered by price, brand-name bags are the most expensive, electrical appliances come second, and snacks are the cheapest. Ordered by number of comments, snacks have the most, electrical appliances come second, and brand-name bags have the fewest. With price on the x axis and comment on the y axis, the three categories can be plotted in a rectangular coordinate system, as shown in the following figure:

It is clear that the three categories are concentrated in different regions. Now suppose a new product with known features appears and is marked in the figure at the position given by its features. The question is which of the three categories the product most likely belongs to.

This kind of problem can be solved with the KNN algorithm. The algorithm computes the Euclidean distance from the unknown product to the known products and sorts the distances; the smaller the distance, the more similar the unknown product is to that kind of product. The category that dominates among the k nearest products is taken as the prediction. For example, if the computation shows that the unknown product is closest to the electrical appliances, the product can be considered an electrical appliance.

(2) Implementation

The specific implementation of the above process is as follows:
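The presenter's code is not reproduced here. The following is a minimal sketch of the textbook k-nearest-neighbour procedure (compute distances, sort, take a majority vote among the k closest points), with a hypothetical function name knn_predict and invented product data:

```python
import numpy as np
from collections import Counter

def knn_predict(query, features, labels, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point.
    distances = np.sqrt(((features - query) ** 2).sum(axis=1))
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # The most common label among those neighbours wins.
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical products: [price, comment count].
features = np.array([[10, 900], [15, 800], [12, 950],          # snacks
                     [3000, 20], [3500, 10], [2800, 30],       # brand-name bags
                     [1200, 300], [1500, 250], [1000, 350]])   # electrical appliances
labels = ["snack"] * 3 + ["bag"] * 3 + ["appliance"] * 3

result = knn_predict(np.array([1300, 280]), features, labels)
print(result)
```

In real data the two features have very different scales, so scaling them before computing distances is usually worth considering.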

Of course, an existing library can also be called directly, which is more concise and convenient. The drawback is that the user does not get to understand how it works:
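As one example of calling a library directly, scikit-learn provides KNeighborsClassifier. Whether this is the library call used in the talk is an assumption, and the product data below is invented.

```python
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical products: [price, comment count].
features = [[10, 900], [15, 800], [12, 950],
            [3000, 20], [3500, 10], [2800, 30],
            [1200, 300], [1500, 250], [1000, 350]]
labels = ["snack"] * 3 + ["bag"] * 3 + ["appliance"] * 3

model = KNeighborsClassifier(n_neighbors=3)
model.fit(features, labels)
predicted = model.predict([[1300, 280]])
print(predicted[0])
```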

(3) Using the KNN algorithm to solve the iris classification problem

The iris data is loaded first. There are two ways to do this. One is to read directly from the iris dataset file: set the path, read the file with the read_csv() method, and then separate the features from the results (the categories). The specific operations are as follows:

The other way is to load the data with sklearn. The iris dataset is included in sklearn's datasets module and can be loaded with the load_iris() method. The features and categories are then obtained in the same way, and the data is split into training data and test data (typically for cross-validation) with the train_test_split() method, whose third parameter is the proportion of test data and whose fourth parameter is the random seed. The operations are as follows:

Once loading is complete, the KNN algorithm described above can be called to classify the data.
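A minimal sketch of the sklearn route, from loading to classification. The test proportion of 0.25, the seed 33, and k = 5 are arbitrary choices for illustration, not values taken from the talk.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
x, y = iris.data, iris.target  # features and categories

# test_size is the proportion of test data; random_state is the random seed.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=33)

# Train KNN on the training part and measure accuracy on the held-out part.
model = KNeighborsClassifier(n_neighbors=5)
model.fit(x_train, y_train)
accuracy = model.score(x_test, y_test)
print(accuracy)
```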

2, Bayesian algorithm

(1) Introduction to the Bayesian algorithm

First, the naive Bayes formula: $P(B \mid A) = \frac{P(A \mid B)\,P(B)}{P(A)}$. Suppose there is some course data, as shown in the table below, where price and class hours are the features of a course and sales is the result. If a new course has a high price and many class hours, the task is to predict its sales from the existing data.

This is clearly a classification problem. The table is processed first: feature one and feature two are converted to numbers, with 0 for low, 1 for medium, and 2 for high. After this conversion, [[t1,t2],[t1,t2],[t1,t2]] becomes [[0,2],[2,1],[0,0]]. The two-dimensional list is then transposed (to make the later counting easier), so that [[t1,t1,t1],[t2,t2,t2]] becomes [[0,2,0],[2,1,0]]. Here [0,2,0] holds the price of each course and [2,1,0] holds the class hours of each course.

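The original problem is equivalent to asking: given a high price (A) and many class hours (B), what are the probabilities that the new course's sales are high, medium, or low? That is,

$$P(C \mid AB) = \frac{P(AB \mid C)\,P(C)}{P(AB)} = \frac{P(A \mid C)\,P(B \mid C)\,P(C)}{P(AB)} \propto P(A \mid C)\,P(B \mid C)\,P(C),$$

where C has three cases: $C_0$ = high, $C_1$ = medium, $C_2$ = low. Because the denominator $P(AB)$ is the same in all three cases, it is enough to compare the sizes of $P(C_0 \mid AB)$, $P(C_1 \mid AB)$, and $P(C_2 \mid AB)$ through their numerators:

$$P(C_0 \mid AB) \propto P(A \mid C_0)\,P(B \mid C_0)\,P(C_0) = \frac{2}{4} \cdot \frac{2}{4} \cdot \frac{4}{7} = \frac{1}{7}$$
$$P(C_1 \mid AB) \propto P(A \mid C_1)\,P(B \mid C_1)\,P(C_1) = 0$$
$$P(C_2 \mid AB) \propto P(A \mid C_2)\,P(B \mid C_2)\,P(C_2) = 0$$

Clearly $P(C_0 \mid AB)$ is the largest, so the sales of the new course are predicted to be high.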
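The counting behind these numbers can be sketched in a few lines. The original course table is not available, so the table below is hypothetical, constructed only so that it reproduces the fractions in the worked example (2/4, 2/4, and 4/7 for high sales, and zero for the other two cases).

```python
from fractions import Fraction

# Hypothetical courses: (price, hours, sales), with 0 = low, 1 = medium, 2 = high.
courses = [
    (2, 2, "high"), (2, 1, "high"), (1, 2, "high"), (1, 1, "high"),
    (0, 1, "medium"), (1, 0, "medium"),
    (0, 0, "low"),
]

def score(price, hours, sales):
    """P(price | sales) * P(hours | sales) * P(sales), estimated by counting."""
    rows = [c for c in courses if c[2] == sales]
    p_price = Fraction(sum(c[0] == price for c in rows), len(rows))
    p_hours = Fraction(sum(c[1] == hours for c in rows), len(rows))
    prior = Fraction(len(rows), len(courses))
    return p_price * p_hours * prior

# New course: high price (2) and many class hours (2).
scores = {s: score(2, 2, s) for s in ("high", "medium", "low")}
best = max(scores, key=scores.get)
print(scores, best)
```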

(2) Implementation

Like the KNN algorithm, the Bayesian algorithm can be implemented in two ways. One is a detailed, from-scratch implementation:

The other is to call an existing library implementation:
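Which library class the presenter used is not stated. As one possibility, scikit-learn's CategoricalNB handles features encoded as small integers; the course table below is invented for illustration. Note that CategoricalNB applies smoothing by default, so its probabilities are not exactly the raw counts.

```python
from sklearn.naive_bayes import CategoricalNB

# Hypothetical courses: [price, hours], with 0 = low, 1 = medium, 2 = high.
x = [[2, 2], [2, 1], [1, 2], [1, 1], [0, 1], [1, 0], [0, 0]]
y = ["high", "high", "high", "high", "medium", "medium", "low"]

model = CategoricalNB()
model.fit(x, y)

# New course: high price and many class hours.
predicted = model.predict([[2, 2]])
print(predicted[0])
```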

3, Decision tree algorithm

The decision tree algorithm is based on the theory of information entropy. Its computation is divided into the following steps:
(1) Compute the total information entropy
(2) Compute the information entropy of each feature
(3) Compute E (the weighted entropy that remains after splitting on a feature) and the information gain: E = total information entropy - information gain, and information gain = total information entropy - E
(4) The smaller E is, the greater the information gain and the smaller the uncertainty
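The steps above can be sketched as follows, assuming Shannon entropy in bits and hypothetical function names (entropy, information_gain); the labels and features are invented.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(feature_values, labels):
    """Total entropy minus E, the weighted entropy after splitting on the feature."""
    total = len(labels)
    e = 0.0
    for value in set(feature_values):
        subset = [l for f, l in zip(feature_values, labels) if f == value]
        e += len(subset) / total * entropy(subset)
    return entropy(labels) - e

labels = ["yes", "yes", "no", "no"]
perfect = [1, 1, 0, 0]   # splits the labels cleanly, so E is 0
useless = [1, 0, 1, 0]   # tells us nothing about the label, so E equals the total entropy
gain_perfect = information_gain(perfect, labels)
gain_useless = information_gain(useless, labels)
print(gain_perfect, gain_useless)
```

The feature with the larger information gain is the better one to split on first.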

A decision tree works through the features of the data one at a time. For the first feature, whether or not to take it into account (0 means not considered, 1 means considered) produces a binary branch; the second feature is then handled in the same way, and so on until all the features have been considered, at which point a decision tree has been formed. The following figure is a decision tree:

The decision tree implementation process is: first extract the category column from the data, then convert the descriptive values into numbers (for example, "yes" becomes 1 and "no" becomes 0), build the decision tree with DecisionTreeClassifier from sklearn, and train it with the fit() method. Once training is complete, predict() returns the results directly, and export_graphviz can be used to visualize the decision tree. The specific implementation process is shown in the following illustration:
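A minimal sketch of that process with scikit-learn. The features, their names, and the data are hypothetical; criterion="entropy" is chosen to match the information-entropy description above.

```python
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Hypothetical data: [is_practical, many_hours], with "yes" encoded as 1 and "no" as 0.
x = [[1, 1], [1, 0], [0, 1], [0, 0]]
y = [1, 1, 0, 0]  # 1 = high sales, 0 = low sales

model = DecisionTreeClassifier(criterion="entropy", random_state=0)
model.fit(x, y)
predicted = model.predict([[1, 0]])

# With out_file=None, export_graphviz returns the tree as DOT text,
# which Graphviz can render to an image.
dot_text = export_graphviz(model, out_file=None,
                           feature_names=["is_practical", "many_hours"])
print(predicted[0])
```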

4, Logistic regression algorithm

The logistic regression algorithm builds on linear regression. Suppose there is a linear regression function y = a1x1 + a2x2 + a3x3 + ... + anxn + b, where x1 to xn are the features. Although this line can be used for fitting, the range of y is too large, which makes the model poorly robust. To do classification, the range of y needs to be reduced to a fixed interval such as [0,1]. This can be done with a change of variable:
Make Y=ln (p/
