Decision tree algorithm implementation for a commodity purchasing ability forecast case
Bai Ningsu
December 24, 2016 22:05:42
Abstract: With the surge of interest in machine learning and deep learning, books on these topics abound. Most of them, however, introduce only the basic theory and lack a thorough treatment of implementation. This series of articles grew out of the author's notes, combining video courses with textbook fundamentals, and is written to join theory with practice. It first introduces the scope of machine learning and deep learning, then basic notions such as training sets and test sets, and then the common machine learning algorithms: supervised classification (decision trees, k-nearest neighbors, support vector machines, neural networks), supervised regression (linear regression, nonlinear regression), and unsupervised learning (K-means clustering, hierarchical clustering). Each article introduces the theory behind one algorithm and then works through its concrete Python implementation with a case analysis.
(This article is original; when reprinting, please credit the source: Decision tree algorithm implementation for the commodity purchasing ability forecast case (3))
Contents
"Machine Learn" python development tool: Anaconda+sublime (1)
Machine learning and its basic concept introduction (2) Learn
The algorithm implementation of "machine Learn" decision tree in the case of commodity purchasing power capacity Prediction (3)
1 Decision trees (decision tree)
1.1 The decision tree is one of the supervised classification algorithms in machine learning. The evaluation of classification and prediction algorithms is mainly reflected in:
- Accuracy: predictive accuracy is the core concern of this kind of algorithm, used in credit systems, commodity purchase forecasting, and so on.
- Speed: a good algorithm is not only accurate; running speed is also an important measure.
- Robustness: fault tolerance and resilience, together with extensibility.
- Scalability: able to cope with real-life data volumes.
- Interpretability: the results of the computation can be explained meaningfully.
1.2 A decision tree is a flowchart-like tree structure: each internal node represents a test on an attribute, each branch represents an output of the test, and each leaf node represents a class or class distribution. The topmost node of the tree is the root.
Case: is the weather suitable for playing? The root node shows that, of 14 days, 9 are suitable for playing and 5 are not. Since the classes are mixed, the node must be subdivided:
1 Sunny: 2 days are suitable for playing and 3 are not, so subdivide further: ① humidity ≤ 70: 2 days, all suitable for playing, stop dividing; ② humidity > 70: 3 days, none suitable, stop dividing.
2 Overcast: all 4 days are suitable for playing; stop dividing.
3 Rainy: 3 days are suitable for playing and 2 are not, so continue dividing: ① no wind: 3 days, all suitable, stop dividing; ② windy: 2 days, none suitable, stop dividing.
Note: it is unwise to divide too finely; overly detailed features hurt the accuracy of prediction. When a node is dominated by one class, the few exceptions can be absorbed into the majority.
Case: given the decision tree above, if one day is sunny with humidity 90, we can read off from the tree that it is not suitable for playing.
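As a sanity check, the tree above can be written out as plain nested conditionals. This is a minimal sketch assuming the branches described above (humidity split at 70, wind only checked on rainy days); the function and attribute names are illustrative, not from the original article:

```python
def play_decision(outlook, humidity=None, windy=None):
    """Walk the play / no-play decision tree described above."""
    if outlook == 'sunny':
        # sunny days split on humidity at the threshold 70
        return 'play' if humidity <= 70 else 'no play'
    if outlook == 'overcast':
        # all 4 overcast days were suitable for playing
        return 'play'
    if outlook == 'rainy':
        # rainy days split on wind
        return 'no play' if windy else 'play'

# the case from the text: sunny with humidity 90 -> not suitable for playing
print(play_decision('sunny', humidity=90))   # no play
```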
1.3 Official documentation: http://scikit-learn.org/stable/modules/tree.html
2 The basic algorithm for constructing a decision tree: judging customers' ability to purchase goods
2.1 Algorithm Result graph:
According to the decision tree analysis of the following customer data, we judge the purchasing power of new customers, where:
Customer age (age): youth, middle-aged, senior
Customer income (income): low, medium, high
Customer identity (student): student, not a student
Customer credit (credit_rating): fair credit, excellent credit
Whether the customer buys a computer (buy_computer): buys, does not buy
2.2 Before introducing the decision tree algorithm itself, we introduce the concept of entropy.
The concept of entropy: how can information be measured and abstracted? In 1948, Shannon proposed the notion of information entropy. The amount of information needed is directly related to uncertainty: to pin down something highly uncertain, something we know nothing about, we need a great deal of information. The measure of information therefore equals the amount of uncertainty, quantified in bits: H = −Σᵢ pᵢ log₂(pᵢ).
Example: guessing the World Cup winner. If you know nothing about the teams, how many guesses do you need? The odds of each team winning are not equal, and the bit is used to measure the information involved. Information entropy applies as follows: 1 when all 32 teams have equal odds of winning, the entropy of "which team wins" is log₂ 32 = 5 bits, i.e. 5 yes/no guesses suffice to find the winning team; 2 when the odds are unequal, say Brazil, Germany, or the Netherlands are strong favourites, the entropy is less than 5 bits, i.e. you can identify the champion with fewer than 5 guesses on average.
Note: The greater the uncertainty of the variable, the greater the entropy
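A minimal sketch of the World Cup arithmetic, using the standard Shannon entropy formula (the skewed probabilities are a made-up illustration, not from the original article):

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits: H = -sum(p * log2(p))."""
    return -sum(p * log2(p) for p in probs if p > 0)

# 32 teams, all equally likely: entropy is log2(32) = 5 bits,
# i.e. 5 yes/no questions suffice to pin down the winner.
print(entropy([1/32] * 32))        # 5.0

# unequal odds (favourites more likely): entropy drops below 5 bits,
# so fewer guesses are needed on average.
skewed = [0.2, 0.15, 0.15] + [0.5/29] * 29
print(entropy(skewed))             # about 4.2
```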
2.3 The decision tree induction algorithm (ID3). Around 1970-1980, J. Ross Quinlan proposed the ID3 algorithm. The first step is to select which attribute becomes the node, and for that we compare information entropy. The second step computes the information gain: Gain(A) = Info(D) − Info_A(D), the amount of information obtained by using attribute A to partition the classification.
In detail:
Information gain: Gain(A) = Info(D) − Info_A(D); for example, for the age attribute, Gain(age) = Info(buys_computer) − Info_age(buys_computer).
Info(buys_computer): among the 14 records the probability of buying is 9/14 and of not buying 5/14. Substituting into the entropy formula gives Info(buys_computer) = −(9/14)·log₂(9/14) − (5/14)·log₂(5/14) ≈ 0.940 bits.
Info_age(buys_computer): partitioning by the age attribute, youth covers 5/14 of the records with buying probability 2/5 and non-buying probability 3/5; middle-aged covers 4/14 with buying probability 1 and non-buying probability 0; senior covers 5/14 with buying probability 3/5 and non-buying probability 2/5. Substituting each partition into the entropy formula and weighting by its share gives Info_age(buys_computer) ≈ 0.694 bits.
The difference between Info(buys_computer) and Info_age(buys_computer) is the information gain of age: Gain(age) ≈ 0.940 − 0.694 = 0.246.
Similarly, Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit_rating) = 0.048.
Therefore we select the attribute with the maximum information gain, age, as the first root node.
Repeating the same calculation on each branch completes the tree.
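The numbers above can be checked with a few lines of Python; this is a sketch of the plain ID3 entropy/gain arithmetic on the class counts quoted in the text:

```python
from math import log2

def info(counts):
    """Entropy of a class distribution given as raw counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# Info(buys_computer): 9 buyers and 5 non-buyers out of 14 records
info_d = info([9, 5])

# Info_age(buys_computer): weighted entropy of the age partitions
# youth: 2 buy / 3 not; middle-aged: 4 buy / 0 not; senior: 3 buy / 2 not
info_age = (5/14)*info([2, 3]) + (4/14)*info([4, 0]) + (5/14)*info([3, 2])

print(round(info_d, 3))             # 0.94
print(round(info_d - info_age, 3))  # ~0.247, the 0.246 quoted above up to rounding
```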
2.4 Decision Tree algorithm: The formal description of the decision tree algorithm is as follows:
- The tree begins with a single node representing the training sample (step 1).
- If the sample is in the same class, the node becomes a leaf and is labeled with that class (steps 2 and 3).
- Otherwise, the algorithm uses an entropy-based measure called information gain as the heuristic for choosing the attribute that best separates the samples into classes (step 6). This attribute becomes the "test" or "decision" attribute of the node (step 7). In this version of the algorithm, all attributes are categorical, i.e. discrete-valued; continuous attributes must be discretized first.
- For each known value of the test attribute, a branch is created and the samples are partitioned accordingly (steps 8-10).
- The algorithm recursively applies the same process to form a decision tree for the samples of each partition. Once an attribute has appeared at a node, it need not be considered at any of that node's descendants (step 13).
- The recursive partitioning stops only when one of the following conditions holds:
- (a) All samples at a given node belong to the same class (steps 2 and 3).
- (b) No remaining attributes can be used to further partition the samples (step 4). In this case majority voting is applied (step 5): the node is converted into a leaf and labeled with the class held by the majority of its samples. Alternatively, the class distribution of the node's samples may be stored.
- (c) A branch test_attribute = aᵢ has no samples (step 11). In this case a leaf is created with the majority class of the samples (step 12).
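To make the recursive description concrete, here is a compact ID3 sketch in Python. It is illustrative only, assuming categorical attributes held in in-memory lists; it is not the scikit-learn-based code used later in this article:

```python
from collections import Counter
from math import log2

def info(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def id3(rows, labels, attributes):
    """rows: list of dicts (attribute -> value); labels: one class per row."""
    # (a) all samples belong to the same class -> leaf with that class
    if len(set(labels)) == 1:
        return labels[0]
    # (b) no attributes left -> leaf labeled by majority vote
    if not attributes:
        return Counter(labels).most_common(1)[0][0]

    def gain(attr):
        remainder = 0.0
        for v in set(r[attr] for r in rows):
            subset = [lab for r, lab in zip(rows, labels) if r[attr] == v]
            remainder += len(subset) / len(labels) * info(subset)
        return info(labels) - remainder

    best = max(attributes, key=gain)              # highest information gain
    rest = [a for a in attributes if a != best]   # never reuse an attribute below this node
    branches = {}
    # case (c), an empty branch, cannot arise in this sketch because we only
    # branch on attribute values that actually occur among the node's samples
    for v in set(r[best] for r in rows):
        sub = [(r, lab) for r, lab in zip(rows, labels) if r[best] == v]
        branches[v] = id3([r for r, _ in sub], [lab for _, lab in sub], rest)
    return {best: branches}
```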
Building on the ID3 decision tree, improved algorithms were derived, such as C4.5 (Quinlan) and Classification and Regression Trees, CART (L. Breiman, J. Friedman, R. Olshen, C. Stone).
What these algorithms have in common: all are greedy, top-down algorithms.
How they differ: in the attribute selection measure: C4.5 uses the gain ratio, CART uses the Gini index, and ID3 uses information gain.
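The two impurity measures are simple to state in code; a sketch using the standard definitions (nothing here is specific to this article's data beyond the 9/5 class counts):

```python
from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

# the 9-buy / 5-no-buy root node from the running example
print(entropy([9, 5]))   # ~0.940  (information gain: ID3; gain ratio: C4.5)
print(gini([9, 5]))      # ~0.459  (Gini index: CART)
```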
2.5 How are continuous-valued attributes handled? Some data are continuous and cannot be represented discretely like the experimental data above. For example, in the weather case humidity is a continuous value, and we chose humidity 70 as the split point; that is exactly the discretization of a continuous variable.
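A minimal sketch of such a discretization, with the 70 threshold from the play example (the humidity values themselves are illustrative):

```python
humidity = [65, 70, 85, 90, 68, 75]

# discretize at the split point 70: <= 70 -> 'normal', > 70 -> 'high'
humidity_binned = ['normal' if h <= 70 else 'high' for h in humidity]
print(humidity_binned)   # ['normal', 'normal', 'high', 'high', 'normal', 'high']
```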
2.6 Supplemental Knowledge
Tree pruning (to avoid overfitting): to avoid the overfitting problem, an overly elaborate tree can be pruned (i.e. its height reduced). Pruning comes in two flavors: pre-pruning and post-pruning.
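In scikit-learn, which is used below, pre-pruning is typically expressed through constructor parameters rather than a separate pruning pass; a sketch:

```python
from sklearn import tree

# pre-pruning: cap the tree's height and require a minimum number of samples
# per leaf, so overly fine splits are never made in the first place
clf = tree.DecisionTreeClassifier(criterion='entropy',
                                  max_depth=3,
                                  min_samples_leaf=2)
```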
Advantages of decision trees: intuitive, easy to understand, and effective on small-scale data sets.
Disadvantages of decision trees: they handle continuous variables poorly; when there are many categories the error grows quickly; and they only scale moderately.
3 Implementation of the decision tree algorithm in Python code: predicting a customer's ability to buy goods
3.1 Machine learning library: scikit-learn
Scikit-learn is a Python machine learning library featuring simple and efficient tools for data mining and data analysis. It is open to all users, highly reusable across different needs, built on NumPy, SciPy, and matplotlib, and open source under a commercially usable BSD license. Scikit-learn covers classification, regression, clustering, dimensionality reduction, model selection, preprocessing, and other areas.
3.2 Using scikit-learn: Anaconda already integrates the packages below, so no separate installation is needed.
- Installing scikit-learn: via pip, easy_install, or the Windows installer; the necessary packages are NumPy, SciPy, and matplotlib. Using Anaconda is simplest, since it bundles NumPy, SciPy, and the other common scientific computing packages.
- Installation notes: mind the Python interpreter version (2.7 or 3.4?) and whether the system is 32-bit or 64-bit.
Product purchase example:
Converted to a CSV file as follows:
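The original screenshot of the CSV is not reproduced here. The counts quoted in section 2.3 (9 buyers of 14; youth 2/5, middle-aged 4/4, senior 3/5; gains 0.029/0.151/0.048) match the classic AllElectronics table from Han and Kamber's textbook, so the file presumably looks like the following; the exact spelling of the values is an assumption:

```
RID,age,income,student,credit_rating,buys_computer
1,youth,high,no,fair,no
2,youth,high,no,excellent,no
3,middle_aged,high,no,fair,yes
4,senior,medium,no,fair,yes
5,senior,low,yes,fair,yes
6,senior,low,yes,excellent,no
7,middle_aged,low,yes,excellent,yes
8,youth,medium,no,fair,no
9,youth,low,yes,fair,yes
10,senior,medium,yes,fair,yes
11,youth,medium,yes,excellent,yes
12,middle_aged,medium,no,excellent,yes
13,middle_aged,high,yes,fair,yes
14,senior,medium,no,excellent,no
```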
3.3 The running effect is as follows:
Here, the datafile directory stores the model's training and test data sets, and tarfile holds the dot file (the algorithm's output in text form) and the PDF image converted from it; of the two .py files, one trains the algorithm and the other tests the training result. The predicted values "0 1 1" on the right represent three test records, of which the latter two customers have purchasing ability. The specific algorithm and its details are explained in the next section.
3.4 Specific algorithms and details
Python first imports the decision tree related packages, then converts the CSV data into a format the sklearn toolkit can recognize, then calls the decision tree algorithm, and finally renders the trained model graphically.
Importing the packages:
```python
from sklearn.feature_extraction import DictVectorizer
import csv
from sklearn import tree
from sklearn import preprocessing
from sklearn.externals.six import StringIO
```
Read the CSV file, store its feature values in the list featureList, and store the predicted target values in labelList.
```python
"""
Description: Python calls the machine learning library scikit-learn's decision tree
algorithm to forecast commodity purchasing power and renders the tree as a PDF image.
Author: Bai Ningchao
DateTime: 2016-12-24 14:08:11
Blog URL: http://www.cnblogs.com/baiboy/
"""

def trainDecisionTree(csvfileurl):
    '''Read the CSV file, store feature values in featureList and target values in labelList'''
    featureList = []
    labelList = []

    # read the commodity information
    allElectronicsData = open(csvfileurl)
    reader = csv.reader(allElectronicsData)   # reads the remaining lines one by one

    # read the header line (the csv.reader then starts at the first data row)
    headers = str(allElectronicsData.readline()).split(',')
    print(headers)
```
Operation Result:
Storing feature sequences and target sequences
```python
    '''Store the feature sequence and the target sequence'''
    for row in reader:
        labelList.append(row[len(row)-1])   # the last column holds the target value
        rowDict = {}                        # dictionary holding one row's feature values
        for i in range(1, len(row)-1):
            rowDict[headers[i]] = row[i]
            # print("rowDict:", rowDict)
        featureList.append(rowDict)
    print(featureList)
    print(labelList)
```
Operation Result:
Vectorizing the feature values
```python
    '''Vectorize features: convert the feature values to numeric form'''
    vec = DictVectorizer()                               # converts the string features to integers
    dummyX = vec.fit_transform(featureList).toarray()    # feature values converted to integer data
    print("dummyX: " + str(dummyX))
    print(vec.get_feature_names())
    print("labelList: " + str(labelList))

    # vectorize class labels
    lb = preprocessing.LabelBinarizer()
    dummyY = lb.fit_transform(labelList)
    print("dummyY: \n" + str(dummyY))
```
Operation Result:
The code above converts the commodity information into a form the machine learning decision tree library can recognize, namely the following form:
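The screenshot of that form is not reproduced here; the sketch below indicates what the vectorized output looks like. DictVectorizer orders its columns alphabetically by feature=value, and the concrete value spellings are an assumption:

```python
# vec.get_feature_names() might report, for example:
# ['age=middle_aged', 'age=senior', 'age=youth',
#  'credit_rating=excellent', 'credit_rating=fair',
#  'income=high', 'income=low', 'income=medium',
#  'student=no', 'student=yes']
#
# so a youth / high-income / non-student / fair-credit row becomes:
# dummyX[0] = [0. 0. 1.  0. 1.  1. 0. 0.  1. 0.]
# and the binarized label 'no' becomes dummyY[0] = [0]
```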
Using the decision tree for classification prediction
```python
    '''Use the decision tree for classification prediction'''
    # clf = tree.DecisionTreeClassifier()
    clf = tree.DecisionTreeClassifier(criterion='entropy')   # choose split nodes by information entropy
    clf = clf.fit(dummyX, dummyY)
    print("clf: " + str(clf))

    # visualize the model as a Graphviz dot file
    with open("../tarfile/allelectronicinformationgainori.dot", 'w') as f:
        f = tree.export_graphviz(clf, feature_names=vec.get_feature_names(), out_file=f)
```
Operation Result:
To convert the dot file into an image you need to download and install Graphviz:
Install it with the default options, then open cmd, change into the ../tarfile/ directory that holds the dot file, and run the command dot -Tpdf name.dot -o name.pdf to convert the dot file into PDF format.
Opening the file shows:
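The test step that produced the predictions "0 1 1" in section 3.3 is not shown in this article. A minimal sketch of how a new record might be classified, reusing clf and vec from the training code above (the dict values are assumptions about the CSV's exact spelling):

```python
    # classify a new customer with the trained model
    sample = [{'age': 'youth', 'income': 'high',
               'student': 'no', 'credit_rating': 'fair'}]
    X_new = vec.transform(sample).toarray()   # same one-hot encoding as the training data
    print(clf.predict(X_new))                 # 1 -> will buy, 0 -> will not buy
```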
4 Full Project download
Full project sharing
"Machine Learn" decision Tree case: A python-based forecasting system for commodity purchasing ability