1. Decision Tree Algorithm
A decision tree, sometimes simply called a tree, is a tree structure that may be binary or multi-way. It uses the attributes of the samples as nodes and the values of those attributes as branches, much like a flowchart: each internal node represents a test on an attribute, each branch represents an outcome of that test, and each leaf node represents a class or a class distribution. The tree is built by analyzing and summarizing the attributes of a large number of samples. The root node holds the most informative attribute across all samples, each intermediate node holds the most informative attribute for the subset of samples covered by the subtree rooted at that node, and each leaf node holds a class value.
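One common way to make "most informative attribute" concrete is information gain based on entropy. The sketch below assumes that measure; the text does not say which score SQL Server uses, and the function names and sample rows are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Reduction in entropy of `target` after splitting `rows` on `attribute`."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

def best_attribute(rows, attributes, target):
    """Pick the attribute with the highest information gain."""
    return max(attributes, key=lambda a: information_gain(rows, a, target))

# Hypothetical training samples.
rows = [
    {"age": "young", "owns_home": "no", "buys": "no"},
    {"age": "young", "owns_home": "yes", "buys": "yes"},
    {"age": "old", "owns_home": "no", "buys": "no"},
    {"age": "old", "owns_home": "yes", "buys": "yes"},
]
root = best_attribute(rows, ["age", "owns_home"], "buys")
```

Here `owns_home` separates the classes perfectly, so it would be chosen for the root; the same selection would then be repeated on each subset to pick the intermediate nodes.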
To classify a record, start at the root node and apply its test condition to the record, then follow the branch that matches the result. That branch leads either to another internal node, where a new test condition is applied, or to a leaf node, whose class label is assigned to the record. Each branch of the decision tree therefore ends in either a new decision node or a leaf, which is the end of the tree. As the tree is traversed from top to bottom, each node poses a question, different answers lead down different branches, and the path finally reaches a leaf node. This is how a decision tree classifies. The decision tree algorithm can predict the value of a categorical dependent variable for a case from one or more predictor variables.
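The traversal just described can be sketched in a few lines. The nested-dictionary tree layout, the attribute names, and the class labels below are assumptions made for illustration only.

```python
# Hypothetical tree: internal nodes test an attribute, leaves carry a label.
tree = {
    "attribute": "owns_home",
    "branches": {
        "yes": {"label": "buys"},
        "no": {
            "attribute": "age",
            "branches": {
                "young": {"label": "does_not_buy"},
                "old": {"label": "buys"},
            },
        },
    },
}

def classify(node, record):
    """Walk from the root to a leaf, following the branch for each test result."""
    while "label" not in node:
        value = record[node["attribute"]]   # apply the node's test to the record
        node = node["branches"][value]      # follow the matching branch
    return node["label"]                    # the leaf's class is the prediction

prediction = classify(tree, {"owns_home": "no", "age": "old"})
```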
In SQL Server 2008, we can view the decision tree model through the Mining Model Viewer, as shown in Figure 1.
In Figure 1, the decision tree view consists of a series of splits. The most important split, as determined by the algorithm, is located on the left side of the viewer, at the All node; the other splits appear to its right. The dependency network shows the dependencies between the input attributes and the predictable attributes in the model, and a slider lets you filter them by dependency strength.
2. Clustering Analysis algorithm
The cluster analysis algorithm measures the similarity between individuals based on the distance between their data points in geometric space: the closer two points are, the more similar they are and the more likely they are to be placed in the same class. After the clusters are initially defined, the algorithm calculates how well they represent the groupings of points, then tries to redefine the groupings to create clusters that better represent the data. It repeats this process until redefining the clusters no longer improves the result. Simply put, clustering is the process of grouping a collection of data objects into multiple classes of similar objects. In business customer analysis, for example, clustering can identify distinct customer groups within the customer base so that the behavior patterns of each group can be analyzed.
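The assign-and-refine loop described above resembles k-means, so a minimal k-means-style sketch is shown here. It is an illustration under that assumption, not the SQL Server implementation; the starting centers are passed in explicitly so the run is repeatable.

```python
from math import dist  # Euclidean distance, Python 3.8+

def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            for p in points]

def update(points, labels, centers):
    """Move each center to the mean of its members (keep it if it has none)."""
    new_centers = []
    for i, old in enumerate(centers):
        members = [p for p, label in zip(points, labels) if label == i]
        if not members:
            new_centers.append(old)
            continue
        new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
    return new_centers

def k_means(points, centers, max_iter=100):
    """Repeat assign/update until the grouping stops changing."""
    labels = assign(points, centers)
    for _ in range(max_iter):
        centers = update(points, labels, centers)
        new_labels = assign(points, centers)
        if new_labels == labels:  # no further refinement is possible
            break
        labels = new_labels
    return labels, centers

# Hypothetical customers described by two numeric attributes.
points = [(1, 1), (1, 2), (10, 10), (11, 10)]
labels, centers = k_means(points, [(0, 0), (5, 5)])
```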
In SQL Server 2008, we can view the clustering model through the Mining Model Viewer, as shown in Figure 2.
In Figure 2, the cluster diagram shows the strength of the relationships between clusters. The cluster profiles view helps you understand the degree of correlation between the dependent variable and the independent variables. The cluster characteristics view presents the characteristics of each cluster. The cluster discrimination view compares the characteristics of two clusters.
3. Naive Bayes algorithm
The Naive Bayes algorithm is a classification algorithm provided by Microsoft SQL Server Analysis Services for predictive modeling. It applies Bayes' theorem under the assumption that the effect of an attribute value on a given class is independent of the values of the other attributes. Compared with other algorithms it requires little computation, so a mining model can be generated quickly to discover the relationships between the input columns and the predictable columns. The algorithm is useful for initial data exploration, and on large databases it also shows high accuracy and speed, comparable to decision trees and neural networks.
The algorithm uses supervised learning, so the classes must be known before classification. It learns from training samples in order to classify effectively: it derives a central concept for each class from the attribute relationships in the training samples, and then uses these concepts to predict the class of unclassified data objects.
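The independence assumption means a class score is simply the class prior multiplied by one conditional probability per attribute. The sketch below illustrates this for categorical attributes; the add-one (Laplace) smoothing, the function names, and the sample rows are assumptions that are not taken from the text.

```python
from collections import Counter, defaultdict

def train(rows, target):
    """Count classes and, per (class, attribute), how often each value occurs."""
    class_counts = Counter(r[target] for r in rows)
    value_counts = defaultdict(Counter)   # (class, attribute) -> value counts
    values_seen = defaultdict(set)        # attribute -> distinct values
    for r in rows:
        for attr, value in r.items():
            if attr == target:
                continue
            value_counts[(r[target], attr)][value] += 1
            values_seen[attr].add(value)
    return class_counts, value_counts, values_seen

def predict(model, record):
    """Return normalized class probabilities for an unclassified record."""
    class_counts, value_counts, values_seen = model
    total = sum(class_counts.values())
    scores = {}
    for cls, n in class_counts.items():
        score = n / total  # prior P(class)
        for attr, value in record.items():
            count = value_counts[(cls, attr)][value]
            # P(value | class) with add-one smoothing (an assumption).
            score *= (count + 1) / (n + len(values_seen[attr]))
        scores[cls] = score
    norm = sum(scores.values())
    return {cls: s / norm for cls, s in scores.items()}

# Hypothetical training samples with known classes.
rows = [
    {"weather": "sunny", "wind": "weak", "play": "yes"},
    {"weather": "sunny", "wind": "weak", "play": "yes"},
    {"weather": "rainy", "wind": "strong", "play": "no"},
    {"weather": "rainy", "wind": "weak", "play": "no"},
]
probs = predict(train(rows, "play"), {"weather": "sunny", "wind": "weak"})
```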
In SQL Server 2008, we can view the Naive Bayes model through the Mining Model Viewer, as shown in Figure 3.
In Figure 3, the dependency network helps you further understand the distribution of the data. The attribute profiles view shows the distribution of the attributes for each variable. The attribute characteristics view shows the probability of the basic characteristics of each class. The attribute discrimination view compares the attributes between two states of the predictable attribute.
4. Association Rules algorithm
The association rules algorithm discovers the degree of relationship between variables and individuals in a database, that is, it finds interesting associations or correlations between itemsets in large amounts of data. A typical example of association rule mining is shopping basket analysis, which examines the relationships between the different products customers buy in order to uncover their buying habits and help vendors develop marketing strategies. The algorithm has two important parameters, support and confidence. Support is the probability that itemset X and itemset Y occur together, that is, that both appear in the same transaction. Confidence is the probability that itemset Y occurs given that itemset X has occurred.
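Support and confidence for a rule X -> Y can be computed directly from a list of transactions. The following is a small sketch using the probability definitions above; the basket contents and function names are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, x, y):
    """P(Y | X): support of X and Y together divided by support of X."""
    return support(transactions, set(x) | set(y)) / support(transactions, x)

# Hypothetical shopping baskets, one set of products per customer visit.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
rule_support = support(baskets, {"bread", "milk"})          # 2 of 4 baskets
rule_confidence = confidence(baskets, {"bread"}, {"milk"})  # 2 of 3 bread baskets
```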
In SQL Server 2008, we can view the association rules model through the Mining Model Viewer, as shown in Figure 4.
In Figure 4, the rules view lists the association rules generated by the algorithm, so we can examine the content of each rule along with its support and confidence. The itemsets view lists the itemsets generated by the algorithm, showing the contents of each itemset and its support. The dependency network presents the correlations between products graphically, which makes the dependencies between them easier to understand.
5. Sequence Clustering algorithm
The sequence clustering algorithm is a sequence analysis algorithm provided by Microsoft SQL Server Analysis Services. You can use it to explore data that contains events that can be linked by paths, or sequences. The algorithm finds the most common sequences by grouping, or clustering, sequences that are identical.
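As a rough illustration of what "finding the most common order" involves, the sketch below counts identical sequences and the transitions between consecutive events. It is a simplification for illustration and does not perform the clustering step; the click-path data and function names are hypothetical.

```python
from collections import Counter

def transition_counts(sequences):
    """Count how often each event is immediately followed by another."""
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))  # consecutive (from, to) pairs
    return counts

def most_common_path(sequences):
    """Return the most frequent complete sequence and its count."""
    return Counter(tuple(s) for s in sequences).most_common(1)[0]

# Hypothetical click paths through a web site.
paths = [
    ["home", "search", "cart"],
    ["home", "search", "cart"],
    ["home", "cart"],
]
transitions = transition_counts(paths)
top_path, top_count = most_common_path(paths)
```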
In SQL Server 2008, we can view the sequence clustering model through the Mining Model Viewer, as shown in Figure 5.
In Figure 5, the cluster diagram shows all the clusters in the mining model, and the shading of the lines between them indicates how similar the clusters are. You can adjust the number of lines displayed with the slider to the right of the clusters. The cluster profiles view provides an overall view of the clusters created by the algorithm, showing each attribute in a cluster and the distribution of that attribute. The cluster characteristics view lets you examine the characteristics that make up a cluster. The cluster discrimination view lets you compare the attributes of two clusters. After you select a cluster, the state transitions view lets you browse the transitions between sequence states in that cluster.
6. Time Series algorithm
The time series algorithm provides regression algorithms that are optimized for forecasting continuous values. It decomposes a time series into its main trend component and seasonal components, and checks whether the theoretical model reflects the observed phenomena.
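To illustrate what splitting a series into trend and seasonal components means, here is a classical additive decomposition using a centered moving average. This is a textbook technique chosen for illustration, not the method SQL Server uses internally; an odd season length is assumed to keep the window symmetric.

```python
def moving_average_trend(series, period):
    """Centered moving average; `period` is assumed to be odd."""
    half = period // 2
    trend = [None] * len(series)  # the edges have no full window
    for i in range(half, len(series) - half):
        window = series[i - half : i + half + 1]
        trend[i] = sum(window) / period
    return trend

def seasonal_component(series, trend, period):
    """Average deviation from the trend at each position in the season."""
    buckets = [[] for _ in range(period)]
    for i, (value, t) in enumerate(zip(series, trend)):
        if t is not None:
            buckets[i % period].append(value - t)
    return [sum(b) / len(b) if b else 0.0 for b in buckets]

# Hypothetical series: a rising trend plus a repeating pattern of length 3.
pattern = [1, -1, 0]
series = [i + pattern[i % 3] for i in range(9)]
trend = moving_average_trend(series, 3)
seasonal = seasonal_component(series, trend, 3)
```

On this synthetic series the recovered seasonal component matches the pattern that was added, and the trend matches the underlying straight line wherever a full window is available.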
In SQL Server 2008, we can view the time series model through the Mining Model Viewer, as shown in Figure 6.
In Figure 6, the chart view shows the past values of the predicted variables, along with the forecast values and their error intervals. The model view displays the completed model produced by the algorithm as a tree.
7. Neural Network algorithm
A neural network is a set of connected input and output units in which each connection has an associated weight. During the training phase, the weights of the network are adjusted so that it predicts the correct class label for the input samples. The neural network algorithm creates networks of up to three layers of neurons: an input layer, an optional hidden layer, and an output layer. Input neurons define all the input attribute values of the data mining model and their probabilities. Hidden neurons receive input from the input neurons and provide output to the output neurons. The hidden layer is where weights are assigned to the various input probabilities. A weight describes the relevance or importance of a particular input to a hidden neuron: the greater the weight assigned to an input, the more important that input's value is. Output neurons represent the predictable attribute values of the data mining model.
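A forward pass through such a three-layer network can be sketched as follows. The sigmoid activation, the bias terms, and the example weights are assumptions chosen for illustration; training (adjusting the weights) is not shown.

```python
from math import exp

def sigmoid(x):
    """Squash a weighted sum into the range (0, 1)."""
    return 1.0 / (1.0 + exp(-x))

def layer(inputs, weights, biases):
    """One output per neuron: activation of its weighted input sum plus bias."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

def forward(inputs, hidden_w, hidden_b, output_w, output_b):
    """Input layer -> hidden layer -> output layer."""
    hidden = layer(inputs, hidden_w, hidden_b)
    return layer(hidden, output_w, output_b)

# Hypothetical network: 2 inputs, 2 hidden neurons, 1 output neuron.
outputs = forward(
    [1.0, 0.0],
    hidden_w=[[0.0, 0.0], [0.0, 0.0]], hidden_b=[0.0, 0.0],
    output_w=[[2.0, 2.0]], output_b=[0.0],
)
```

A larger weight on a connection makes the corresponding input contribute more to the neuron's weighted sum, which is the sense in which weights express importance.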
In SQL Server 2008, we can view the neural network model through the Mining Model Viewer, as shown in Figure 7.
In Figure 7, the input pane selects the attributes and attribute values that the neural network model uses as inputs. The output pane specifies the attribute that the neural network uses as output. In the variables pane, the bar to the right of an attribute indicates which state of the output attribute the given input attribute state favors, and the size of the bar indicates how strongly the output state favors that input state.
8. Logistic regression algorithm
Logistic regression is a statistical method that classifies records according to the values of their input fields. It builds a set of equations that link the input field values to the probability of each class of the output field. The model analyzes the relationship between a binary or ordinal dependent variable and the explanatory variables, and uses the independent variables to predict the probability that the dependent variable takes a given value. Once the model is generated, it can be used to estimate probabilities for new data, and the target class with the highest probability is assigned as the predicted output value for a record. The logistic regression algorithm is a variant of the neural network algorithm and is used to determine the effect of multiple factors on a pair of outcomes. It models the relationship between the inputs and the output, measuring the impact of each input on the output and weighing the role of the different inputs in the completed model.
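The step of turning input values into class probabilities and picking the most probable class can be sketched as follows. The softmax formulation, the weights, and the class names are hypothetical, and fitting the weights from data is not shown.

```python
from math import exp

def class_probabilities(features, weights, biases):
    """One linear equation per class, converted to probabilities with softmax."""
    scores = {
        cls: sum(w * x for w, x in zip(weights[cls], features)) + biases[cls]
        for cls in weights
    }
    top = max(scores.values())  # subtract the maximum for numerical stability
    exps = {cls: exp(s - top) for cls, s in scores.items()}
    total = sum(exps.values())
    return {cls: e / total for cls, e in exps.items()}

def predict_class(features, weights, biases):
    """The class with the highest probability is the predicted output."""
    probs = class_probabilities(features, weights, biases)
    return max(probs, key=probs.get)

# Hypothetical fitted coefficients for two classes and two input fields.
weights = {"churn": [1.0, -0.5], "stay": [-1.0, 0.5]}
biases = {"churn": 0.0, "stay": 0.0}
probs = class_probabilities([2.0, 1.0], weights, biases)
label = predict_class([2.0, 1.0], weights, biases)
```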
9. Linear Regression algorithm
Linear regression is a commonly used method for statistics and prediction. In SQL Server it is a variant of the decision tree algorithm. It helps calculate the linear relationship between the dependent variable and the independent variables, estimates the best linear equation for predicting the output field from the input fields, and then uses that relationship to make predictions. Use linear regression to determine the relationship between two continuous columns, represented as a line or a plane, and to evaluate the relationship between data columns.
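For a single input column, the best linear equation in the least-squares sense has a closed form. The sketch below assumes ordinary least squares with one independent variable; the data values are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(x, slope, intercept):
    """Use the fitted relationship to predict the output for a new input."""
    return slope * x + intercept

# Hypothetical input and output columns.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
forecast = predict(5, slope, intercept)
```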
9 Types of data mining algorithms in SQL Server 2008