Common data mining algorithms

Last Update:2018-12-06 Source: Internet

Author: User

Nine common data mining algorithms are provided in SQL Server. These algorithms are used in different data mining application scenarios. Next we will analyze and discuss each algorithm one by one.

1. Decision Tree Algorithm

A decision tree is a tree structure similar to a binary or multiway tree. It uses the attributes of samples as nodes and the values of those attributes as branches, much like a flowchart. Each internal node represents a test on an attribute, each branch represents an outcome of that test, and each leaf node represents a class or class distribution. The tree is built by analyzing and summarizing the attributes of a large number of samples. The root node is the attribute that carries the most information across all samples. Each intermediate node is the attribute that carries the most information within the sample subset covered by the subtree rooted at that node. Each leaf node is a class value of the samples.
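The idea of choosing the attribute that carries the most information can be sketched as follows. This is a minimal illustration that assumes "amount of information" means entropy-based information gain; the scoring method SQL Server actually uses may differ, and the toy rows and the function names are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy reduction obtained by splitting rows on one attribute."""
    base = entropy([row[target] for row in rows])
    remainder = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [row[target] for row in rows if row[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


def best_attribute(rows, attributes, target):
    """Attribute with the largest information gain: the candidate for the node."""
    return max(attributes, key=lambda a: information_gain(rows, a, target))


# Hypothetical toy samples: "outlook" separates the classes perfectly.
rows = [
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "yes"},
]
root = best_attribute(rows, ["outlook", "windy"], "play")
```

Applying the same selection recursively to each sample subset yields the intermediate nodes.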

To classify a record, start at the root node of the tree and apply its test condition to the record. Select the branch that matches the test result and follow it. This leads either to another internal node, where a new test condition is applied, or to a leaf node, whose class label is assigned to the record. Each branch of a decision tree therefore ends in either a new decision node or a leaf, which marks the end of the tree. While traversing the tree from top to bottom, each node poses a question, and different answers lead down different branches until a leaf node is reached. This is the process of classification with a decision tree. Given a categorical dependent variable, the decision tree algorithm can predict the outcome of a case from one or more predictor variables.
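The top-down traversal described above can be sketched in a few lines. The tree layout (nested dictionaries with hypothetical "attribute" and "branches" keys) and the sample attributes are assumptions made for illustration only.

```python
# Hypothetical tree: internal nodes are dicts, leaves are plain class labels.
tree = {
    "attribute": "outlook",
    "branches": {
        "sunny": "no",
        "rainy": {
            "attribute": "windy",
            "branches": {"yes": "no", "no": "yes"},
        },
    },
}


def classify(node, record):
    """Follow the branch matching each test until a leaf (class label) is reached."""
    while isinstance(node, dict):
        value = record[node["attribute"]]  # answer to the question at this node
        node = node["branches"][value]     # follow the matching branch
    return node


label = classify(tree, {"outlook": "rainy", "windy": "no"})
```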

In SQL Server 2008, we can view the decision tree model through the Mining Model Viewer, as shown in Figure 1.

Figure 1

In Figure 1, we can see that the decision tree is composed of a series of splits. The most important split, as determined by the algorithm, is on the left side of the viewer at the "All" node. Other splits appear to its right. The dependency network displays the dependencies between the input attributes and the predictable attributes in the model. You can use the slider to filter by dependency strength.

2. Clustering Analysis Algorithm

The clustering analysis algorithm groups individuals by measuring the similarity between them. Similarity is determined by the distance between data points in a geometric space: the closer two points are, the more similar they are, and the more likely they are to be placed in the same cluster. After the clusters are initially defined, the algorithm evaluates how well they represent the groupings of the points, and then tries to redefine the groupings to create clusters that represent the data better. The algorithm repeats this process until redefining the clusters no longer improves the results. In short, clustering is the process of grouping a collection of data objects into multiple classes composed of similar objects. In business, clustering is used for customer analysis: distinct customer groups can be found in a customer database, and the behavior patterns of each group can then be analyzed.
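The define, evaluate, and redefine loop can be illustrated with k-means, one common distance-based clustering method. This is a sketch under assumptions: the article does not say which clustering method is used, and taking the first k points as initial centers is a simplification chosen to keep the example deterministic.

```python
from math import dist  # Euclidean distance, Python 3.8+


def k_means(points, k, max_iterations=100):
    """Iteratively reassign points to the nearest center until nothing changes."""
    centers = [tuple(p) for p in points[:k]]  # naive initialization (assumption)
    assignment = None
    for _ in range(max_iterations):
        new_assignment = [
            min(range(k), key=lambda c: dist(p, centers[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # redefining the clusters no longer changes the result
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                # New center is the mean of the cluster's members, per axis.
                centers[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centers, assignment


# Hypothetical data: two visually obvious groups.
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centers, assignment = k_means(points, k=2)
```

With this data the loop stops after the points near the origin and the points near (10, 10) end up in separate clusters.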

In SQL Server 2008, we can view the cluster analysis model through the Mining Model Viewer, as shown in Figure 2.

Figure 2

In Figure 2, the cluster diagram shows the relationships between clusters. The cluster profiles describe the relationship between the dependent and independent variables. The cluster characteristics present the features of each cluster. The cluster discrimination view compares the features of two clusters.

3. Naive Bayes Algorithm

The naive Bayes algorithm is a classification algorithm provided by Microsoft SQL Server Analysis Services for predictive modeling. It applies Bayes' theorem under the assumption that the effect of an attribute value on a given class is independent of the values of the other attributes. Compared with other algorithms, it requires little computation, so it can quickly generate a mining model that reveals the relationships between input columns and predictable columns. The algorithm is well suited to initial data exploration. On large databases it also shows high accuracy and speed, comparable to decision trees and neural networks.

The algorithm uses supervised learning: the classes must be known in advance, before classification. The algorithm learns from labeled training samples so that it can classify them effectively. That is, it derives the central concept of each class from the attribute relationships in the training samples, and then uses these concepts to predict the class of unclassified data objects.
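A minimal sketch of this idea for categorical attributes follows. It multiplies the class prior by one independent per-attribute probability, which is the naive independence assumption. The add-one smoothing, the function names, and the toy rows are assumptions for illustration, not details taken from SQL Server.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, target):
    """Count classes and, per (attribute, class), how often each value occurs."""
    class_counts = Counter(row[target] for row in rows)
    value_counts = defaultdict(Counter)  # (attribute, class) -> value counts
    values_seen = defaultdict(set)       # attribute -> distinct values
    for row in rows:
        for attribute, value in row.items():
            if attribute != target:
                value_counts[(attribute, row[target])][value] += 1
                values_seen[attribute].add(value)
    return class_counts, value_counts, values_seen


def predict_naive_bayes(model, record):
    """Return the most probable class and the unnormalized score of each class."""
    class_counts, value_counts, values_seen = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior P(class)
        for attribute, value in record.items():
            # P(value | class) with add-one smoothing (an assumption of this sketch)
            numerator = value_counts[(attribute, label)][value] + 1
            denominator = count + len(values_seen[attribute])
            score *= numerator / denominator
        scores[label] = score
    return max(scores, key=scores.get), scores


# Hypothetical labeled training samples.
rows = [
    {"outlook": "sunny", "windy": "no", "play": "yes"},
    {"outlook": "sunny", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "no", "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "yes"},
]
model = train_naive_bayes(rows, "play")
label, scores = predict_naive_bayes(model, {"outlook": "sunny", "windy": "no"})
```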

In SQL Server 2008, we can view the naive Bayes model through the Mining Model Viewer, as shown in Figure 3.

Figure 3

In Figure 3, the dependency network helps you understand how the data is distributed. The attribute profiles show the feature distribution of each variable. The attribute characteristics show the basic feature probabilities of the different groups. The attribute discrimination view compares the attribute features of two groups.

4. Association Rule Algorithm

The association rule algorithm is used to discover relationships between variables in a database, that is, to find interesting associations or correlations between item sets in a large amount of data. A typical example of association rule mining is shopping basket analysis, which analyzes the relationships between the different products a customer purchases in order to uncover purchasing habits and help sellers develop marketing strategies. The association rule algorithm has two important parameters: support and confidence. For a rule in which X implies Y, support is the probability that X and Y occur together across all transactions. Confidence is the probability that Y occurs given that X occurs.
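Support and confidence can be computed directly from a list of transactions. The sketch below uses a hypothetical shopping basket data set; the function names are illustrative.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= set(t) for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """P(consequent | antecedent) = support(X and Y) / support(X)."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical baskets.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
rule_support = support(transactions, {"bread", "milk"})          # 2 of 4 baskets
rule_confidence = confidence(transactions, {"bread"}, {"milk"})  # 2 of 3 bread baskets
```

Here the rule "bread implies milk" has support 0.5 and confidence of about 0.67: half of all baskets contain both items, and two thirds of the baskets with bread also contain milk.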

In SQL Server 2008, we can view the association rule model through the Mining Model Viewer, as shown in Figure 4.

In Figure 4, the rules view lists the association rules generated by the algorithm, along with the content, support, and confidence of each rule. The item set view lists the item sets generated by the algorithm, along with the content and support of each one. The dependency network presents the correlations between products as a graph.

5. Sequence Clustering Algorithm

The sequence clustering algorithm is a sequence analysis algorithm provided by Microsoft SQL Server Analysis Services. You can use it to explore data that contains events that can be linked by following paths, or sequences. The algorithm finds the most common sequences by grouping, or clustering, sequences that are identical.

In SQL Server 2008, we can view the sequence clustering model through the Mining Model Viewer, as shown in Figure 5.
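As a rough illustration of finding the most common sequences, the sketch below groups identical sequences and counts them, and also counts the transitions between consecutive states. It is not the SQL Server algorithm, which is more elaborate; the click paths and function names are hypothetical.

```python
from collections import Counter


def most_common_sequences(sequences, top=3):
    """Group identical sequences and return the most frequent ones with counts."""
    return Counter(tuple(s) for s in sequences).most_common(top)


def transition_counts(sequences):
    """Count how often each state is followed directly by another state."""
    counts = Counter()
    for s in sequences:
        counts.update(zip(s, s[1:]))  # consecutive (from_state, to_state) pairs
    return counts


# Hypothetical click paths on a web site.
sequences = [
    ["home", "search", "cart"],
    ["home", "search", "cart"],
    ["home", "cart"],
]
common = most_common_sequences(sequences, top=1)
transitions = transition_counts(sequences)
```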

Figure 5

In Figure 5, the cluster diagram shows all the clusters in the mining model. The shading of the lines between clusters indicates how similar the clusters are. You can change the number of lines displayed by adjusting the slider to the right of the clusters. The cluster profiles give an overall view of the clusters created by the algorithm, showing each attribute and its distribution within each cluster. The cluster characteristics view lets you examine the features that make up a cluster. The cluster discrimination view compares the attributes of two clusters. After you select a cluster, the state transitions view lets you browse the transitions between sequence states in that cluster.

6. Time Series Algorithm

The time series algorithm provides regression algorithms optimized for predicting continuous values. The time series is decomposed into a main trend component and a seasonal variation component, and the model is then checked to see whether it reflects the observed behavior.

In SQL Server 2008, we can view the time series model through the Mining Model Viewer, as shown in Figure 6.
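Decomposing a series into trend and seasonal components can be sketched with a classical additive decomposition: a centered moving average estimates the trend, and the average deviation from the trend at each position in the period estimates the seasonal effect. This is a generic textbook technique used here as an assumption, not the method SQL Server implements, and the synthetic series is hypothetical.

```python
def centered_moving_average(series, period):
    """Trend estimate; entries without a full window are None."""
    half = period // 2
    trend = [None] * len(series)
    for t in range(half, len(series) - half):
        window = series[t - half : t + half + 1]
        total = sum(window)
        if period % 2 == 0:
            # Even period: the two end points get half weight each.
            total -= 0.5 * (window[0] + window[-1])
        trend[t] = total / period
    return trend


def seasonal_component(series, trend, period):
    """Average detrended value per position in the period, centered on zero."""
    buckets = [[] for _ in range(period)]
    for t, (value, level) in enumerate(zip(series, trend)):
        if level is not None:
            buckets[t % period].append(value - level)
    means = [sum(b) / len(b) for b in buckets]  # needs at least two full periods
    offset = sum(means) / period
    return [m - offset for m in means]


# Hypothetical quarterly data: a linear trend plus a repeating seasonal pattern.
pattern = [1, -1, 2, -2]
series = [t + pattern[t % 4] for t in range(12)]
trend = centered_moving_average(series, period=4)
seasonal = seasonal_component(series, trend, period=4)
```

On this synthetic series the decomposition recovers the linear trend and the seasonal pattern that were used to build it.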

Figure 6

In Figure 6, the chart view displays the historical values, predicted values, and error ranges of the predicted variable. The model view displays the completed model as a tree.

7. Neural Network Algorithms

A neural network is a set of connected input and output units in which each connection has an associated weight. During the training phase, the network adjusts these weights so that it can predict the correct class labels of the input samples. The neural network algorithm creates a network composed of up to three layers of neurons: an input layer, an optional hidden layer, and an output layer. Input neurons define all the input attribute values of the data mining model and their probabilities. Hidden neurons receive input from input neurons and provide output to output neurons. The hidden layer is where the various input probabilities are assigned weights. A weight describes the relevance or importance of a particular input to a hidden neuron: the greater the weight assigned to an input, the more important that input's value is. Output neurons represent the predictable attribute values of the data mining model.

In SQL Server 2008, we can view the neural network model through the Mining Model Viewer, as shown in Figure 7.
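How weighted connections carry values from the input layer through a hidden layer to the output layer can be sketched as a forward pass. The sigmoid activation, the bias terms, and the example weights are assumptions for illustration; training, which adjusts the weights, is omitted.

```python
from math import exp


def sigmoid(x):
    """Squash a weighted sum into the range (0, 1)."""
    return 1.0 / (1.0 + exp(-x))


def layer_output(inputs, weights, biases):
    """One output per neuron: activation of the weighted sum of its inputs."""
    return [
        sigmoid(sum(w * x for w, x in zip(neuron_weights, inputs)) + bias)
        for neuron_weights, bias in zip(weights, biases)
    ]


def forward(inputs, hidden_weights, hidden_biases, output_weights, output_biases):
    """Input layer -> hidden layer -> output layer."""
    hidden = layer_output(inputs, hidden_weights, hidden_biases)
    return layer_output(hidden, output_weights, output_biases)


# Hypothetical network: 2 inputs, 2 hidden neurons, 1 output neuron.
outputs = forward(
    inputs=[1.0, 0.0],
    hidden_weights=[[2.0, -1.0], [-1.0, 2.0]],
    hidden_biases=[0.0, 0.0],
    output_weights=[[1.5, -1.5]],
    output_biases=[0.0],
)
```

A larger weight on a connection makes the corresponding input contribute more to the neuron's weighted sum, which is the sense in which the weight expresses the input's importance.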

Figure 7

In Figure 7, the input pane selects the attributes and attribute values that the neural network model uses as inputs. The output pane specifies the attribute that the neural network uses as output. In the variables pane, the bar to the right of each attribute shows which output attribute state that input attribute state favors. The size of the bar indicates how strongly the output state favors the input state.

8. Logistic Regression Algorithms

Logistic regression is a statistical method that classifies records based on the values of the input fields. A set of equations relates the input field values to the probability of each class of the output field. The model analyzes the relationship between a binary or ordinal dependent variable and the explanatory variables, and uses the independent variables to predict the probability that the dependent variable takes a given value. Once a model is generated, it can be used to estimate probabilities for new data. The target class with the highest probability is assigned as the predicted output value for the record. In SQL Server, the logistic regression algorithm is a variant of the neural network algorithm, and it is used to determine the impact of multiple factors on a pair of outcomes. It models the relationship between inputs and outputs, measures the impact of each input on the output, and weighs the role of the different inputs in the completed model.
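The prediction step, one equation per class followed by picking the class with the highest probability, can be sketched as below. The coefficients are hypothetical values standing in for a fitted model; fitting them is outside the scope of this sketch.

```python
from math import exp


def class_probabilities(inputs, coefficients, intercepts):
    """One linear equation per class, normalized into probabilities (softmax)."""
    scores = {
        label: intercepts[label] + sum(w * x for w, x in zip(weights, inputs))
        for label, weights in coefficients.items()
    }
    top = max(scores.values())  # subtract the maximum for numerical stability
    exps = {label: exp(score - top) for label, score in scores.items()}
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}


def predict_class(inputs, coefficients, intercepts):
    """The class with the highest probability is the predicted output value."""
    probabilities = class_probabilities(inputs, coefficients, intercepts)
    return max(probabilities, key=probabilities.get)


# Hypothetical fitted model with two input fields and two output classes.
coefficients = {"buy": [1.0, 0.5], "skip": [0.0, 0.0]}
intercepts = {"buy": -1.0, "skip": 0.0}
probabilities = class_probabilities([2.0, 2.0], coefficients, intercepts)
prediction = predict_class([2.0, 2.0], coefficients, intercepts)
```

With two classes and one of them fixed at zero, this reduces to the familiar logistic function applied to a single linear score.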

9. Linear Regression Algorithms

Linear regression is a common method for statistics and prediction. In SQL Server, it is a variant of the decision tree algorithm that calculates the linear relationship between a dependent variable and independent variables: the optimal linear equation for the output field is estimated from the input fields, and that relationship is then used for prediction. Use linear regression to determine the relationship between two continuous columns, represented as a straight line or plane, in order to evaluate the relationship between the data columns.
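For two continuous columns, the optimal straight line can be estimated with ordinary least squares. The sketch below covers the single-input case only, and the column values are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_value(x, slope, intercept):
    """Use the fitted relationship for prediction."""
    return slope * x + intercept


# Hypothetical columns that lie exactly on y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
slope, intercept = fit_line(xs, ys)
forecast = predict_value(5, slope, intercept)
```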

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
