"BI Things": Microsoft clustering algorithm -- classifying Three Kingdoms characters by identity

Source: Internet
Author: User


What is cluster analysis?

Cluster analysis is an exploratory method of data analysis. We generally use clustering to group and categorize seemingly unordered objects so that we can better understand what we are studying. A good clustering result has high similarity among the objects within each group and low similarity between groups. In analyzing Three Kingdoms data, many problems can be solved with cluster analysis, such as classifying the identities of Three Kingdoms characters.
What is the basic process of cluster analysis?

    • Select the clustering variables

When analyzing the identities of Three Kingdoms characters, we select, based on certain assumptions, the variables most likely to reflect a character's identity. These are generally attributes such as leadership, force, intelligence, politics, charm, and special skills, along with aptitudes for spears, infantry, crossbows, cavalry, siege weapons, and the navy. However, clustering places some requirements on the variables used:
The values of each variable differ noticeably across the objects being studied;
The variables are not highly correlated with one another.
The reasons are as follows. First, more clustering variables is not always better: a variable that shows no noticeable difference across objects contributes nothing of substance to the clustering and may bias the result. Second, including highly correlated variables effectively gives extra weight to what they measure, which amplifies the influence of that one factor on the classification.
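The two requirements can be checked before clustering. The following is a minimal Python sketch, not part of the original workflow: the sample values, the column names, and the thresholds `min_cv` and `max_corr` are all made up for illustration. It flags variables whose values barely differ across characters and pairs of variables that are highly correlated.

```python
from math import sqrt

def mean(xs):
    return sum(xs) / len(xs)

def std(xs):
    """Population standard deviation."""
    m = mean(xs)
    return sqrt(sum((x - m) ** 2 for x in xs) / len(xs))

def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either column is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    denom = sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return cov / denom if denom else 0.0

def screen_variables(columns, min_cv=0.05, max_corr=0.9):
    """Return (low_spread, correlated_pairs) for a dict of name -> values.

    min_cv and max_corr are illustrative thresholds, not recommended values.
    """
    # Coefficient of variation below min_cv: the variable barely differs.
    low_spread = [name for name, xs in columns.items()
                  if mean(xs) == 0 or std(xs) / abs(mean(xs)) < min_cv]
    names = list(columns)
    correlated_pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(columns[a], columns[b])
            if abs(r) > max_corr:
                correlated_pairs.append((a, b, round(r, 3)))
    return low_spread, correlated_pairs

# Made-up attribute values for six hypothetical characters.
columns = {
    "force":        [97, 45, 88, 30, 92, 60],
    "leadership":   [95, 50, 85, 35, 90, 62],
    "intelligence": [40, 98, 55, 92, 35, 70],
    "charm":        [70, 71, 70, 69, 70, 71],
}
low_spread, correlated = screen_variables(columns)
print("little variation:", low_spread)
print("highly correlated pairs:", correlated)
```

In this toy data, charm hardly varies, and force and leadership move almost in lockstep, so keeping both would count "fighting ability" twice.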
Ways to arrive at suitable clustering variables:
Cluster the variables themselves, then select one representative variable from each resulting group.
Run principal component analysis or factor analysis and use the newly generated variables as the clustering variables.
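As an illustration of the second approach, here is a small pure-Python sketch that extracts the first principal component from a group of correlated variables by power iteration. It is only a sketch under assumptions: the data is invented, the function names are hypothetical, and in practice a statistics package would normally do this step.

```python
from math import sqrt

def standardize(rows):
    """Scale each column to mean 0 and standard deviation 1."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    stds = [sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n) for j in range(d)]
    return [[(r[j] - means[j]) / stds[j] for j in range(d)] for r in rows]

def first_component(rows, iterations=200):
    """Return (loadings, scores) of the first principal component."""
    z = standardize(rows)
    n, d = len(z), len(z[0])
    # Correlation matrix of the standardized data.
    corr = [[sum(r[a] * r[b] for r in z) / n for b in range(d)] for a in range(d)]
    # Power iteration: repeatedly multiply by the matrix and renormalize.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(corr[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Each row's score is its projection onto the component.
    scores = [sum(r[j] * v[j] for j in range(d)) for r in z]
    return v, scores

# Made-up values: columns are force, leadership, spear aptitude.
rows = [
    [97, 95, 90],
    [45, 50, 40],
    [88, 85, 80],
    [30, 35, 38],
    [92, 90, 85],
    [60, 62, 55],
]
loadings, scores = first_component(rows)
print("loadings:", [round(x, 3) for x in loadings])
print("scores:", [round(s, 2) for s in scores])
```

The resulting score column could replace the three original variables as a single "martial ability" clustering variable, which avoids weighting that factor three times.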

    • Cluster analysis

The actual execution is surprisingly simple compared with the preparation that precedes it. Once the data is ready, load it into the analysis software (usually Analysis Services), run it, and the results come out.
The question is, how many clusters are appropriate? In general, several criteria can be combined to decide:
1. Look for the inflection point (the "elbow"), where adding more clusters stops improving the fit noticeably
2. Judge by experience or by the known characteristics of the characters
3. Check that the resulting clusters can be explained logically
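One common way to look for the inflection point is to run the clustering for several cluster counts and watch where the within-cluster error stops dropping sharply. The sketch below uses a plain k-means written in pure Python rather than the Microsoft algorithm itself, and the points are invented (force, intelligence) pairs; it only illustrates the idea.

```python
import random

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_sse(points, k, seed):
    """Run plain Lloyd's k-means and return the within-cluster squared error."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = None
    for _ in range(100):
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old centroid if a cluster goes empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return sum(sq_dist(p, centroids[l]) for p, l in zip(points, labels))

def sse_curve(points, k_values, restarts=20):
    """Lowest error over several random restarts for each cluster count."""
    return {k: min(kmeans_sse(points, k, seed) for seed in range(restarts))
            for k in k_values}

# Made-up (force, intelligence) pairs forming three loose groups.
points = [
    (95, 30), (92, 35), (97, 28),   # warriors
    (30, 95), (35, 92), (28, 97),   # strategists
    (80, 82), (78, 85), (83, 80),   # all-rounders
]
curve = sse_curve(points, range(1, 6))
# How much the error falls when going from k-1 to k clusters.
drops = {k: curve[k - 1] - curve[k] for k in range(2, 6)}
for k in sorted(curve):
    print(k, round(curve[k], 1))
```

With this toy data the error falls steeply up to three clusters and only marginally afterwards, which is the pattern the first criterion looks for. The other two criteria still apply: the chosen count has to make sense for the characters.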

    • Identify the key features of each cluster

After settling on a classification scheme, we go back and examine how each cluster of Three Kingdoms characters performs on every variable. Based on the results of a difference test, we use color-coding to mark whether each cluster scores high or low on a given indicator.
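A simple version of this step can be sketched in Python. Note the substitution: instead of a formal difference test, this sketch just compares each cluster's mean with the overall mean and labels it "high" or "low", which stands in for the color-coding. The data, labels, and function name are made up for illustration.

```python
def cluster_profiles(rows, labels, names):
    """Per-cluster mean of each variable, tagged relative to the overall mean."""
    def col_means(subset):
        return {n: sum(r[i] for r in subset) / len(subset)
                for i, n in enumerate(names)}

    overall = col_means(rows)
    profiles = {}
    for c in sorted(set(labels)):
        members = [r for r, l in zip(rows, labels) if l == c]
        means = col_means(members)
        # Simplification: "high"/"low" versus the overall mean, no significance test.
        profiles[c] = {n: (means[n], "high" if means[n] > overall[n] else "low")
                       for n in names}
    return overall, profiles

# Made-up data: (force, intelligence) and the cluster each character fell into.
names = ["force", "intelligence"]
rows = [(95, 30), (92, 35), (30, 95), (35, 92)]
labels = [0, 0, 1, 1]

overall, profiles = cluster_profiles(rows, labels, names)
for cluster, profile in profiles.items():
    print(cluster, profile)
```

Reading the output row by row gives the raw material for naming: a cluster that is high on force and low on intelligence is an obvious candidate for a "warrior" label.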

    • Cluster Interpretation & naming

When interpreting the clusters, it is best to bring in additional data, such as Kingdoms 12 data ... Finally, name each cluster after its most distinctive characteristic, and you're done!

Now for the main topic. We continue with the solution from the previous article and proceed with the following steps:





The Mining Models tab lists the mining models that have been created. You can also add a new mining model and adjust how each variable is used. The usage options are Ignore (the variable is ignored), Input (input variable), Predict (predicted variable that is also used as an input), and PredictOnly (predicted variable only).


In the Mining Models tab, right-click the model and select "Set Algorithm Parameters" to edit the algorithm's parameters, which include:
CLUSTER_COUNT: Specifies the approximate number of clusters the algorithm should build. If that number of clusters cannot be built from the data, the algorithm builds as many clusters as possible. If CLUSTER_COUNT is set to 0, the algorithm uses heuristics to determine how many clusters to build. The default value is 10.
CLUSTER_SEED: Specifies the seed number used to randomly generate clusters in the initial stage of building the model.
CLUSTERING_METHOD: The clustering method the algorithm uses: scalable EM (1), non-scalable EM (2), scalable K-means (3), or non-scalable K-means (4).
MAXIMUM_INPUT_ATTRIBUTES: Specifies the maximum number of input attributes the algorithm can handle before it invokes feature selection. Setting this value to 0 means there is no maximum number of attributes.
MAXIMUM_STATES: Specifies the maximum number of attribute states the algorithm supports. If an attribute has more states than this maximum, the algorithm uses the most common states and treats the remaining states as missing.
MINIMUM_SUPPORT: Specifies the minimum number of cases in each cluster.
MODELLING_CARDINALITY: Specifies the number of sample models constructed during the clustering process.
SAMPLE_SIZE: Specifies the number of cases the algorithm uses on each pass when CLUSTERING_METHOD is set to one of the scalable clustering methods. Setting SAMPLE_SIZE to 0 causes the entire data set to be clustered in a single pass, which can cause memory and performance problems.
STOPPING_TOLERANCE: Specifies the value used to determine when convergence is reached and the algorithm finishes building the model. Convergence is reached when the overall change in cluster probabilities is less than STOPPING_TOLERANCE divided by the size of the model.
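The same parameters can also be supplied in a DMX `CREATE MINING MODEL` statement instead of through the dialog. The Python helper below is a hypothetical sketch that only assembles such a statement as text and rejects misspelled parameter names; the model name, the column names, and the choice of `LONG` / `DOUBLE CONTINUOUS` column types are assumptions made for illustration, and the statement would still have to be run against an Analysis Services instance.

```python
# Parameter names accepted by the Microsoft Clustering algorithm.
KNOWN_PARAMETERS = {
    "CLUSTER_COUNT", "CLUSTER_SEED", "CLUSTERING_METHOD",
    "MAXIMUM_INPUT_ATTRIBUTES", "MAXIMUM_STATES", "MINIMUM_SUPPORT",
    "MODELLING_CARDINALITY", "SAMPLE_SIZE", "STOPPING_TOLERANCE",
}

def build_create_model(model_name, key_column, input_columns, **parameters):
    """Assemble a DMX CREATE MINING MODEL statement as a string (hypothetical helper)."""
    unknown = set(parameters) - KNOWN_PARAMETERS
    if unknown:
        raise ValueError(f"unknown algorithm parameters: {sorted(unknown)}")
    # Assumption: an integer key and continuous numeric inputs.
    columns = [f"  [{key_column}] LONG KEY"]
    columns += [f"  [{name}] DOUBLE CONTINUOUS" for name in input_columns]
    params = ", ".join(f"{name} = {value}" for name, value in parameters.items())
    using = "USING Microsoft_Clustering" + (f" ({params})" if params else "")
    body = ",\n".join(columns)
    return f"CREATE MINING MODEL [{model_name}] (\n{body}\n) {using}"

# Hypothetical model and column names.
statement = build_create_model(
    "ThreeKingdomsClusters",
    "CharacterId",
    ["Force", "Intelligence", "Politics", "Charm"],
    CLUSTER_COUNT=4,
    CLUSTERING_METHOD=1,   # scalable EM
    CLUSTER_SEED=42,
)
print(statement)
```

Fixing `CLUSTER_SEED` makes repeated runs start from the same random initialization, which helps when comparing the effect of changing `CLUSTER_COUNT` or `CLUSTERING_METHOD`.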

The Mining Model Viewer presents the results of the clustering analysis. The Cluster Diagram shows how strongly the clusters are associated with one another and gives a further view of how the data is distributed. Right-click any cluster node and select "Drill Through" from the menu that appears to browse the sample cases that belong to that cluster.

The Cluster Profiles chart helps in understanding how strongly the dependent and independent variables are related.

"Cluster Characteristics" mainly presents the characteristics of each cluster.

"Cluster Discrimination" mainly shows a comparison between the characteristics of two clusters.


Reference documents:
Microsoft Cluster analysis algorithm
http://msdn.microsoft.com/zh-cn/library/ms174879.aspx