Summary: Data Mining: three categories and six items

Source: Internet
Author: User

Data Mining can be divided into three categories and six items:

Classification and Clustering belong to the classification and segmentation class;

Regression and Time-series belong to the prediction class;

Association and Sequence belong to the Sequence rule class.

Classification computes a result from the values of certain variables and then assigns each record to a class based on that result. (The result is ultimately one of several discrete values, for example dividing a set of data into two types: "may respond" or "may not respond".) Classification is often used to solve problems such as screening the recipients of a mailing. Data that has already been classified based on historical experience is studied to find its features, and those features are then used to predict the class of unclassified or new data. The classified data used to find the features may come from existing customer data, or from a partial sample of a complete database that is then tested in actual operation. For example, a classification model is built from a partial sample of a large mailing-list database, and the model is then used to classify and predict the remaining or new data in the database.
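As a rough illustration of learning features from already-classified data and then predicting new records, here is a minimal sketch using a nearest-centroid classifier. The technique, the two features (age and purchases last year), and all data values are assumptions chosen for illustration; the article does not prescribe a particular algorithm.

```python
from statistics import mean

# Hypothetical historical mailing data: (age, purchases_last_year) -> known class.
labeled = [
    ((25, 1), "may not respond"),
    ((30, 0), "may not respond"),
    ((45, 6), "may respond"),
    ((50, 8), "may respond"),
]

def train_centroids(rows):
    """Average the feature vectors of each class (the features learned from history)."""
    by_label = {}
    for features, label in rows:
        by_label.setdefault(label, []).append(features)
    return {
        label: tuple(mean(column) for column in zip(*vectors))
        for label, vectors in by_label.items()
    }

def classify(centroids, features):
    """Assign the class whose centroid is closest (squared Euclidean distance)."""
    def distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(centroids, key=lambda label: distance(centroids[label]))

centroids = train_centroids(labeled)
new_customer = classify(centroids, (48, 5))
other_customer = classify(centroids, (28, 1))
print(new_customer)    # may respond
print(other_customer)  # may not respond
```

In practice the model would be trained on a sample of the database and checked against held-out records before being applied to the rest.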

Clustering divides data into groups. The purpose is to find the differences between groups and the similarities among the members of each group. Unlike Classification, Clustering does not know the classification method or basis before the analysis. The meaning of the resulting groups therefore has to be interpreted with the help of professional domain knowledge.
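The grouping idea can be sketched with a plain k-means loop on one-dimensional data. k-means is only one possible clustering method, and the spending figures and starting centers below are made up for illustration.

```python
def kmeans_1d(values, centers, iterations=10):
    """Plain k-means on one-dimensional data with caller-supplied starting centers."""
    centers = list(centers)
    groups = []
    for _ in range(iterations):
        # Assignment step: each value joins its nearest center.
        groups = [[] for _ in centers]
        for value in values:
            nearest = min(range(len(centers)), key=lambda i: abs(value - centers[i]))
            groups[nearest].append(value)
        # Update step: move each center to the mean of its group (keep it if the group is empty).
        centers = [
            sum(group) / len(group) if group else center
            for group, center in zip(groups, centers)
        ]
    return centers, groups

# Hypothetical monthly spending per customer.
spend = [12, 15, 14, 90, 95, 88, 300, 310]
centers, groups = kmeans_1d(spend, centers=[0, 100, 400])
print(groups)  # [[12, 15, 14], [90, 95, 88], [300, 310]]
```

The algorithm only produces the three groups; deciding that they mean, say, occasional, regular, and heavy buyers is the interpretation step that needs domain knowledge.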

Regression uses a series of existing values to predict the likely value of a continuous variable. In a broader sense, Logistic Regression can also be used to predict categorical variables. With modern analysis techniques such as neural networks and decision trees, prediction models are no longer limited to the traditional linear model, which greatly increases both the flexibility in choosing tools and the breadth of application.
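For the traditional linear case, predicting a continuous value from existing values can be sketched with ordinary least squares on a single input variable. The variable names and numbers are illustrative assumptions.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data that follows y = 2x + 1 exactly.
ad_spend = [1, 2, 3, 4, 5]
sales = [3, 5, 7, 9, 11]

slope, intercept = fit_line(ad_spend, sales)
predicted = slope * 6 + intercept  # predict sales for an unseen ad spend of 6
print(slope, intercept, predicted)
```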

Time-Series Forecasting is similar to Regression, but it uses existing values to predict future values. The biggest difference between the two is that the values analyzed by Time-Series Forecasting are tied to time. Time-Series Forecasting tools can handle time-related characteristics such as periodicity, hierarchy, seasonality, and other special factors (such as the relationship between past and future values).
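One of the simplest ways to account for seasonality is a seasonal naive forecast, which repeats the value observed one full season earlier. This is a minimal sketch with made-up quarterly figures, not a method named by the article.

```python
def seasonal_naive_forecast(history, season_length, horizon):
    """Forecast each future step with the value from one full season earlier."""
    if len(history) < season_length:
        raise ValueError("need at least one full season of history")
    forecast = []
    for step in range(horizon):
        source = len(history) + step - season_length
        if source < len(history):
            forecast.append(history[source])
        else:
            # Beyond one season ahead, reuse the forecasts already made.
            forecast.append(forecast[source - len(history)])
    return forecast

# Hypothetical quarterly sales for two years (season length = 4 quarters).
quarterly_sales = [100, 120, 90, 150, 110, 130, 95, 160]
forecast = seasonal_naive_forecast(quarterly_sales, season_length=4, horizon=6)
print(forecast)  # [110, 130, 95, 160, 110, 130]
```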

Association finds the items that appear together in an event or record. For example, if item A is part of an event, the probability that item B also appears in that event is high. (For example, if a customer buys ham and orange juice, the chance that the customer also buys milk is 85%.)
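A figure like the 85% above corresponds to what is usually called the confidence of a rule: among the transactions that contain the left-hand items, the share that also contain the right-hand item. The sketch below computes support and confidence for one rule; the baskets are invented, so the result is 75% rather than 85%.

```python
def rule_stats(transactions, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    with_antecedent = [t for t in transactions if antecedent <= set(t)]
    with_both = [t for t in with_antecedent if consequent <= set(t)]
    support = len(with_both) / len(transactions)
    confidence = len(with_both) / len(with_antecedent) if with_antecedent else 0.0
    return support, confidence

# Hypothetical shopping baskets.
baskets = [
    {"ham", "orange juice", "milk"},
    {"ham", "orange juice", "milk"},
    {"ham", "orange juice", "bread"},
    {"ham", "orange juice", "milk", "eggs"},
    {"bread", "milk"},
]

support, confidence = rule_stats(baskets, {"ham", "orange juice"}, {"milk"})
print(support, confidence)  # 0.6 0.75
```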

Sequence Discovery is closely related to Association. The difference is that the related events in Sequence Discovery are separated by time. (For example, if stock A rises by 12% in one day and the weighted stock market index falls on the same day, the probability that stock B rises within two days is 68%.)
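The time-separated version of the same calculation can be sketched as follows: find the days on which the triggering events occur together, then count how often the outcome follows within a window. The event names and the daily log are hypothetical, and "within two days" is assumed to mean the next two days after the trigger day.

```python
def sequence_confidence(events, trigger, outcome, window):
    """Share of trigger days that are followed by `outcome` within `window` days.

    `events` maps a day number to the set of event names observed that day.
    """
    trigger = set(trigger)
    trigger_days = [day for day, names in events.items() if trigger <= names]
    hits = sum(
        1
        for day in trigger_days
        if any(outcome in events.get(day + k, set()) for k in range(1, window + 1))
    )
    return hits / len(trigger_days) if trigger_days else 0.0

# Hypothetical daily event log.
events = {
    1: {"A_up", "index_down"},
    2: {"B_up"},
    3: {"A_up"},
    4: {"A_up", "index_down"},
    5: set(),
    6: {"B_up"},
    7: {"A_up", "index_down"},
    8: set(),
    9: set(),
    10: {"B_up"},
}

confidence = sequence_confidence(events, {"A_up", "index_down"}, "B_up", window=2)
print(confidence)  # two of the three trigger days are followed by B_up
```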

Data Mining is widely used in many fields. Any industry that has a data warehouse or database with analytical value and a need for analysis can use mining tools for purposeful exploration and analysis. Common application cases are found mostly in retail, direct marketing, manufacturing, finance and insurance, telecommunications, and medical services.

In retail, common examples include discovering customers' consumption habits from sales data, identifying the product combinations customers prefer from transaction records, identifying the characteristics of lost customers, and introducing new products. Direct marketing emphasizes customer segmentation and database marketing, and Data Mining technology makes it more powerful. For example, Data Mining is used to analyze the consumption behavior and transaction records of customer groups, and basic customer data is used to separate customers by their level of brand value, so that differentiated marketing can be achieved. In manufacturing, Data Mining is used mostly for quality control: the factors that most affect product quality are identified in the manufacturing process in order to improve the efficiency of operations.

Recently, telephone companies, credit card companies, insurance companies, and stock traders have become very interested in Fraud Detection, because fraud causes considerable losses each year. Data Mining can identify common characteristics in the data of customers with poor credit and predict possibly fraudulent transactions in order to reduce losses. The finance industry can use Data Mining to analyze market trends and predict the operations and stock prices of individual companies. Another distinctive use of Data Mining is in the medical industry, to predict the effectiveness of surgery, medication, diagnosis, or process control.

Generally, the theory and technology of Data Mining can be divided into two types: traditional techniques and improved techniques. Traditional techniques are represented by statistical analysis. Sequence statistics, probability theory, regression analysis, and categorical data analysis are all traditional data mining techniques. Because the objects of Data Mining are mostly data with many variables and very large samples, the relevant methods come in particular from multivariate analysis in advanced statistics: Factor Analysis to reduce the number of variables, Discriminant Analysis for classification, and Cluster Analysis to separate groups.

Among the improved techniques, Decision Trees, Neural Networks, and Rule Induction are widely used. A decision tree is a prediction model that uses a tree structure to show how different variables affect the data. It builds classification rules from the effects of the variables on the target variable and is generally used for analyzing customer data, for example to find the combination of variables that determines whether the recipient of a mailing replies. The commonly used classification methods are CART (Classification and Regression Trees) and CHAID (Chi-Square Automatic Interaction Detector).

A Neural Network is a data analysis model that simulates the thinking structure of the human brain. It learns from the input variables and values and constantly adjusts its parameters based on what it has learned in order to construct patterns. A Neural Network is a non-linear design. Compared with traditional regression analysis, its advantage is that the form of the model does not have to be specified in advance; in particular, interaction effects between variables can be detected automatically. Its disadvantage is that the analysis process is a black box: the result usually cannot be presented as a readable model, and the weights and transformations at each stage are not transparent. Neural networks are therefore mostly used when the data is highly nonlinear and contains considerable interaction effects between variables.
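To make the decision tree idea more concrete, the sketch below finds the single best split of one numeric variable using Gini impurity, the criterion commonly associated with CART. It covers only one split on one variable rather than a full tree, and the mailing data is invented.

```python
def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(values, labels):
    """Return (threshold, weighted_gini) for the best 'value <= threshold' split,
    or None when all values are identical and no split is possible."""
    best = None
    n = len(values)
    for threshold in sorted(set(values))[:-1]:
        left = [label for value, label in zip(values, labels) if value <= threshold]
        right = [label for value, label in zip(values, labels) if value > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if best is None or score < best[1]:
            best = (threshold, score)
    return best

# Hypothetical mailing recipients: purchases last year and whether they replied.
purchases = [0, 1, 2, 5, 6, 8]
replied = ["no", "no", "no", "yes", "yes", "yes"]

threshold, impurity = best_split(purchases, replied)
print(threshold, impurity)  # 2 0.0
```

A full tree repeats this search on each resulting subset and across all candidate variables, which is how combinations of variables emerge as classification rules.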

Rule Induction is the most common format in the field of knowledge discovery. It is a data segmentation technique expressed as a series of "If ... Then ..." logic rules. In actual use, the biggest problem is how to define rules so that they are effective. Generally, items with too few records need to be removed first to avoid meaningless logic rules.
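The filtering step can be illustrated with a deliberately simple rule generator: for every attribute value it proposes "IF attribute = value THEN target = majority class" and drops candidates backed by too few records. The function name, the `min_count` parameter, and the records are hypothetical; real rule induction systems use more elaborate search and quality measures.

```python
from collections import Counter, defaultdict

def induce_rules(records, target, min_count):
    """Return (attribute, value, predicted_class, accuracy) tuples for
    single-condition rules supported by at least `min_count` records."""
    outcomes = defaultdict(Counter)
    for record in records:
        for attribute, value in record.items():
            if attribute != target:
                outcomes[(attribute, value)][record[target]] += 1
    rules = []
    for key in sorted(outcomes):
        counts = outcomes[key]
        total = sum(counts.values())
        if total < min_count:
            continue  # too few records: the rule would be meaningless
        prediction, hits = counts.most_common(1)[0]
        rules.append((key[0], key[1], prediction, hits / total))
    return rules

# Hypothetical customer records.
records = [
    {"region": "north", "member": "yes", "replied": "yes"},
    {"region": "north", "member": "yes", "replied": "yes"},
    {"region": "north", "member": "no", "replied": "no"},
    {"region": "south", "member": "no", "replied": "no"},
    {"region": "south", "member": "no", "replied": "no"},
    {"region": "east", "member": "yes", "replied": "yes"},
]

rules = induce_rules(records, target="replied", min_count=2)
for attribute, value, prediction, accuracy in rules:
    print(f"IF {attribute} = {value} THEN replied = {prediction} ({accuracy:.0%})")
```

The single "east" record is excluded by the threshold, so no rule is produced for it.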
