With the intensification of market competition, China Telecom is facing increasing pressure, and customer churn is also rising. According to the statistics, the number of fixed-line and PHS cancellations this year has exceeded the number of new accounts. In such a grim market, the urgent task is to make every effort to reduce customer loss. Therefore, it is necessary to establish a set of models that can predict the customer churn rate in time by using data
1 What is data mining?
The most commonly accepted definition of "data mining" is the discovery of "models" for data.
1.1 Statistical modeling
Statisticians were the first to use the term "data min
This is my first contact with data mining. Having studied and admired Daniel's article, I hope to add my own understanding
What are clustering, classification, and regression?
Article 1: Commonly used data mining methods (classification, regression, clustering, association rules, etc.), leaning slightly toward conceptual interpretatio
Purpose of collecting web logs
Web log mining refers to the use of data mining technology to analyze and process the log data generated when site users access the Web server, so as to discover Web users' access patterns and interests. Such information is potentially useful for site construction and unde
First, data mining
Data mining is an advanced process of using computer and information technology to extract useful knowledge implicit in large and incomplete data sets.
Web data mining
Spatial Data
Multimedia Data
For example, image data
Description-based retrieval system: keywords, titles, dimensions, etc.
Content-based retrieval system: color composition, texture, shape, object and wavelet transformation.
Time series data and sequence data
Trend Analysis
I plan to organize the basic concepts and algorithms of data mining, including common algorithms for association rule mining, classification, and clustering, so stay tuned. Today we cover the most basic knowledge of association rule mining.
Association rules minin
Among the various data mining algorithms, association rule mining is an important one. Popularized especially by market basket analysis, association rules are applied in many real businesses. This article gives a short summary of association rule mining. First, like clustering algorithms, association rule
(x)} = {p(x \cap y) \over p(x) p(y)}
Downward closure property
If an itemset satisfies the minimum support requirement, then every non-empty subset of that itemset must also satisfy the minimum support.
Introduction to the Apriori algorithm
Apriori is a frequent-itemset algorithm for mining association rules. Its core idea is to mine frequent itemsets in two stages: candidate set generation and downward closure checking.
Ap
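The Apriori idea described above (generate candidates, then prune them with the downward closure property) can be sketched in a few lines of Python. This is a minimal illustration, not the implementation from the truncated article: the function name `apriori`, the use of an absolute count for `min_support`, and the basket data in the usage note are all assumptions made for the example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): count} for every frequent itemset.

    min_support is an absolute transaction count (an assumption of this sketch).
    """
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count single items and keep the frequent ones.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)

    k = 2
    while frequent:
        prev = set(frequent)
        # Join step: combine frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step (downward closure): every (k-1)-subset must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        # Count the surviving candidates against the transactions.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1
    return result
```

For example, with the hypothetical baskets `[{"bread", "milk"}, {"bread", "diaper", "beer", "eggs"}, {"milk", "diaper", "beer", "cola"}, {"bread", "milk", "diaper", "beer"}, {"bread", "milk", "diaper", "cola"}]` and `min_support=3`, the candidate `{"bread", "diaper", "beer"}` is pruned without being counted, because its subset `{"bread", "beer"}` is not frequent.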
Tags: using SP data, BS, users, technical objects, different methods
First:
Data type,
Different attributes of an object are described by different data types, such as age --> int; birthday --> date. Different data types must be treated differently in data mining.
Second:
rule algorithm: Apriori
First, a few technical terms.
Mining dataset: the collection of data to be mined. This is easy to understand.
Frequent patterns: patterns that occur frequently in the mining dataset, such as itemsets, substructures, and subsequences. How should this be understood? In short, mining data
Data preprocessing
STEP1: Data sampling: When building a customer churn model, churned customers usually account for a very small proportion of all customers. In this case, the best approach is to retain the entire churned-customer population and sample only the non-churned population, so that the ratio of churned to non-churned customers falls in the range 1:1~1:2
STEP2:
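The sampling described in STEP1 can be sketched as follows. This is only an illustration under stated assumptions: the record layout (a dict with a boolean `churned` field), the function name `balance_churn_sample`, and the `ratio` and `seed` parameters are hypothetical and do not come from the original article.

```python
import random


def balance_churn_sample(customers, ratio=2, seed=0):
    """Keep every churned customer and downsample the non-churned ones.

    ratio is the maximum number of non-churned customers kept per churned
    customer, so ratio=1 gives about 1:1 and ratio=2 gives about 1:2.
    """
    churned = [c for c in customers if c["churned"]]
    retained = [c for c in customers if not c["churned"]]

    # Never ask for more non-churned records than actually exist.
    n_keep = min(len(retained), ratio * len(churned))

    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return churned + rng.sample(retained, n_keep)
```

With 10 churned customers among 1,000 records and `ratio=2`, the result keeps all 10 churned customers plus 20 randomly chosen non-churned ones.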
Distinction between classification and clustering
Classification:
A classifier "learns" from the training data it receives and thereby gains the ability to classify unknown data; this process is typically called supervised learning. Classification, in simple terms, assigns text or other objects to existing categories based on their characteristics or attributes. Common classification algorithms include: Decision t
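To make the "learn from labeled data, then classify unknown data" idea concrete, here is a minimal sketch of a nearest-centroid classifier in plain Python. It is chosen only because it is short; it is not necessarily one of the algorithms listed in the truncated article, and the names `train` and `predict` and the sample points are assumptions of this example.

```python
import math
from collections import defaultdict


def train(samples, labels):
    """Learn one centroid (mean feature vector) per class from labeled data."""
    sums = {}
    counts = defaultdict(int)
    for x, y in zip(samples, labels):
        if y not in sums:
            sums[y] = [0.0] * len(x)
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def predict(centroids, x):
    """Assign x to the existing category whose centroid is closest."""
    return min(centroids, key=lambda y: math.dist(centroids[y], x))


# Hypothetical training data: two features per sample, two known categories.
model = train([(1, 1), (1, 2), (8, 9), (9, 8)], ["low", "low", "high", "high"])
print(predict(model, (2, 1)))
print(predict(model, (7, 9)))
```

The categories are fixed by the training labels, which is the point of contrast with clustering, where no labels are given in advance.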
I. Concepts
Association rule mining: discovering interesting and frequent patterns, associations, and correlations among item sets in large amounts of data, such as the food database and relational databases.
Measures of the interestingness of association rules: support, confidence
K-item set: a set of K items
Frequency of the item set: number of transactions that contain the item set
Frequent Item Se
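The measures defined above can be computed directly from a list of transactions. The sketch below assumes each transaction is a collection of item names; the function names and the basket data are hypothetical, introduced only for illustration.

```python
def support_count(transactions, itemset):
    """Frequency of the itemset: number of transactions that contain it."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t))


def support(transactions, itemset):
    """Fraction of all transactions that contain the itemset."""
    return support_count(transactions, itemset) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that
    also contain the consequent."""
    both = support_count(transactions, set(antecedent) | set(consequent))
    base = support_count(transactions, antecedent)
    return both / base if base else 0.0


# Hypothetical baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "diaper", "beer", "eggs"},
    {"milk", "diaper", "beer", "cola"},
    {"bread", "milk", "diaper", "beer"},
    {"bread", "milk", "diaper", "cola"},
]
print(support(baskets, {"diaper", "beer"}))       # 0.6
print(confidence(baskets, {"diaper"}, {"beer"}))  # 0.75
print(confidence(baskets, {"beer"}, {"diaper"}))  # 1.0
```

Note that confidence is not symmetric: every basket with beer also has diapers, but only three of the four baskets with diapers have beer.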
transactions are split per user by shell + IP + hostname according to each user's login (records where all three match belong to the same user). On this basis, the basic principle of two algorithms for mining frequent patterns in user input command sequences is implemented.
The FP-growth algorithm mainly solves the problem of finding, across multiple sets, the frequent itemsets whose number of occurrences reaches a certain threshold. An FP tree is a compressed representation of input
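As a rough illustration of how an FP tree compresses its input, the sketch below builds one in two passes: count the items, then insert each transaction with its frequent items sorted by descending count so that shared prefixes share nodes. The class `FPNode`, the function `build_fp_tree`, the alphabetical tie-break, and the sample transactions are assumptions of this example; header tables and the mining step are omitted.

```python
from collections import Counter


class FPNode:
    def __init__(self, item, parent=None):
        self.item = item        # item name (None for the root)
        self.count = 0          # number of transactions passing through
        self.parent = parent
        self.children = {}      # item name -> FPNode


def build_fp_tree(transactions, min_count):
    # Pass 1: count items and keep those that reach the threshold.
    freq = Counter(item for t in transactions for item in set(t))
    freq = {i: c for i, c in freq.items() if c >= min_count}

    root = FPNode(None)
    # Pass 2: insert each transaction, most frequent items first
    # (ties broken alphabetically so the order is deterministic).
    for t in transactions:
        items = sorted((i for i in set(t) if i in freq),
                       key=lambda i: (-freq[i], i))
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
            child.count += 1
            node = child
    return root, freq


# Hypothetical input: "d" is below the threshold and is dropped.
root, freq = build_fp_tree([["a", "b"], ["b", "c", "d"], ["a", "b", "c"], ["b"]], 2)
print(root.children["b"].count)  # 4: all four transactions share the "b" prefix
```

Four transactions collapse into a tree with only four item nodes, because every transaction starts with the same most frequent item.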
('relative importance')
plt.draw()
plt.show()
The code is a bit long, but it has two main parts: one trains the model, and the other uses the importances obtained from training to select the important features and plot them.
The attributes with importance greater than 18 are shown in the following illustration:
We can see that the three attributes TILTLE_MR, title_id, and gender are important. The title-related attributes were derived from our analysis of the name, so it can be seen that some string propertie
Frequent itemsets
Let $I = \{i_1, i_2, ..., i_m\}$ be a set of items, and let $D = \{t_1, t_2, ..., t_n\}$ be the transaction dataset, where each transaction $t_i$, $i \in [1, n]$, consists of several items in $I$.
Let $S$ be a set of items, $S = \{i \mid i \in I\}$, called an itemset for short. An itemset that contains $k$ items is called a $k$-itemset.
$T$ is a transacti
The previous few articles described several basic concepts and two basic algorithms for association rules. In commercial applications, however, writing algorithms matters less than understanding the data, mastering the data, and using tools. The preceding articles built an understanding of the algorithms; this article will introduce the open source utilizes the
only 1. So the count of a conditional pattern base is determined by the minimum count of the nodes on its path.
From the conditional pattern bases, we can obtain the conditional FP tree for that commodity, for example i5:
From the conditional FP tree, we can enumerate all combinations to obtain the mined frequent patterns (here the commodity itself, such as i5, is also counted in; each frequent pattern mined for a commodity mus
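The combination step can be sketched as follows for the simple case where the conditional FP tree is a single path. The function name `patterns_from_single_path` and the counts in the usage example are hypothetical (the original figure for i5 is not available here), and branching conditional trees, which need recursive mining, are not handled.

```python
from itertools import combinations


def patterns_from_single_path(path, suffix_item):
    """Enumerate frequent patterns from a single-path conditional FP tree.

    path is a list of (item, count) pairs from root to leaf. Every
    non-empty combination of path items is joined with suffix_item, and
    its count is the minimum count among the chosen nodes.
    """
    patterns = {}
    for r in range(1, len(path) + 1):
        for combo in combinations(path, r):
            items = frozenset(i for i, _ in combo) | {suffix_item}
            patterns[items] = min(c for _, c in combo)
    return patterns


# Hypothetical conditional FP tree for i5: the single path i2:3 -> i1:2.
for items, count in patterns_from_single_path([("i2", 3), ("i1", 2)], "i5").items():
    print(sorted(items), count)
```

Every generated pattern contains the suffix commodity i5. The pattern consisting of i5 alone is not produced here, because its count comes from the original tree rather than from the conditional path.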