PS: Due to space constraints, this blog mainly introduces the Project Understanding phase of the standardized data mining process. The remaining five phases, in particular modeling and other steps that involve specific algorithms, will be covered in follow-up posts using open-source tools such as Orange and KNIME, or some small Python programs.
Part of this article is translated, and part is drawn from my own small experiences in data mining projects, which may not be entirely correct. If you need an electronic copy of the relevant books, you can also contact me at flclain@gmail.com
There are several standardized procedures for data mining:
SEMMA
(Sample, Explore, Modify, Model, Assess, used by SAS Institute Inc.)
CRISP-DM
(Cross-Industry Standard Process for Data Mining, as defined by the CRISP-DM Consortium)
KDD Process
Here we use the CRISP-DM process to briefly describe what a standardized data mining process looks like.
The CRISP-DM process is divided into six main phases: Business (Project) Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment.
This blog is mainly about Project Understanding.
In this step, we should first ask ourselves the following questions:
1. What is the problem and what are the expected benefits?
This is the process of refining requirements. We need to clearly define the project's requirements and its final expected benefits, for example a concrete target such as "increase the conversion rate by 5%".
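As a tiny illustration of making such a target measurable, the sketch below computes a conversion rate and checks it against a "+5%" goal. All numbers are placeholders, and the reading of "5%" as an absolute gain is itself an assumption that would need to be confirmed with the project owner.

```python
# A minimal sketch of turning "increase the conversion rate by 5%" into a
# measurable check. All numbers below are placeholders, not real data.
def conversion_rate(purchases: int, visits: int) -> float:
    return purchases / visits

baseline_rate = conversion_rate(purchases=400, visits=10_000)  # 4.0%
current_rate = conversion_rate(purchases=650, visits=10_000)   # 6.5%

# Here "increase by 5%" is read as an absolute gain of 5 percentage points;
# whether the target is absolute or relative is exactly the kind of ambiguity
# that refining requirements should resolve with the project owner.
target_rate = baseline_rate + 0.05
print("target met:", current_rate >= target_rate)
```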
2. What is the solution?
We need to be able to clearly explain to the project owner what the final solution is expected to look like and how it is expected to perform.
3. What do we know about this domain?
A deep understanding of the project is built on an understanding of the relevant domain.
Suppose we are the data analyst. Between us and the project leader (the project owner), there are always various problems, rooted in communication, lack of understanding, and organization, as shown in the following table:
To refine the requirements and facilitate correct communication and understanding, we need to determine the project objective.
Of course, repeating the requirements back is one solution. In fact, in my previous work, the leader always asked me to repeat the requirements so that we could check the consistency of our mutual understanding. Another method is to use a mind map or a cognitive map to briefly describe beliefs, experiences, known factors, and how they interact.
Taking as an example a project about the frequency of items in a mall customer's shopping basket, its cognitive map is as follows:
The gray node in the middle is our quantity of interest. The arrows indicate the positive and negative effects of different factors. For example, a customer's ability to pay has a positive impact on the frequency of items in the shopping basket, while a shopping opportunity has a negative impact on the customer's ability to pay.
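As a small aside, such a cognitive map can be captured directly as a signed directed graph. The sketch below uses networkx; the node names and the extra "income" edge are illustrative assumptions based on the description above, not the exact figure from the original post.

```python
# A minimal sketch of the shopping-basket cognitive map as a signed directed
# graph. Node names and edge signs are illustrative assumptions only.
import networkx as nx

cognitive_map = nx.DiGraph()

# Each edge carries a sign: +1 for a positive influence, -1 for a negative one.
edges = [
    ("ability to pay", "basket item frequency", +1),
    ("shopping opportunity", "ability to pay", -1),
    ("income", "ability to pay", +1),  # assumed extra factor, for illustration
]
for source, target, sign in edges:
    cognitive_map.add_edge(source, target, sign=sign)

# Print every direct influence on the quantity of interest.
for source, _, data in cognitive_map.in_edges("basket item frequency", data=True):
    effect = "positive" if data["sign"] > 0 else "negative"
    print(f"{source} has a {effect} effect on basket item frequency")
```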
There are two points to pay attention to when constructing a cognitive map:
1. Only direct influences should be shown in the figure.
2. Each node should be chosen carefully: a factor should become a node only if it fits naturally into a sentence of the form "when ... is higher, ...".
Once the cognitive map has been created, the project objective should be stated precisely.
One format is a table divided into objective, deliverables, and success criteria, as shown below:
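As a hedged illustration for the shopping-basket project (all wording below is my own example, not taken from the original table), such a definition could be captured like this:

```python
# An illustrative project definition in the objective / deliverables /
# success-criteria format described above. All wording is a made-up example.
project_definition = {
    "objective": "Understand which factors drive the frequency of items "
                 "in a customer's shopping basket.",
    "deliverables": ["analysis report", "estimation model",
                     "list of verified assumptions"],
    "success_criteria": "The model's estimates of basket item frequency are "
                        "accurate enough to support marketing decisions.",
}
print(project_definition["objective"])
```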
The next step is to estimate the challenges in this data analysis project. This requires a review of the available resources, the requirements, and the challenges in the project.
The most important resources are data and knowledge, that is, the database (usually with the data dumped locally) and the background knowledge provided by experts.
Based on our discussions and research into the domain, we can obtain a list of explicit or implicit assumptions and risks. This list should be useful and project-specific for examining the data we obtain in the subsequent data analysis steps.
It is worth noting that the expectation that the problem can be solved with the existing data can lead to endlessly changing and "optimizing" models, when we should instead consider whether the data is simply not suitable for this problem.
We need to carefully track assumptions and verify them as early as possible.
General requirements and assumptions include:
Requirements and constraints
- model requirements (for example, the model may be required to be interpretable)
- ethical, legal, and political constraints (attributes such as gender, age, and race may not be allowed to be used)
- technical constraints (performance requirements of the solution)
Assumptions
- representativeness of the sample (sampling problem)
- informativeness (the attributes in the database should capture all the influencing factors)
- good data quality (correct, comprehensive, up-to-date, and unambiguous); a sketch of some basic checks follows this list
- stability with respect to external factors (we assume that the external world will not change frequently)
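The sketch below shows how some of these assumptions could be probed early on with pandas. The file and column names ("basket_data.csv", "order_date", "customer_id") are illustrative assumptions, not from the original post, and this is only a starting point rather than a full data quality audit.

```python
# A minimal sketch of early data quality and representativeness checks.
# File and column names are illustrative assumptions.
import pandas as pd

df = pd.read_csv("basket_data.csv")

# Completeness: how many values are missing per attribute?
print(df.isna().mean().sort_values(ascending=False))

# Correctness / unambiguity: are there duplicate records?
print("duplicate rows:", df.duplicated().sum())

# Up-to-date: what period does the data actually cover?
dates = pd.to_datetime(df["order_date"], errors="coerce")
print("date range:", dates.min(), "to", dates.max())

# Representativeness: does the sample cover enough distinct customers?
print("distinct customers:", df["customer_id"].nunique())
```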
Finally, we need to define our analysis goals, that is, to translate the project objective into a more technical data mining goal. Specifically, this means further refining the requirements, defining measurable indicators, and building a simple estimation model. There is not much to say about model selection here; we only need to pay attention to what each model is suited for, since each has its own advantages and disadvantages.
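As a hedged sketch of such a simple estimation model for the shopping-basket example, the code below fits a linear regression baseline with scikit-learn and reports one indicator. The column names and the choice of model are assumptions for illustration only.

```python
# A minimal baseline estimation model for the shopping-basket example.
# Column names and the linear regression choice are illustrative assumptions.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

df = pd.read_csv("basket_data.csv")
X = df[["ability_to_pay", "visits_per_month"]]
y = df["basket_item_frequency"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

baseline = LinearRegression().fit(X_train, y_train)
predictions = baseline.predict(X_test)

# A simple indicator of how well the baseline meets the project goal.
print("MAE:", mean_absolute_error(y_test, predictions))
```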
Data Understanding
The next part will be written up gradually in follow-up posts.