Data cleaning is an indispensable part of the data analysis process, and its quality directly affects model performance and final conclusions. In practice, data cleaning typically takes up 50% to 80% of an analysis project. Some academic institutions abroad even study how to do data cleaning, and there are many books on the subject.
In this article, I will try to lay out the data cleaning process at a fairly introductory level, for your reference.
Preprocessing stage
The preprocessing stage mainly involves two things:
One is to import the data into a processing tool. Generally, I recommend using a database; a single-machine MySQL environment is enough. If the data volume is large (above the tens-of-millions level), you can use text file storage plus Python for processing.
The second is to look at the data. There are two parts here: first, look at the metadata, including field definitions, data sources, code tables, and everything else that describes the data; second, extract a sample of the data and inspect it manually to get an intuitive feel for the data itself and spot some initial problems, preparing for the subsequent processing.
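A minimal sketch of this first look, assuming the data can be loaded into pandas (the sample rows below are fabricated; in practice you would read the real file or database table):

```python
import io

import pandas as pd

# Hypothetical raw export; in practice this would be pd.read_csv("data.csv")
raw = io.StringIO(
    "name,gender,age,id_number\n"
    "Chen Danyi,F,34,110103XXXXXX\n"
    "  Chen Danyi,F,34,110103XXXXXX\n"
    "Zhao C,M,,310101XXXXXX\n"
)
df = pd.read_csv(raw)

# First look: shape, dtypes, and a small sample of rows
print(df.shape)
print(df.dtypes)
print(df.head())
```

Even this tiny sample already hints at problems (a leading space in a name, a missing age) that the later steps will deal with.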
Step 1: Missing value cleaning
Missing values are the most common data problem, and there are many ways to deal with them. I recommend the following four steps:
1. Determine the scope of missing values: calculate the proportion of missing values for each field, then develop a strategy based on that proportion and the importance of each field.
2. Remove fields that are not needed: this step is very simple, just delete them directly ... But I strongly recommend backing up before every cleaning step, or testing successfully on a small sample before processing the full dataset; otherwise, if you delete the wrong thing, you will regret it with nothing left to recover (as an aside: when writing SQL, a DELETE must come with a WHERE!).
3. Fill in the missing values: some missing values can be filled in using one of the following three methods:
Use business knowledge or experience to infer the missing values
Fill in missing values with statistics computed from the same field (mean, median, mode, etc.)
Fill in missing values with values computed from other fields
The first two methods are easy to understand. For the third, take the simplest example: the age field is missing, but there is an ID number with its last six digits masked, so ...
4. Re-acquire: if a field is very important and its missing rate is high, check with the data collectors or business staff whether there are other channels for obtaining the relevant data.
Above, I have briefly laid out the steps of missing value cleaning, but some of them are much more complicated than described, such as filling in missing values. Many books on statistical methods or statistical tools cover the relevant techniques; those who are interested can study them in depth.
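Steps 1 and 3 above can be sketched as follows, assuming a pandas DataFrame with hypothetical age and city fields (the fill strategies shown, median and mode, are just two of the options listed above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":  [25, np.nan, 31, np.nan, 40],
    "city": ["Beijing", "Shanghai", None, "Beijing", "Beijing"],
})

# Step 1: proportion of missing values per field
missing_ratio = df.isna().mean()
print(missing_ratio)

# Step 3: fill a numeric field with its median, a categorical one with its mode
df["age"] = df["age"].fillna(df["age"].median())
df["city"] = df["city"].fillna(df["city"].mode()[0])
```

The ratios from step 1, together with each field's business importance, decide whether a field is filled, dropped, or re-acquired.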
Step 2: Format and content cleaning
If the data comes from system logs, the format and content will usually be consistent with the metadata description. But if the data is collected manually or filled in by users, there is a high probability of format and content problems. Broadly, these problems fall into the following categories:
1. Inconsistent display formats for times, dates, numeric values, full-width versus half-width characters, etc.
This kind of problem is usually related to the input interface. It may also appear when integrating multi-source data; the fix is to normalize everything into a consistent format.
2. Characters that should not appear in the content
Some fields should only contain certain characters; for example, an ID number is digits plus a letter, and a Chinese name consists of Chinese characters (people like Zhao C are a small minority). The most typical offenders are leading, trailing, and embedded spaces; you may also find digits or symbols in names, Chinese characters in ID numbers, and so on. In such cases, use a semi-automatic, semi-manual verification process to find the problems and remove the unwanted characters.
3. Content that does not match the field
A name written in the gender field, a mobile phone number in the ID number field, and so on. The tricky part is that this cannot be handled by simple deletion, because the cause may be a manual entry error, missing validation at the front end, or columns that were partially or fully misaligned during import. You need to identify the specific problem type before deciding how to fix it.
Format and content problems are detail-level problems, but many analysis mistakes are buried in this pit: cross-table joins or VLOOKUPs fail (an extra space makes the tool think "Chen Danyi" and "Chen Danyi " are two different people), statistics come out incomplete (a letter mixed into a numeric field will of course corrupt the result), model training fails or performs poorly (the data is wrong, dates are mixed with ages, so ...). Please take this part of the cleaning seriously, especially when the data was collected manually, or when you know the product's front-end validation was poorly designed ...
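A minimal sketch of such semi-automatic checks, assuming hypothetical name and id_number fields (the 18-digit ID pattern used here is a simplification of the real format):

```python
import pandas as pd

df = pd.DataFrame({
    "name": [" Chen Danyi", "Chen  Danyi ", "Zhao C"],
    "id_number": ["11010319800101123X", "110103abc", "310101199002021234"],
})

# Strip leading/trailing spaces and collapse internal runs of whitespace
df["name"] = df["name"].str.strip().str.replace(r"\s+", " ", regex=True)

# Flag ID numbers that are not 17 digits followed by a digit or X
df["id_valid"] = df["id_number"].str.match(r"^\d{17}[\dX]$")
print(df)
```

Rows flagged invalid still need a human look, since, as noted above, the right fix depends on why the content is wrong.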
Step 3: Logic error cleaning
The goal of this step is to remove data whose problems can be found through simple logical reasoning, to keep the analysis results from being skewed. It mainly includes the following steps:
1. Deduplication
Some analysts like to put deduplication first, but I strongly recommend putting it after format and content cleaning, for the reason already given (extra spaces make the tool think "Chen Danyi" and "Chen Danyi " are two different people, and deduplication fails). Moreover, not all duplicates are that simple to find ...
I once did data analysis for a telesales operation and found that the salespeople would stop at nothing to grab orders ... For example, a company called "ABC Butler Co., Ltd." is assigned to salesperson A; salesperson B, wanting to grab the customer, enters "ABC Official Butler Co., Ltd." into the system. At a glance you cannot tell the two apart, and even if you can, can you guarantee there is no "ABC Official Home Co., Ltd." as well ... At that point you have to go hug an engineer's leg and ask them to write a fuzzy matching algorithm for you, or find the duplicates with your own eyes.
And that is not even the worst case.
In the system you are using, two different roads may both be called Balizhuang Road. Do you dare to deduplicate them directly? (A deduplication hint: the house-number ranges of the two Balizhuang Roads are not the same.)
Of course, if the data is not manually entered, then simple deduplication is enough.
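A minimal sketch of such fuzzy matching using only the standard library (the company names and the 0.8 threshold are illustrative assumptions; real systems often use more specialized string-similarity measures):

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between two strings."""
    return SequenceMatcher(None, a, b).ratio()


names = [
    "ABC Butler Co., Ltd.",
    "ABC Official Butler Co., Ltd.",
    "XYZ Trading Co., Ltd.",
]

# Flag pairs above an illustrative similarity threshold as duplicate candidates
threshold = 0.8
candidates = [
    (a, b)
    for i, a in enumerate(names)
    for b in names[i + 1:]
    if similarity(a, b) >= threshold
]
print(candidates)
```

Note that this only produces candidates; as the Balizhuang Road example shows, a human (or extra fields like address ranges) still has to decide whether two similar records really are the same entity.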
2. Remove unreasonable values
One sentence is enough: someone fills in the form carelessly, claiming to be 200 years old or reporting an annual income of 1 million yuan (probably did not notice the unit). Such values should either be deleted or treated as missing. How do you find them? Hint: box plots are one option, but not the only one.
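The box-plot hint above can be sketched with the standard 1.5 × IQR fences (the ages are fabricated, and the threshold choice is a convention, not a rule):

```python
import pandas as pd

ages = pd.Series([22, 25, 28, 31, 34, 37, 40, 200])

# Box-plot rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = ages[(ages < lower) | (ages > upper)]
print(outliers.tolist())
```

Whether a flagged value is deleted or converted to missing then follows the same strategies as in Step 1.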
3. Resolve contradictions
Some fields can be cross-checked against each other. For example: the ID number is 1101031980XXXXXXXX, but the age field says 18. Although we all appreciate the wish to stay 18 forever, knowing the real age lets us serve the user better (ahem ...). In such cases, decide which field is more reliable based on its data source, and remove or rebuild the unreliable one.
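A minimal sketch of such a cross-check, assuming 18-digit Chinese ID numbers where digits 7 through 10 encode the birth year (the record, the reference year, and the one-year tolerance are illustrative):

```python
def birth_year_from_id(id_number: str) -> int:
    """Digits 7-10 of an 18-digit Chinese ID number encode the birth year."""
    return int(id_number[6:10])


record = {"id_number": "1101031980XXXXXXXX", "age": 18}

reference_year = 2024  # illustrative; in practice use the collection date
implied_age = reference_year - birth_year_from_id(record["id_number"])

# Flag the record if the two fields disagree by more than a year
contradiction = abs(implied_age - record["age"]) > 1
print(contradiction)
```

Once a record is flagged, which field to trust depends on each field's source, as noted above.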
Beyond the cases listed above, there are many other kinds of logic errors, which should be handled as appropriate in practice. Also note that this step may need to be repeated during the subsequent analysis and modeling process, because even simple problems cannot all be found in one pass. What we can do is use tools and methods to minimize the chance of problems, making the analysis process more efficient.
Step 4: Unneeded data cleaning
This step is very simple: delete unnecessary fields.
But in practice, there are many problems, such as:
Deleting a field that seems unnecessary but is actually important to the business;
Feeling that a field is useful but not knowing how to use it, and being unsure whether to delete it;
Deleting the wrong field in a moment of carelessness.
My advice for the first two cases: if the data volume is not so large that you cannot process it without deleting fields, then keep every field you can. For the third case, please back up your data ...
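The backup-then-delete advice can be sketched as follows (the DataFrame and the internal_note field are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Chen Danyi", "Zhao C"],
    "age": [34, 29],
    "internal_note": ["x", "y"],  # hypothetical field deemed unneeded
})

# Keep a backup before dropping anything, so a wrong delete is recoverable
backup = df.copy()
df = df.drop(columns=["internal_note"])
print(df.columns.tolist())
```

With a database, the equivalent is dumping the table (or testing on a small copy) before any destructive statement.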
Step 5: Relevance verification
If your data comes from multiple sources, it is necessary to verify the associations between them. For example, you have offline purchase records for cars and questionnaire data from telephone customer service, linked by name and mobile number. Check whether the vehicle registered offline by a given person matches the vehicle they mentioned in the phone survey. If not (don't laugh, such problems really do occur when business processes are poorly designed!), you need to adjust or remove the data.
Strictly speaking, this goes beyond the scope of data cleaning; changes to associated data should be handled in the database model. But I still want to remind you: integrating multi-source data is very complicated work, and you must pay attention to the associations between datasets, so that contradictions do not slip into your analysis unnoticed.
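A minimal sketch of the association check described above (the phone numbers and vehicle fields are fabricated; a real check would also join on name, as in the example):

```python
import pandas as pd

offline = pd.DataFrame({
    "phone": ["13800000001", "13800000002"],
    "vehicle_offline": ["Model S", "Model 3"],
})
survey = pd.DataFrame({
    "phone": ["13800000001", "13800000002"],
    "vehicle_survey": ["Model S", "Model Y"],
})

# Join the two sources on the shared key and compare the linked fields
merged = offline.merge(survey, on="phone", how="inner")
merged["consistent"] = merged["vehicle_offline"] == merged["vehicle_survey"]
print(merged[~merged["consistent"]])
```

Inconsistent rows are exactly the records that need to be adjusted or removed before the two sources are analyzed together.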
The above is a brief walkthrough of the data cleaning process. Given the limits of my ability, omissions are inevitable. Feedback is welcome, thanks.