1. CSV is the fastest data intermediary, but if a field's content contains a comma the whole file becomes a mess; in that case the Excel 2007 (xlsx) format is better.
2. When importing, preferably set the table's fields to nvarchar and allow NULLs; do the type conversion later (see the sketch after this list).
3. Then do the data cleansing itself, e.g. on customer data: gender, address, and so on.
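A minimal pandas sketch of this load-as-text-first approach; the file name and column names here are hypothetical:

```python
import pandas as pd

# Load everything as text first (the pandas analogue of "import every
# field as nvarchar") and convert types only after cleaning.
df = pd.read_csv("customers.csv", dtype=str, keep_default_na=False)

# After cleaning, convert columns explicitly; errors="coerce" turns
# unparseable values into NaN/NaT instead of crashing the import.
df["age"] = pd.to_numeric(df["age"], errors="coerce")
df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")
```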
Step two: Format content cleaning
If the data comes from system logs, it is usually consistent with the metadata description in both format and content. If the data was collected manually or filled in by users, there is a good chance of format and content problems. Briefly, these problems fall into the following categories:
1. Inconsistent display formats for times, dates, numeric values, full-width vs. half-width characters, etc.
This problem usually stems from the input side, and can also arise when multi-source data is integrated into one consistent format. For example, a gender field where "male" is written in two different ways, or a province field where "Anhui" appears in two different forms.
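Here is a small sketch of format normalization in pandas: NFKC folding handles full-width/half-width inconsistencies, and simple mapping tables (assumptions here, to be built per dataset) unify spelling variants:

```python
import unicodedata
import pandas as pd

def normalize_text(s: str) -> str:
    # NFKC folds full-width characters to their half-width forms,
    # among other compatibility normalizations.
    return unicodedata.normalize("NFKC", s).strip()

df = pd.DataFrame({"gender": ["male", "Male ", "M"],
                   "province": ["Anhui", "anhui", "AH"]})

df["gender"] = df["gender"].map(normalize_text).str.lower()
df["gender"] = df["gender"].replace({"m": "male", "f": "female"})

df["province"] = df["province"].map(normalize_text).str.title()
df["province"] = df["province"].replace({"Ah": "Anhui"})  # map known variants
print(df)
```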
2. Characters that should not be in the content
Some fields should contain only certain characters: an ID number is digits plus letters, and a Chinese name consists of Chinese characters (names like "Zhao C" are rare exceptions). The most typical offenders are spaces at the head, tail, or middle of a value; digits or symbols may also turn up in name fields, and Chinese characters in ID number fields, and so on. In such cases a semi-automatic, semi-manual verification approach is needed to find the likely problems and strip the unwanted characters.
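A sketch of the semi-automatic half: strip the obvious whitespace automatically, then flag the remaining suspects for manual review. The regex assumes mainland-China 18-character ID numbers, and the sample rows are made up:

```python
import pandas as pd

df = pd.DataFrame({"name": [" Andy Lau", "Andy  Lau ", "Zhao C1"],
                   "id_number": ["11010319800101123X",
                                 "1101031980010 1123X", "abc"]})

# Collapse internal runs of whitespace and trim head/tail spaces.
df["name"] = df["name"].str.replace(r"\s+", " ", regex=True).str.strip()
df["id_number"] = df["id_number"].str.replace(r"\s+", "", regex=True)

# Flag (rather than silently "fix") values that still contain characters
# that should not be there; a human should review these rows.
bad_name = df["name"].str.contains(r"\d", regex=True)
bad_id = ~df["id_number"].str.fullmatch(r"\d{17}[\dXx]")
print(df[bad_name | bad_id])
```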
3. Content that does not match the field it is in
A name in the gender field, a mobile phone number in the ID number field, and so on. This problem is special in that it cannot simply be handled by deletion: the cause may be a manual entry error, missing front-end validation, or columns misaligned for some or all rows during import, so the problem type must be identified in detail.
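A sketch of flag-don't-delete detection. The heuristics (an 11-digit value starting with 1 looks like a Chinese mobile number; anything outside a known value set is not a gender) are assumptions to adapt:

```python
import pandas as pd

df = pd.DataFrame({"gender": ["male", "Liu Wei"],
                   "id_number": ["11010319800101123X", "13800138000"]})

# Flag rows for manual inspection of the entry/import process,
# instead of deleting them outright.
looks_like_phone = df["id_number"].str.fullmatch(r"1\d{10}")
not_a_gender = ~df["gender"].str.lower().isin(["male", "female"])
print(df[looks_like_phone | not_a_gender])
```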
Format and content problems are detail problems, but a great many analysis errors are buried in this pit, for example: cross-table joins or VLOOKUP fail (extra spaces make the tool think "Andy Lau" and "Andy  Lau" are two different people); statistics come out wrong (numbers with letters mixed in naturally produce bad results); model output fails or underperforms (data in the wrong column, dates and ages mixed up, and so on). So please take this part of the cleaning seriously, especially when the data was collected manually, or when you know the product's front-end validation was not well designed...
Step three: Logic error cleaning
The goal of this step is to remove data whose problems can be identified directly by simple logical reasoning, to keep the analysis results from being skewed. It mainly includes the following:
1. Deduplication
Some analysts like to deduplicate as the very first step, but I strongly recommend doing it after the format cleaning, for the reason already given (extra spaces make the tool think "Andy Lau" and "Andy  Lau" are two different people, so deduplication fails). Moreover, not all duplicates can be removed that simply...
During deduplication you can also fill back missing data. For example, if the customer data has three duplicate records, "Miss Liu", "Ms. Liu", "Liu xx", all at the same address, then even if her gender is missing, you know she is female.
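A minimal pandas sketch of fill-back-then-dedup, assuming the address serves as the duplicate key (a real key would usually be stronger, e.g. a phone number):

```python
import pandas as pd

df = pd.DataFrame({
    "name":    ["Miss Liu", "Ms. Liu", "Liu xx"],
    "address": ["1 Main St", "1 Main St", "1 Main St"],
    "gender":  ["female", None, None],
})

# Before dropping duplicates, propagate known values within each
# duplicate group so information is not lost along with the copies.
df["gender"] = df.groupby("address")["gender"].transform(
    lambda s: s.ffill().bfill())

# Then deduplicate on whatever key you trust.
deduped = df.drop_duplicates(subset=["address"])
print(deduped)
```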
I once did analysis on telephone-sales data and found that salespeople grabbing each other's leads is simply ruthless... For example, a company called "ABC Butler Co., Ltd." is already in salesperson A's hands; then salesperson B, in order to grab this customer, enters "ABC Buttler Co., Ltd." into the system. At a glance you cannot tell the difference, and even if you can, can you guarantee there really is no such company as "ABC Buttler Co., Ltd."? ... At this point you either cling to R&D's thigh and ask someone to write you a fuzzy-matching algorithm, or you check it with the naked eye.
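If no one writes you that algorithm, a crude stand-in is the similarity ratio in Python's standard-library difflib; the company names and the 0.9 threshold below are illustrative assumptions to tune per dataset:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    # Ratio in [0, 1]; 1.0 means identical strings.
    return SequenceMatcher(None, a, b).ratio()

names = ["ABC Butler Co., Ltd.", "ABC Buttler Co., Ltd.", "XYZ Trading Co."]

# Compare every pair and surface near-duplicates for human review.
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        score = similarity(names[i], names[j])
        if score > 0.9:
            print(f"possible duplicate ({score:.2f}): "
                  f"{names[i]!r} vs {names[j]!r}")
```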
And the above is not even the most brutal case. Consider this:
The system you use may well contain two different roads that are both called Bali Zhuang Road. Do you dare deduplicate them directly? (Dedup tip: the two Bali Zhuang Roads have different house-number ranges.)
Of course, if the data was not entered manually, then just deduplicate without worry.
2. Remove unreasonable values
One sentence says it all: someone filled in a questionnaire at random, claiming to be 200 years old with an annual income of 1 billion (probably missed the "ten thousand" unit). Such values should either be deleted or treated as missing. How do you find them? Hint: box plots are one option, among others.
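A sketch of the box-plot rule (1.5 × IQR fences) on a made-up age column:

```python
import pandas as pd

ages = pd.Series([23, 31, 28, 45, 39, 200, 33, 27])

# The standard box-plot rule: anything outside
# [Q1 - 1.5*IQR, Q3 + 1.5*IQR] is a candidate outlier. Flag it;
# whether to delete it or treat it as missing is a judgment call.
q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1
outliers = ages[(ages < q1 - 1.5 * iqr) | (ages > q3 + 1.5 * iqr)]
print(outliers)  # the 200-year-old shows up here
```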
3. Correct contradictory content
Some fields can be cross-verified against each other. For example: the ID number is 1101031980XXXXXXXX (which encodes a 1980 birth year), yet the age field says 18. We all sympathize with the wish to stay forever 18, but knowing the real age lets you serve the user better (...ahem). At such times, decide which field provides the more reliable information based on each field's data source, and remove or rebuild the unreliable field.
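A sketch of such a cross-check, relying on the fact that digits 7 to 14 of a mainland-China ID number encode the birth date (YYYYMMDD); the sample records and the reference year are assumptions:

```python
import pandas as pd

df = pd.DataFrame({"id_number": ["110103198001011234", "110101200502021234"],
                   "age": [18, 19]})

# Characters 7-10 of the ID number hold the birth year.
birth_year = df["id_number"].str[6:10].astype(int)
implied_age = 2024 - birth_year  # the reference year is an assumption

# Flag rows where the stated age and the ID-implied age disagree badly;
# then decide which source to trust based on where each field came from.
conflict = (implied_age - df["age"]).abs() > 1
print(df[conflict])
```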
Besides the cases listed above, there are many logic errors not listed here; handle them at your discretion in practice. Also, this step may be repeated during subsequent analysis and modeling, because even simple problems cannot all be identified in one pass. What we can do is use tools and methods to minimize the likelihood of problems and make the analysis process more efficient.
Step four: Unneeded data cleaning
This step seems very simple: delete the fields you don't need.
But in actual operation there are many pitfalls, for example:
deleting a field that looks unnecessary but is actually important to the business;
keeping a field that feels useful without knowing how to use it, and being unable to decide whether to delete it;
deleting the wrong field in a moment of carelessness.
For the first two cases my advice is: if the data volume is not so large that keeping the field makes processing impossible, then do not delete the field if you can avoid it. For the third case: please back up your data first (sketched below)...
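The cheap insurance policy, sketched with hypothetical file and field names:

```python
import pandas as pd

df = pd.read_csv("customers_clean.csv")  # hypothetical file

# Back up first, then drop: insurance against deleting the wrong field.
df.to_csv("customers_clean.backup.csv", index=False)
df = df.drop(columns=["internal_notes", "legacy_code"])  # hypothetical fields
```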
Step five: Correlation verification
If your data comes from multiple sources, correlation verification is necessary. For example, you have offline car-purchase records and phone-survey records from customer service, both joinable by name and mobile phone number. Then check: for the same person, is the vehicle in the offline purchase record the same car as the vehicle mentioned in the phone survey? If not (don't laugh, a badly designed business process can absolutely produce this!), you need to adjust or remove the data.
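A sketch of that check as a pandas join; all names and values below are made up:

```python
import pandas as pd

offline = pd.DataFrame({"name": ["Wang Lei"], "phone": ["13800138000"],
                        "vehicle": ["Model A"]})
survey = pd.DataFrame({"name": ["Wang Lei"], "phone": ["13800138000"],
                       "vehicle": ["Model B"]})

# Join the two sources on name + phone, then flag rows where the vehicle
# recorded offline differs from the one given in the phone survey.
merged = offline.merge(survey, on=["name", "phone"],
                       suffixes=("_offline", "_survey"))
conflict = merged["vehicle_offline"] != merged["vehicle_survey"]
print(merged[conflict])
```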
Strictly speaking, this already goes beyond the scope of data cleansing, and data changes driven by such joins touch the database model. But I want to remind you: integrating multi-source data is very complex work, so be sure to pay attention to the relationships between data sources and avoid, as far as possible, ending up with contradictory data in your analysis without realizing it.