Foreword
For scientists, engineers, and business analysts who work with data, data analysis is a core task. This is not just for "big data" practitioners; even the data on your laptop's hard drive is worth analyzing. The first step in data analysis is
data cleaning. The raw data may come from many sources, including:
Web server logs
Output from a scientific instrument
Exported results of an online survey
Government data from the 1970s
Reports prepared by business consultants
What these sources have in common is that you can never predict their strange formats. You have to work with the data as it is given to you, but the data is often:
Incomplete (some fields are missing from some records)
Inconsistent (field names and structures vary)
Corrupted (some records may be damaged for various reasons)
Therefore, you will always need to write and maintain cleaning programs that convert raw data into a format that is easy to analyze, a process often called data wrangling. Below are some tips for cleaning data effectively; all of them can be implemented in any programming language.
Use assertions
This is the most important lesson: use assertions to find bugs in your code. Write down your assumptions about the data's format in the form of assertions, and if you find data that contradicts an assertion, revise that assertion.
Are the records ordered? If so, assert it! Does each record have 7 fields? If so, assert it! Is each field an odd number between 0 and 26? If so, assert it! In short, assert everything that can be asserted!
In an ideal world, all records would be neatly formatted and follow a concise internal structure. In practice, this is never the case. Write assertions until your eyes bleed, and then write some more.
Your cleaning program will crash, and crash often. That is good, because every crash means that some bad data contradicts your original assumptions. Iteratively refine your assertions until the program runs through successfully, but keep them as strict as possible; if they are too loose, you may not achieve the effect you want. The worst outcome is not a program that fails, but one that succeeds with results you did not intend.
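The idea can be sketched in Python. This is a minimal example, not a definitive implementation; the record layout (7 fields, a blood-type field, an age field) and the value ranges are invented for illustration:

```python
# A sketch of assertion-driven cleaning: encode every assumption about a
# record as an assert, so that bad data crashes loudly instead of slipping by.

def check_record(record):
    """Raise AssertionError as soon as a record violates an assumption."""
    assert len(record) == 7, f"expected 7 fields, got {len(record)}: {record}"
    blood_type = record[2]
    assert blood_type in ("A", "B", "AB", "O"), f"bad blood type: {blood_type!r}"
    age = int(record[3])
    assert 0 <= age <= 120, f"implausible age: {age}"
    return record

good = ["r1", "x", "AB", "42", "a", "b", "c"]
check_record(good)  # passes silently; a malformed record would raise
```

When a crash reveals that a legitimate record violates an assertion, you loosen that one assertion slightly and run again, rather than removing checks wholesale.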
Don't skip records silently
Some records in the raw data are so incomplete or damaged that your cleaning program has no choice but to skip them. Skipping them silently is not the best approach, because you will not know what data is missing. Instead, it is better to:
Print a warning message, so you can investigate what went wrong later
Count how many records were skipped and how many were cleaned successfully. This gives you a rough sense of the quality of the raw data. For example, if you skipped only 0.5%, that is probably fine. But if you skipped 35%, it is time to look at what is wrong with the data or the code.
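Both points can be combined in a small driver loop. A sketch, assuming the per-record cleaning function signals bad records by raising an exception:

```python
def clean_all(records, clean_one):
    """Clean records, warning about and counting any that must be skipped."""
    cleaned, skipped = [], 0
    for i, record in enumerate(records):
        try:
            cleaned.append(clean_one(record))
        except (ValueError, AssertionError) as err:
            skipped += 1
            print(f"WARNING: skipping record {i}: {err!r}")
    total = len(records)
    # Report the skip rate so data quality is visible at a glance.
    print(f"cleaned {len(cleaned)}/{total}, skipped {skipped} ({skipped / total:.1%})")
    return cleaned, skipped

# Example: parsing integers, where "oops" is a damaged record.
cleaned, skipped = clean_all(["1", "2", "oops", "4"], int)
```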
Use a Set or Counter to store the values of categorical fields and their frequencies
Often, some fields in the data are categorical. For example, blood type can only be A, B, AB, or O. Using an assertion to restrict blood type to one of those four values is good. But if a field has many possible values, especially values you may not anticipate, an assertion will not work. In that case it is easier to store the values in a counter data structure. By doing so you can:
Print a message whenever you encounter a new value you did not expect for a given field.
Audit the values in retrospect after cleaning. For example, if someone mistakenly entered blood type C, it will be easy to spot.
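Python's `collections.Counter` fits this directly. A sketch, with a made-up record layout and one deliberate typo in the data:

```python
from collections import Counter

records = [
    {"name": "ann", "blood_type": "A"},
    {"name": "bob", "blood_type": "O"},
    {"name": "cat", "blood_type": "C"},  # a typo we want to catch
    {"name": "dan", "blood_type": "A"},
]

expected = {"A", "B", "AB", "O"}
blood_types = Counter()
for record in records:
    value = record["blood_type"]
    # Warn the first time an unexpected value appears.
    if value not in expected and value not in blood_types:
        print(f"note: unexpected blood type {value!r}")
    blood_types[value] += 1

# blood_types now holds every observed value and its frequency,
# ready for a retrospective audit.
```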
Breakpoint cleaning
If you have a lot of raw data to clean, it may take a long time to finish in one pass: 5 minutes, 10 minutes, an hour, or even days. In practice, cleaning often fails partway through.
Suppose you have one million records, and your cleaning program crashes on record 325392 because of some anomaly. You fix the bug and run it again, and the program has to re-clean records 1 through 325391: wasted work. Instead, you can do this:
1. Have your cleaning program print the number of the record it is currently cleaning, so that if it crashes you know exactly which record failed.
2. Let your program resume from a checkpoint, so that on the re-run it can start directly from record 325392.
The re-run may crash again; just fix the new bug and resume once more from the record that caused the crash.
After all the records have been cleaned, run the cleaning once more from the beginning, because the bug fixes you made along the way may change how earlier records are cleaned. Cleaning twice guarantees nothing is missed. But in general, checkpoints save a lot of time, especially while you are debugging.
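The two steps above can be sketched as follows. The checkpoint file name is a made-up example, and for simplicity this writes the checkpoint after every record; in practice you would write it only every few thousand records to keep the overhead down:

```python
import os

CHECKPOINT = "cleaning.checkpoint"  # hypothetical checkpoint file name

def load_checkpoint():
    """Return the record index to resume from (0 if no checkpoint exists)."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return int(f.read())
    return 0

def clean_with_checkpoint(records, clean_one):
    start = load_checkpoint()
    for i in range(start, len(records)):
        print(f"cleaning record {i}")  # step 1: always know where you are
        clean_one(records[i])
        with open(CHECKPOINT, "w") as f:
            f.write(str(i + 1))  # step 2: remember where to resume
    os.remove(CHECKPOINT)  # finished: the next run starts from scratch
```

If the program crashes mid-run, the checkpoint file survives, and the next invocation picks up at the failed record instead of record 0.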
Test on some data
Do not try to clean all the data at once. When you first start writing cleaning code and debugging it, test on a small subset, then expand that subset and test again. The point is to let your cleaning program finish on the test set quickly, say in a few seconds, which saves you time during repeated testing.
But be aware that a test subset often fails to cover the exotic records, because exotic records are rare by nature.
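One simple way to get a test subset, sketched here with a fixed seed so that repeated debugging runs see the same records (the function name and fraction are illustrative):

```python
import random

def sample_subset(records, fraction, seed=0):
    """Draw a reproducible random subset of records for fast test runs."""
    rng = random.Random(seed)  # fixed seed: same subset every run
    k = max(1, int(len(records) * fraction))
    return rng.sample(records, k)

records = list(range(1000))       # stand-in for real records
subset = sample_subset(records, 0.01)  # 10 records instead of 1000
```

Random sampling still shares the weakness noted above: rare, exotic records are unlikely to land in a 1% sample, so a full run remains necessary at the end.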
Print the cleaning log to a file
When running the cleaning program, print the cleaning log and error messages to a file, so that you can easily inspect them later with a text editor.
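In Python this can be done with the standard `logging` module; a minimal sketch, where the log file name is a made-up example:

```python
import logging

# A dedicated logger that writes to a file instead of the console,
# so the full run can be reviewed in a text editor afterwards.
logger = logging.getLogger("cleaning")
logger.setLevel(logging.INFO)
handler = logging.FileHandler("cleaning.log", mode="w")
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
logger.addHandler(handler)

logger.info("started cleaning")
logger.warning("record 42 has a missing field; skipped")
handler.flush()  # make sure everything is on disk
```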
Optional: Store the original data together
This tip is useful when you do not need to worry about storage space: keep the original data as an extra field in each cleaned record. Then, if you later find that some record is wrong, you can see directly what the original data looked like, which makes debugging much easier.
The downside is that this doubles the storage requirement and can make some cleaning operations slower, so it only applies when efficiency allows.
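A sketch of the idea, using an invented two-field, comma-separated record format:

```python
def clean_record(raw_line):
    """Parse a raw comma-separated line, keeping the original text alongside."""
    name, age = raw_line.strip().split(",")
    return {
        "name": name.strip(),
        "age": int(age),
        "raw": raw_line,  # the untouched input, kept for later debugging
    }

rec = clean_record(" Ann , 42\n")
```

If `rec` later turns out to be suspicious, `rec["raw"]` shows exactly what the cleaning code was given.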
The last point: verify the data after cleaning
Remember to write a verification program that checks that the clean data produced by your cleaning program is in the format you expect. You cannot control the format of the raw data, but you can control the format of the clean data, so make sure it really matches your expectations.
This is very important, because once cleaning is complete, the next steps will operate directly on the clean data; unless it is absolutely necessary, you will never touch the raw data again. Therefore, make sure the data is clean enough before you start the analysis. Otherwise you may get wrong analysis results, and by then it will be very hard to discover mistakes that were made long ago during cleaning.
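A verification program can be as simple as a loop that checks every invariant the analysis stage will rely on. A sketch, reusing the invented blood-type/age record layout from earlier examples:

```python
def verify(clean_records):
    """Return a list of problems; an empty list means the data passed."""
    problems = []
    for i, rec in enumerate(clean_records):
        if set(rec) != {"name", "age", "blood_type"}:
            problems.append(f"record {i}: wrong fields {sorted(rec)}")
        elif not isinstance(rec["age"], int) or not 0 <= rec["age"] <= 120:
            problems.append(f"record {i}: bad age {rec['age']!r}")
        elif rec["blood_type"] not in {"A", "B", "AB", "O"}:
            problems.append(f"record {i}: bad blood type {rec['blood_type']!r}")
    return problems

good = [{"name": "Ann", "age": 42, "blood_type": "A"}]
bad = [{"name": "Bob", "age": 42, "blood_type": "C"}]
```

Run `verify` as a final gate: refuse to start the analysis until it returns an empty list.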