More than two years of contact with Python, and has never been independent of Python to complete a project, said ashamed. Recently because of work needs, with Excel and Oracle collation data seems to be not good, and then turn to Python, of course, stepped on a lot of pits, record down, to avoid the future into the pit, after all, not commonly used, good scar will forget the pain ...
Business Scenario:
The leader took a few Excel, a total of 150W insurance data, need to follow the specific rules to filter out the data to meet the criteria.
Fields: Business organization, policy number, Case number, insured, code 1, the vehicle number, VIN code, driver, telephone, Chuxian time, Chuxian, maintenance enterprise, fixed loss amount, three car number, three Vin, three drivers, code 2, three maintenance enterprises, survey fixed loss personnel;
150W data does not have a unique identification field: The same case number corresponds to a standard car number, 0-Multiple three car number, a marked car number corresponding to one or more policy number, so need to pass the marked car number, policy number, the report number three fields uniquely determine a claim record.
Filter rules:
Telephone frequency is greater than or equal to 3 times (within 1 years); Frame number (subject and three is greater than or equal to 3 times (within 1 years; the driver's name (subject and three) is equal to 3 times (within 1 years), the subject and the three maintenance units are the same workshop; the same car number or frame number Chuxian two times within 10th; 6:00 the case, do the risk identification.
Given the data, there are several questions:
- Partial record data is incomplete and the overall missing rate is low
- Partial field information entry error, for example, a character in the phone field, 11 digits in the time field (suspected phone information)
- Have duplicate data
- Excel data table column names are inconsistent
Summary of issues:
This is a simple data filtering job!
But just work not busy, just also want to take data to practice practiced hand, so began to practice up ~
before you do, you need to get a full understanding of the data through a variety of ways, otherwise, waste time!
By asking the data source to determine the relationship between variables, using common sense to judge the value of each variable, through exploratory analysis to understand the loss/value of each variable, results-oriented analysis of data cleaning process may encounter problems.
Problem decomposition:
- Data is stored in multiple Excel tables, and the data is read into a variable
- Organize data by filter criteria
- Output filtering data
Code section, I want to put in another article inside ~
Python Basic Data cleansing