Python Basics: Data Cleansing

Source: Internet
Author: User

I have been using Python on and off for more than two years, yet I am ashamed to say I have never completed a project with it on my own. Recently a task at work required organizing some data, and Excel and Oracle did not seem up to the job, so I turned to Python. Naturally I stepped into plenty of pits along the way. I am writing them down so I can avoid them next time; I don't use Python often, and once the scar heals it is easy to forget the pain...

Business Scenario:

  My manager handed me several Excel files holding about 1.5 million insurance records in total, and I needed to filter out the records that meet a specific set of rules.

Fields: business organization, policy number, case number, insured, code 1, insured vehicle plate number, VIN, driver, telephone, time of loss, loss occurrence (Chuxian), repair shop, assessed loss amount, third-party plate number, third-party VIN, third-party driver, code 2, third-party repair shop, survey and loss-assessment staff.

The 1.5 million records have no single unique identifier field. One case number corresponds to one insured vehicle plate number and zero or more third-party plate numbers, and one insured vehicle plate number corresponds to one or more policy numbers. A claim record therefore has to be identified by three fields together: insured vehicle plate number, policy number, and case number.
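A minimal sketch of that three-field key, using only the standard library. The field names (`plate_no`, `policy_no`, `case_no`) and the sample rows are hypothetical, since the article does not give the real column headers. Rows that share a key are only candidates for duplicate review, not proven duplicates.

```python
from collections import Counter

# Hypothetical field names; the real column headers are not given in the article.
KEY_FIELDS = ("plate_no", "policy_no", "case_no")

def claim_key(record):
    """Build the composite key that identifies one claim record."""
    return tuple(str(record[f]).strip() for f in KEY_FIELDS)

def find_repeated_keys(records):
    """Return the composite keys that occur on more than one row."""
    counts = Counter(claim_key(r) for r in records)
    return {key for key, n in counts.items() if n > 1}

# Made-up sample rows for illustration only.
records = [
    {"plate_no": "A123", "policy_no": "P1", "case_no": "C1"},
    {"plate_no": "A123", "policy_no": "P2", "case_no": "C1"},  # same case, different policy
    {"plate_no": "A123", "policy_no": "P1", "case_no": "C1"},  # repeats the first row
]
repeated = find_repeated_keys(records)
```

Using a tuple as the key keeps the three fields separate, so two different combinations cannot collide the way naively concatenated strings could.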

Filter rules:

    1. The telephone number appears 3 or more times (within 1 year)
    2. The VIN (insured or third-party) appears 3 or more times (within 1 year)
    3. The driver's name (insured or third-party) appears 3 times (within 1 year)
    4. The insured vehicle and the third-party vehicle are repaired at the same shop
    5. The same plate number or VIN has two losses within 10 days
    6. For cases at 6:00, perform risk identification
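The frequency rules and the 10-day rule share one shape: a value occurs at least k times inside some time window. The sketch below is one possible way to express that, not the author's code. It assumes one row per claim, a hypothetical `loss_date` field that already holds `datetime.date` values, and hypothetical `phone` and `plate_no` field names; "within 1 year" is read as any span of 365 days.

```python
from collections import defaultdict
from datetime import date, timedelta

def values_hit_k_times(records, field, k, window_days):
    """Return values of `field` that occur at least k times within any
    span of `window_days` days, judged by each record's loss date."""
    dates_by_value = defaultdict(list)
    for rec in records:
        value = rec.get(field)
        if value:  # skip missing values
            dates_by_value[value].append(rec["loss_date"])

    window = timedelta(days=window_days)
    flagged = set()
    for value, dates in dates_by_value.items():
        dates.sort()
        # Compare each date with the one k-1 positions later in sorted order.
        for i in range(len(dates) - k + 1):
            if dates[i + k - 1] - dates[i] <= window:
                flagged.add(value)
                break
    return flagged

# Made-up sample rows for illustration only.
records = [
    {"phone": "13800000000", "plate_no": "A1", "loss_date": date(2023, 1, 5)},
    {"phone": "13800000000", "plate_no": "B2", "loss_date": date(2023, 3, 1)},
    {"phone": "13800000000", "plate_no": "B2", "loss_date": date(2023, 3, 8)},
    {"phone": "13900000000", "plate_no": "C3", "loss_date": date(2023, 6, 1)},
]

frequent_phones = values_hit_k_times(records, "phone", k=3, window_days=365)
repeat_plates = values_hit_k_times(records, "plate_no", k=2, window_days=10)
```

Rules that span the insured and third-party columns (VIN, driver name) would need both columns pooled into one list of values before counting; that step is left out here.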

  The data as delivered has several problems:

    1. Some records are incomplete, although the overall missing rate is low
    2. Some fields contain entry errors, for example letters in the phone field, or 11-digit values in the time field (probably misplaced phone numbers)
    3. There are duplicate records
    4. Column names are inconsistent across the Excel sheets
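Checks for the first two problems can be sketched with the standard library alone. The field names `phone` and `loss_time` are hypothetical, and the validity rule (a phone number is exactly 11 digits) is an assumption drawn from the remark about 11-digit values above.

```python
import re

PHONE_RE = re.compile(r"\d{11}")  # assumption: a valid phone is exactly 11 digits

def missing_rate(records, field):
    """Fraction of records whose `field` is absent or blank."""
    if not records:
        return 0.0
    missing = sum(1 for r in records if not str(r.get(field) or "").strip())
    return missing / len(records)

def invalid_phones(records):
    """Records whose non-empty phone value is not 11 digits."""
    return [r for r in records
            if r.get("phone") and not PHONE_RE.fullmatch(str(r["phone"]).strip())]

def misplaced_phones(records):
    """Records whose loss-time field looks like a phone number."""
    return [r for r in records
            if PHONE_RE.fullmatch(str(r.get("loss_time") or "").strip())]

# Made-up sample rows for illustration only.
records = [
    {"phone": "13800000000", "loss_time": "2023-01-05 08:30"},
    {"phone": "1380000000a", "loss_time": "2023-02-01 10:00"},  # letter in phone
    {"phone": "", "loss_time": "13900000000"},                  # phone in time field
    {"phone": None, "loss_time": "2023-03-01 09:00"},
]
phone_missing = missing_rate(records, "phone")
bad_phone_rows = invalid_phones(records)
misplaced_rows = misplaced_phones(records)
```

Reporting the suspicious rows rather than silently fixing them leaves the decision about each one to whoever knows the data source.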

Summary of issues:

This is a simple data filtering job!

  But work happened to be quiet, and I wanted some real data to practice on, so I started practicing ~

Before you start, get a thorough understanding of the data in as many ways as you can; otherwise you will waste time!

Ask the data provider how the variables relate to each other, use common sense to judge what values each variable can take, run exploratory analysis to see the missing values and value distribution of each variable, and work backward from the desired result to anticipate problems the cleaning process may run into.

Problem decomposition:

    1. The data is spread across multiple Excel sheets and must be read into a single variable
    2. Organize the data according to the filter rules
    3. Output the filtered data
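The three steps above can be sketched as follows. This is an illustration under stated assumptions, not the author's implementation: each sheet is assumed to be already loaded as a list of row dicts (reading the Excel files themselves needs a third-party library and is not shown), the header variants in `COLUMN_ALIASES` are hypothetical, and the output is written as CSV text.

```python
import csv
import io

# Hypothetical header variants mapped to one canonical name.
COLUMN_ALIASES = {
    "policy no": "policy_no",
    "policy number": "policy_no",
    "case no": "case_no",
    "case number": "case_no",
    "plate": "plate_no",
    "vehicle number": "plate_no",
}

def normalize_row(row):
    """Rename a row's columns to canonical names; unknown headers pass through."""
    out = {}
    for header, value in row.items():
        key = header.strip().lower()
        out[COLUMN_ALIASES.get(key, key)] = value
    return out

def combine_tables(tables):
    """Step 1: merge several tables (lists of row dicts) into one list."""
    return [normalize_row(row) for table in tables for row in table]

def filter_rows(rows, predicate):
    """Step 2: keep only the rows that satisfy a rule."""
    return [row for row in rows if predicate(row)]

def to_csv_text(rows, fieldnames):
    """Step 3: render the selected rows as CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

# Two made-up sheets with inconsistent headers.
sheet_a = [{"Policy No": "P1", "Case No": "C1", "Plate": "A1"}]
sheet_b = [{"policy number ": "P2", "case number": "C2", "Vehicle Number": "B2"}]

rows = combine_tables([sheet_a, sheet_b])
selected = filter_rows(rows, lambda r: r["plate_no"] == "B2")
csv_text = to_csv_text(selected, ["plate_no", "policy_no", "case_no"])
```

Normalizing headers while combining means every later rule can refer to one canonical column name, whichever sheet a row came from.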

As for the code itself, I plan to put it in a separate article ~
