Python Data Processing in Practice (Must-Read)

Source: Internet
Author: User

I. Runtime Environment

1. Python version: 2.7.13. All code in this post targets this version.
2. System environment: Windows 7, 64-bit

II. Processing messy text data

Part of the data is shown below. The first field is the original field, and the last three are the cleansed fields. Looking at the aggregated values of this field in the database, the data at first glance seemed to follow similar rules (an amount, a unit of wan yuan, and the currency RMB; 1 wan = 10,000). My first plan was therefore to write SQL conditions that convert everything to a unified unit of wan yuan, and to extract the substrings with an SQL script.

It turned out that the data is not regular, and cleaning it properly needs far too many conditions: some amounts do not appear in front of the left bracket, some values have no currency, some numbers are not integers, and some have no character for wan. Writing an SQL script that stores such a value as separate number, unit, and currency fields would be very complicated. I also could not find a MySQL function that extracts the digits from a text value; regular expressions there are mostly used in the where condition. If anyone knows a MySQL function that filters the digits out of text, please tell me, so that this does not have to take so much time. Kettle is also a tool well worth learning for this kind of work.

Drawing on my Python experience: Python's built-in functions make it easy to filter strings, and the code later in this post uses that approach to filter the text.

[Figure: first part of the data cleansing]

III. Thinking through the overall data processing logic

When you get the data, do not rush to write code. Think through the cleaning logic first. This is critical: heading in the right direction gets twice the result with half the effort. The rest of the time goes to implementing that logic in code and debugging it.

3.1 Think first, without writing code:

The end result I want is to turn the capital field into [amount + unit + currency], or into [amount + unit + unified RMB currency] (with foreign currencies converted at the exchange rate). This takes two or three steps.

3.1.1 Split into three fields: number, unit, and currency

(The unit is either wan or not wan, and the currency is either RMB or a specific foreign currency.)

3.1.2 Convert the unit to wan

If the unit from step 1 is not wan, the amount becomes number / 10000. If the unit is wan, the number stays unchanged.

3.1.3 Unify the currency to RMB

If the currency is RMB, the first two fields stay unchanged. Otherwise, the amount becomes number * the exchange rate of that foreign currency to RMB, and the unit stays as it was after the second step.

3.2 List the expected cleaning result of each step:

Starting from the target result, we work through it step by step to sort out the cleansing logic.

3.2.1 Expected result of the first cleaning: split into three fields (number, unit, currency)

① Field value = "2000 yuan RMB", after the first cleaning:
2000, not wan, RMB
② Field value = "2000 wan yuan RMB", after the first cleaning:
2000, wan, RMB
③ Field value = "2000 wan yuan foreign currency", after the first cleaning:
2000, wan, foreign currency

3.2.2 Expected result of the second cleaning: normalize the unit to wan

# Second-pass condition
case when unit = 'wan' then amount else amount / 10000 end as second_amount
① Field value = "2000 yuan RMB"
0.2, wan, RMB
② Field value = "2000 wan yuan RMB"
2000, wan, RMB
③ Field value = "2000 wan yuan foreign currency"
2000, wan, foreign currency
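
The second-pass condition above can be sketched in Python as follows. This is a minimal sketch, not the author's code: the function name `to_wan` and the unit labels 'wan' / 'not wan' are assumptions for illustration, and `Decimal` is used so that dividing by 10000 does not introduce binary floating-point noise.

```python
from decimal import Decimal

WAN = Decimal(10000)  # 1 wan = 10,000


def to_wan(amount, unit):
    """Return the amount expressed in wan (hypothetical helper).

    Mirrors: case when unit = 'wan' then amount else amount / 10000 end
    """
    value = Decimal(str(amount))
    if unit == 'wan':
        return value
    return value / WAN


print(to_wan('2000', 'not wan'))  # 0.2
print(to_wan('2000', 'wan'))      # 2000
```

Passing the amount as a string keeps values such as '20.01' exact, which matters because the cleansed number is expected to mean the same as the original.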

Note: if this meets your requirements, cleaning is complete. If you also want to unify the currency to RMB, perform the third cleaning below:

3.2.3 Expected result of the third cleaning: unit wan, currency RMB

If the final requirement is to unify the currency to RMB, we only need to add one condition on top of the second cleaning:

# Third-pass condition
case when currency = 'RMB' then amount else amount * exchange_rate_to_rmb end as third_amount
① Field value = "2000 yuan RMB"
0.2, wan, RMB
② Field value = "2000 wan yuan RMB"
2000, wan, RMB
③ Field value = "2000 wan yuan foreign currency"
2000 * (exchange rate of the foreign currency to RMB), wan, RMB
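
A possible Python counterpart of the third-pass condition is sketched below. Everything named here is hypothetical: `to_rmb`, the `EXAMPLE_RATES_TO_RMB` table, and above all the rate value, which is a made-up number for illustration and not a real exchange rate.

```python
from decimal import Decimal

# Made-up rate for illustration only; real rates must come from a trusted
# source and a chosen reference date.
EXAMPLE_RATES_TO_RMB = {
    'dollar': Decimal('7'),
}


def to_rmb(amount_wan, currency, rates=EXAMPLE_RATES_TO_RMB):
    """Convert an amount in wan to RMB, still in wan (hypothetical helper).

    Mirrors: case when currency = 'RMB' then amount
             else amount * exchange_rate_to_rmb end
    """
    value = Decimal(str(amount_wan))
    if currency == 'RMB':
        return value
    if currency not in rates:
        # Fail loudly rather than silently treating an unknown currency as RMB.
        raise KeyError('no exchange rate for currency: %s' % currency)
    return value * rates[currency]


print(to_rmb('0.2', 'RMB'))      # 0.2
print(to_rmb('2000', 'dollar'))  # 14000 with the made-up rate of 7
```

Raising on an unknown currency is a design choice of this sketch: a missing rate then surfaces during the run instead of producing a plausible but wrong RMB amount.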

IV. Thinking through the logic of the actual code

Currency and unit are two separate cases.

4.1 Currency

This condition is simple: if a currency name appears in the string, set the new currency field to that currency.
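
The currency rule can be written as a lookup over a keyword table instead of a long if/elif chain. The sketch below is one way to do it under stated assumptions: the keyword strings and labels are illustrative English stand-ins for whatever tokens actually appear in the data, and `detect_currency` is a hypothetical name.

```python
# (keyword, currency label) pairs. Both columns are illustrative assumptions.
CURRENCY_KEYWORDS = [
    ('Hong Kong dollar', 'Hong Kong dollar'),
    ('Australian dollar', 'Australian dollar'),
    ('Canadian dollar', 'Canadian dollar'),
    ('dollar', 'dollar'),
    ('euro', 'euro'),
    ('yen', 'yen'),
]


def detect_currency(info, default='RMB'):
    """Return the label of the first matching keyword, else the default."""
    # Check longer keywords first so that the generic 'dollar' does not
    # shadow a more specific name such as 'Hong Kong dollar'.
    ordered = sorted(CURRENCY_KEYWORDS, key=lambda pair: len(pair[0]),
                     reverse=True)
    for keyword, currency in ordered:
        if keyword in info:
            return currency
    return default


print(detect_currency('2000 wan Hong Kong dollar'))  # Hong Kong dollar
print(detect_currency('2000 wan yuan'))              # RMB
```

With a table, supporting a newly observed foreign currency means adding one row rather than another branch.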

4.2 Unit (wan)

This condition is also simple: if the character for wan appears in the string, the unit variable = 'wan'; otherwise the unit variable = 'not wan'. Writing the condition this way makes the second-pass processing of the number easier.

4.3 Make sure the numeric part is logically identical to the original value after cleaning

After cleaning, the number must mean the same as the original value. For example, the field value "300.01 wan yuan RMB" is cleaned correctly only if it becomes 300.01, wan, RMB.

filter(str.isdigit, field_value) extracts the digits from the text. However, the group by aggregation of the field shows that some values contain a decimal point, and the extracted value loses it. For example, filter(str.isdigit, '20.01 wan') returns 2001, which is obviously wrong. So the decimal point has to be taken into account.
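
The pitfall and one way around it can be sketched as follows. This is not the author's code: it is written so that it also runs on Python 3, where `filter(str.isdigit, text)` returns an iterator rather than a string, so the digits are joined explicitly. The names `digits_only` and `extract_amount` are hypothetical, and the guard for a value with no digits after the point is an addition of this sketch.

```python
def digits_only(text):
    """Keep only the digit characters of text."""
    return ''.join(ch for ch in text if ch.isdigit())


def extract_amount(info):
    """Digits of info with the first decimal point preserved (sketch)."""
    if '.' in info:
        whole, fraction = info.split('.', 1)
        whole, fraction = digits_only(whole), digits_only(fraction)
        # Keep the fractional part only when it is non-zero.
        if fraction and int(fraction) > 0:
            return (whole or '0') + '.' + fraction
        return whole or '0'
    return digits_only(info) or '0'


print(digits_only('20.01 wan'))     # 2001 (decimal point lost)
print(extract_amount('20.01 wan'))  # 20.01
```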

4. Main code for the first cleaning, tested without reading the database

Take about 10 irregular values from the database for testing. info is the value of the regCapital field.

    # Keyword strings below are English renderings of the tokens in the data.
    # Split at the decimal point and merge the digits on each side.
    if '.' in info and int(filter(str.isdigit, info.split('.')[1])) > 0:
        derive_regcapital = filter(str.isdigit, info.split('.')[0]) + '.' + filter(str.isdigit, info.split('.')[1])
    elif '.' in info and int(filter(str.isdigit, info.split('.')[1])) == 0:
        derive_regcapital = filter(str.isdigit, info.split('.')[0])
    elif filter(str.isdigit, info) == '':
        derive_regcapital = '0'
    else:
        derive_regcapital = filter(str.isdigit, info)

    # Unit
    if 'wan' in info:
        derive_danwei = 'wan'
    else:
        derive_danwei = 'not wan'

    # Currency: the first cleaning keeps the foreign currency as is.
    # Aggregating the field over a large amount of data shows that foreign
    # currencies generally appear in the following forms. If a new foreign
    # currency shows up, add a branch here.
    # Specific names are checked before the generic 'dollar'.
    if 'Hong Kong dollar' in info:
        derive_currency = 'Hong Kong dollar'
    elif 'Australian dollar' in info:
        derive_currency = 'Australian dollar'
    elif 'Canadian dollar' in info:
        derive_currency = 'Canadian dollar'
    elif 'Singapore' in info:
        derive_currency = 'Singapore dollar'
    elif 'dollar' in info:
        derive_currency = 'dollar'
    elif 'afghani' in info:
        derive_currency = 'afghani'
    elif 'pound' in info:
        derive_currency = 'pound'
    elif 'yen' in info:
        derive_currency = 'yen'
    elif 'franc' in info:
        derive_currency = 'franc'
    elif 'euro' in info:
        derive_currency = 'euro'
    else:
        derive_currency = 'RMB'
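
As an alternative to filtering digits on each side of the decimal point, the number can be pulled out with a regular expression. This is a swapped-in technique, not the author's method, and it behaves differently: it returns only the first number in the text and keeps trailing zeros. The pattern and the name `first_number` are assumptions of this sketch.

```python
import re

# A digit, then any mix of digits and thousands separators, then an
# optional fractional part.
NUMBER_RE = re.compile(r'\d[\d,]*(?:\.\d+)?')


def first_number(info):
    """Return the first number in info as a string, or '0' if none."""
    match = NUMBER_RE.search(info)
    if match is None:
        return '0'
    # Drop thousands separators such as the comma in '1,000.50'.
    return match.group(0).replace(',', '')


for sample in ['(RMB) 1,000.50 wan', '20.01 wan', '2000 wan', 'unknown']:
    print(sample, '->', first_number(sample))
```

Running both approaches over the same handful of irregular test values is a cheap way to spot inputs on which they disagree before touching the full table.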

5. Full code: read the database data for full cleaning

In step 4, I tested some data to verify that the code is correct. Now the logic has to be scaled up: the info variable takes, in turn, every value in the database, so that the full data set is cleaned.

    # coding: utf-8
    from class_mysql import Mysql

    project = Mysql('s_58infor_data', [], 0, conn_type='local')
    p2 = Mysql('etl1_58infor_data', [], 24, conn_type='local')
    field_list = p2.select_fields(db='local_db', table='etl1_58infor_data')
    print field_list
    project2 = Mysql('etl1_58infor_data', field_list=field_list, field_num=26, conn_type='local')
    # I have two database environments, test and production, with different
    # connections and network segments, so different parameters are passed to
    # switch the database and the connection. Data processing needs frequent
    # testing, so the connection is wrapped in a class that is easy to call.

    # Mysql is my own database helper class. select() reads all fields of
    # the table and returns a tuple (immutable). All fields of the old table
    # are kept during cleaning, and three cleansed fields are added.
    data_tuple = project.select(db='local_db', id=0)

    # Use a dict to hold each field value, then insert it into
    # etl1_58infor_data, the table with the three cleansed fields added.
    for data in data_tuple:
        item = {}
        # old_data drops the last field, which is replaced by the current
        # processing time. This shows how long a full run takes, which helps
        # when scheduling the second cleaning and the kettle jobs.
        # The tuple is converted to a list because tuples are immutable and
        # null values are replaced with empty strings below.
        old_data = list(data[:-1])
        if data[-2]:
            info = data[-2].encode('utf-8')
        else:
            info = ''

        if '.' in info and int(filter(str.isdigit, info.split('.')[1])) > 0:
            derive_regcapital = filter(str.isdigit, info.split('.')[0]) + '.' + filter(str.isdigit, info.split('.')[1])
        elif '.' in info and int(filter(str.isdigit, info.split('.')[1])) == 0:
            derive_regcapital = filter(str.isdigit, info.split('.')[0])
        elif filter(str.isdigit, info) == '':
            derive_regcapital = '0'
        else:
            derive_regcapital = filter(str.isdigit, info)

        if 'wan' in info:
            derive_danwei = 'wan'
        else:
            derive_danwei = 'not wan'

        # Specific names are checked before the generic 'dollar'.
        if 'Hong Kong dollar' in info:
            derive_currency = 'Hong Kong dollar'
        elif 'Australian dollar' in info:
            derive_currency = 'Australian dollar'
        elif 'Canadian dollar' in info:
            derive_currency = 'Canadian dollar'
        elif 'Singapore' in info:
            derive_currency = 'Singapore dollar'
        elif 'dollar' in info:
            derive_currency = 'dollar'
        elif 'afghani' in info:
            derive_currency = 'afghani'
        elif 'pound' in info:
            derive_currency = 'pound'
        elif 'yen' in info:
            derive_currency = 'yen'
        elif 'franc' in info:
            derive_currency = 'franc'
        elif 'euro' in info:
            derive_currency = 'euro'
        else:
            derive_currency = 'RMB'

        time_58infor_data = p2.create_time()
        old_data.append(time_58infor_data)
        old_data.append(derive_regcapital)
        old_data.append(derive_danwei)
        old_data.append(derive_currency)
        # print len(old_data)
        for i in range(len(old_data)):
            if not old_data[i]:
                old_data[i] = ''
            data2 = old_data[i].replace('"', '')
            item[i + 1] = data2
        print item[1]
        # Insert into the new table
        project2.insert(item=item, db='local_db')
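
The final loop of the script, which turns one row into the dict passed to insert, can be isolated as a small function. The sketch below assumes Python 3 and a hypothetical name `row_to_item`. It follows the loop in replacing empty values with '' and stripping double quotes, and it adds one thing the loop does not do: converting non-string values with `str()` so that numeric fields do not fail on `.replace`.

```python
def row_to_item(row):
    """Map a row (sequence of field values) to {1-based position: text}."""
    item = {}
    for index, value in enumerate(row, start=1):
        # Falsy values (None, '', 0) become '', as in the loop;
        # everything else is converted to text before stripping quotes.
        text = '' if not value else str(value)
        item[index] = text.replace('"', '')
    return item


print(row_to_item((7, None, 'ACME "Ltd"', '2000')))
```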

6. Running the code

6.1 read the original table data of the database and the fields created in the new table.

6.2 insert a new table and perform the first data cleansing

The red box marks the cleansed part; the other data has been desensitized.

6.3 data cleansing results of data tables

VII. Incremental data processing

Because new data arrives every day, after the initial full run we need to use the timestamp field in the table to read only the data added yesterday, clean it, and insert it. That part is left for the next blog post.

The initial plan is to use the functions below as the parameters that select the increment. create_time is the time at which the crawler script writes the data, and yesterday is yesterday's date. Restricting create_time to yesterday in the where condition retrieves the rows that entered the database yesterday, and the script is run by the scheduled tasks that win7 supports.

    import datetime
    from datetime import datetime as dt

    # A literal % in a format string is escaped as %%.
    # Mainly used to build the SQL condition
    # "where create_time like '%s%%'" % yesterday

    # Current time when the script writes data
    def create_time(self):
        create_time = dt.now().strftime('%Y-%m-%d %H:%M:%S')
        return create_time

    def yesterday(self):
        yesterday = datetime.date.today() - datetime.timedelta(days=1)
        return yesterday
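
One way to turn yesterday's date into the where condition is sketched below. It assumes that create_time is stored in the 'YYYY-MM-DD HH:MM:SS' form produced by create_time above, and that the query is run through a MySQL driver that uses %s placeholders. The function names and the table name in the usage line are illustrative; the query is only built here, not executed.

```python
import datetime


def yesterday_like_pattern(today=None):
    """Return a LIKE pattern such as '2017-05-01%' for yesterday's date."""
    if today is None:
        today = datetime.date.today()
    day = today - datetime.timedelta(days=1)
    return day.strftime('%Y-%m-%d') + '%'


def build_incremental_query(table, today=None):
    # The table name is interpolated directly, so it must come from trusted
    # code, never from user input. The pattern is passed as a parameter,
    # which avoids having to escape the % of the LIKE pattern by hand.
    sql = 'SELECT * FROM {0} WHERE create_time LIKE %s'.format(table)
    return sql, (yesterday_like_pattern(today),)


sql, params = build_incremental_query('s_58infor_data')
print(sql)
print(params)
```

Passing `today` explicitly makes the function easy to test and lets a missed run be repeated for a specific date.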

That is all for this Python data processing walkthrough. I hope it serves as a useful reference.
