Python data processing in practice

Source: Internet
Author: User

I. Operating environment

1. Python version 2.7.13; the code in this post is written for this version
2. System environment: Win7 64-bit

II. The messy text data to be processed

Part of the data is shown below: the first field is the original field, and the three that follow are the cleaned fields. Looking at a GROUP BY aggregation of the field in the database, the data seemed fairly regular at first glance, in a form like "(currency) amount 万", where 万 means ten thousand. So I planned to write conditional logic in SQL, cut the string with SQL string functions, and convert everything to the unit "万元" (ten thousand yuan).

Later I found that the data is not regular: too many conditions would be needed, and the cleaning quality could not be guaranteed. Some values do not start with a left parenthesis, some have no currency, some numbers are not integers, and some do not contain the character 万. Storing the number and the unit "万元" in two separate fields would therefore make the SQL script complicated. I did not find a MySQL function that extracts numbers from text; regular expressions are mostly used in WHERE conditions, much like LIKE. If anyone knows a MySQL function that extracts numbers from text, please tell me, so that this much effort is not needed. I use this together with the Kettle tool; whichever tool does the job most handily is the best one.

Drawing on my experience with Python: it has many functions for filtering strings, and the code later in this post uses one of them to filter the text.


(Figure: part of the data after the first cleaning)

III. The high-level logic of the data processing

When you get the data, do not rush into writing code. First think through the cleaning logic; this is critical. Once the direction is right, the rest of the time goes to implementing the logic in code and debugging it.

3.1 Thinking it through before writing code:

The final goal of the cleaning is to convert the amount field into the combined form "amount + unit + original currency", or "amount + unit + unified RMB currency" (converted by exchange rate). Two or three steps are enough.

3.1.1 Split into three fields: number, unit, currency

(The unit is either 万, meaning ten thousand, or "不含万", meaning the value is not expressed in 万; the currency is either RMB or a specific foreign currency)

3.1.2 Unify the unit to 万

Where the unit from the first step is not 万, divide the number by 10000; where it is 万, leave the number unchanged.

3.1.3 Convert the currency to RMB

Where the currency is RMB, the first two fields stay unchanged. Otherwise the number becomes number * exchange rate from the foreign currency to RMB, and the unit remains the 万 unified in the second step.

3.2 The cleaning results, stage by stage:

Starting from the expected results, we take the steps apart and first sort out the cleaning logic.

3.2.1 Expected result of the first cleaning: split into the three fields number, unit, currency

① field value = "2000 RMB", after the first cleaning
2000    不含万 (not in 万)    RMB
② field value = "20 million RMB" (that is, 2000 万), after the first cleaning
2000    万    RMB
③ field value = "20 million yuan foreign currency" (that is, 2000 万), after the first cleaning
2000    万    foreign currency

3.2.2 Expected result of the second cleaning: the unit is unified to 万

# Second-pass condition: case when unit = '万' then amount else amount / 10000 end as second_amount

① field value = "2000 RMB"
0.2    万    RMB
② field value = "20 million RMB"
2000    万    RMB
③ field value = "20 million yuan foreign currency"
2000    万    foreign currency
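The second-pass condition can also be written in Python. The following is only a minimal sketch, not the author's code: the function name to_wan is made up for illustration, and Decimal is used here (an assumption, the article does not say how the amounts are stored) so that 2000 / 10000 comes out as exactly 0.2.

```python
from decimal import Decimal

WAN = "万"  # unit marker: ten thousand


def to_wan(amount, unit):
    """Return the amount expressed in units of 万 (hypothetical helper)."""
    value = Decimal(amount)
    if unit == WAN:
        # Already in 万: leave the number unchanged
        return value
    # Not in 万: divide by 10000
    return value / Decimal(10000)


print(to_wan("2000", "不含万"))  # 0.2
print(to_wan("2000", WAN))       # 2000
```

Passing the amount as a string keeps the original digits intact before the division.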

Note: if the two passes above already meet the requirement, stop here. If the currency also has to be unified to RMB, do the third cleaning below.

3.2.3 Expected result of the third cleaning: unit and currency are unified to 万 + RMB

If the final requirement is to convert every currency to RMB, we build on the second cleaning and simply add one more condition:

# Third-pass condition: case when currency = 'RMB' then amount else amount * exchange rate from the currency to RMB end as third_amount

① field value = "2000 RMB"
0.2    万    RMB
② field value = "20 million RMB"
2000    万    RMB
③ field value = "20 million yuan foreign currency"
2000 * (exchange rate from the foreign currency to RMB)    万    RMB
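A Python version of the third-pass condition might look like the sketch below. The function name to_rmb and the rate table are hypothetical, and the rates are placeholder numbers for illustration, not real quotes; in practice they would come from wherever exchange rates are maintained.

```python
from decimal import Decimal

# Hypothetical exchange rates to RMB, placeholder values only
RATES_TO_RMB = {
    "USD": Decimal("7"),
    "HKD": Decimal("0.9"),
}


def to_rmb(amount_wan, currency, rates=RATES_TO_RMB):
    """Convert an amount in 万 to RMB; RMB amounts pass through unchanged."""
    if currency == "RMB":
        return amount_wan
    # A currency missing from the table raises KeyError instead of
    # silently producing a wrong amount
    return amount_wan * rates[currency]


print(to_rmb(Decimal("2000"), "RMB"))  # 2000
print(to_rmb(Decimal("2000"), "USD"))  # 14000
```

Letting an unknown currency fail loudly fits the article's remark that newly appearing foreign currencies have to be added by hand.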

IV. From the high-level logic to concrete code

Currency and unit each come down to two cases, so they are easy to write.

4.1. Currency

This is a simple condition: if a currency name appears in the string, the new field is set to that currency.

4.2. Unit (the 万 unit)

This condition is simple too: if the character 万 appears in the string, the unit variable is set to '万'; if it does not, the unit variable is set to '不含万'. Writing it this way makes the condition for the second pass over the number easy to write.
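Both conditions are plain substring checks, so they can be sketched with a small lookup table. This is an illustration only: the marker strings and the names CURRENCY_MARKERS, detect_unit and detect_currency are assumptions, and the sketch maps markers to short codes, whereas the article stores the matched currency name itself.

```python
# -*- coding: utf-8 -*-
WAN = "万"
NOT_WAN = "不含万"

# Assumed (marker, label) pairs; checked in order, first match wins
CURRENCY_MARKERS = [
    ("美元", "USD"),
    ("港", "HKD"),
    ("欧元", "EUR"),
]


def detect_unit(text):
    """Return '万' if the character appears in the text, else '不含万'."""
    return WAN if WAN in text else NOT_WAN


def detect_currency(text, default="RMB"):
    """Return the label of the first marker found, or the default (RMB)."""
    for marker, label in CURRENCY_MARKERS:
        if marker in text:
            return label
    return default


print(detect_unit("2000万人民币"), detect_currency("2000万人民币"))
print(detect_unit("300美元"), detect_currency("300美元"))
```

Adding a newly discovered currency then means adding one pair to the table rather than another elif branch.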

4.3. Number: some checks to keep the cleaned value logically equal to the original

"Logically equal to the original" means, for example, that a field value of 3.0001 万 that still reads 3.0001 with the unit 万 after cleaning counts as correct.

I first learned that filter(str.isdigit, field_value) can pull the digits out of a piece of text. After aggregating the same field with GROUP BY, however, I found values that contain a decimal point, and the extracted value no longer has it (for example '200100'). filter(str.isdigit, '20.01万') returns 2001, which is clearly wrong. So the cases with and without a decimal point have to be handled separately, so that the result stays the same as the original field.
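The pitfall is easy to reproduce. The sketch below is written for Python 3, where filter() returns an iterator and the characters have to be joined back into a string (in the article's Python 2.7, filter() on a str returns a str directly). The helper names digits_only and extract_number are made up for illustration.

```python
def digits_only(text):
    """Keep only the digit characters of the text."""
    return "".join(filter(str.isdigit, text))


def extract_number(text):
    """Extract the number while preserving a meaningful decimal part."""
    head, dot, tail = text.partition(".")
    whole = digits_only(head)
    frac = digits_only(tail)
    # Keep the fraction only if it contains a non-zero digit
    if dot and frac.strip("0"):
        return whole + "." + frac
    return whole or "0"


print(digits_only("20.01万"))     # 2001
print(extract_number("20.01万"))  # 20.01
print(extract_number("3.00万"))   # 3
```

Note that str.isdigit is False for the character 万, although str.isnumeric would be True for it, so isdigit is the right predicate here.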

IV. Main code of the first cleaning, without reading the database yet

Take about 10 outlier values from the database for testing; info is the value of the regcapital field.

    # Python 2: filter() on a str returns a str.
    # Values with a decimal point: split on the point, keep the digits on each side, join them again
    if '.' in info:
        int_part = filter(str.isdigit, info.split('.')[0])
        frac_part = filter(str.isdigit, info.split('.')[1])
        # Guard against an empty fractional part, where int('') would raise ValueError
        if frac_part and int(frac_part) > 0:
            derive_regcapital = int_part + '.' + frac_part
        else:
            derive_regcapital = int_part
    elif filter(str.isdigit, info) == '':
        derive_regcapital = '0'
    else:
        derive_regcapital = filter(str.isdigit, info)

    # Unit: either 万 or "does not contain 万"
    if '万' in info:
        derive_danwei = '万'
    else:
        derive_danwei = '不含万'

    # Currency, first cleaning: keep the foreign currency. Aggregating a large amount of data
    # showed that the foreign currencies present are the ones below; if a new one turns up,
    # add a branch here and update the data.
    # The currency names are shown in English here; the source data holds them in Chinese.
    if 'USD' in info:
        derive_currency = 'USD'
    elif 'HKD' in info:
        derive_currency = 'HKD'
    elif 'Afghani' in info:
        derive_currency = 'Afghani'
    elif 'AUD' in info:
        derive_currency = 'AUD'
    elif 'GBP' in info:
        derive_currency = 'GBP'
    elif 'Canadian dollar' in info:
        derive_currency = 'Canadian dollar'
    elif 'yen' in info:
        derive_currency = 'yen'
    elif 'franc' in info:
        derive_currency = 'franc'
    elif 'Euro' in info:
        derive_currency = 'Euro'
    elif 'Singapore' in info:
        derive_currency = 'Singapore dollar'
    else:
        derive_currency = 'RMB'

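Testing on a handful of values, as described above, is easier when the cleaning is wrapped in a function and checked against expected results. The sketch below does that in Python 3. It deliberately swaps in a regular expression for the number instead of the filter() approach, the currency check is cut down to one foreign currency, and the sample strings are invented test values, not data from the article.

```python
import re

# First run of digits with an optional decimal part
NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")


def clean(info):
    """Split a raw capital string into (number, unit, currency)."""
    match = NUMBER_RE.search(info)
    number = match.group(0) if match else "0"
    whole, _, frac = number.partition(".")
    if not frac.strip("0"):
        # No fraction, or a fraction of zeros only: keep the integer part
        number = whole
    unit = "万" if "万" in info else "不含万"
    currency = "USD" if "美元" in info else "RMB"
    return number, unit, currency


# Invented sample values with the results we expect
SAMPLES = {
    "2000万人民币": ("2000", "万", "RMB"),
    "(人民币)20.01万元": ("20.01", "万", "RMB"),
    "3.00万美元": ("3", "万", "USD"),
    "500": ("500", "不含万", "RMB"),
    "": ("0", "不含万", "RMB"),
}

for raw, expected in SAMPLES.items():
    assert clean(raw) == expected, (raw, clean(raw))
print("all samples cleaned as expected")
```

Unlike the filter() approach, the regular expression only takes the first number in the string, which may or may not be what the real data needs.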
V. Full code: read the data from the database and clean all of it

In step four I tested part of the data to verify that the code is correct. Now the logic is widened again from the high-level view: the info variable takes every value from the database, for a full cleaning run.

    # coding: utf-8
    from class_mysql import mysql

    project = mysql('S_58infor_data', [], 0, conn_type='Local')
    p2 = mysql('Etl1_58infor_data', [], 24, conn_type='Local')
    field_list = p2.select_fields(db='local_db', table='Etl1_58infor_data')
    print field_list
    project2 = mysql('Etl1_58infor_data', field_list=field_list, field_num=26, conn_type='Local')
    # It does not matter if the part above is unclear. I have two database environments, test and
    # production, with different database connections and network segments, so different parameters
    # are passed to switch between them. Even with a single environment and one database, data
    # processing needs frequent testing, so wrap the access in something that is easy to call.

    # data_tuple: I instantiate my own database class to read all fields of the table. The return
    # value is an immutable tuple. The cleaning keeps every field of the old table and adds
    # 3 cleaned fields.
    data_tuple = project.select(db='local_db', id=0)

    # Iterate over the tuple and store each field's value in a dict, for insertion into the table
    # Etl1_58infor_data, which has the 3 additional cleaned fields
    for data in data_tuple:
        item = {}
        # old_data leaves out the last field because that field is filled with the current
        # processing time. This makes it possible to measure how long the full run takes and to
        # time the second cleaning so that it fits the Kettle scheduled tasks.
        # The tuple is converted to a list because a tuple is immutable, and a null value met
        # during traversal would raise an error when converted to a string.
        old_data = list(data[:-1])
        # A null or empty value becomes an empty string, so info is always defined
        info = data[-2].encode('utf-8') if data[-2] else ''

        # Number: keep the decimal point when the fractional part is non-zero
        if '.' in info:
            int_part = filter(str.isdigit, info.split('.')[0])
            frac_part = filter(str.isdigit, info.split('.')[1])
            # Guard against an empty fractional part, where int('') would raise ValueError
            if frac_part and int(frac_part) > 0:
                derive_regcapital = int_part + '.' + frac_part
            else:
                derive_regcapital = int_part
        elif filter(str.isdigit, info) == '':
            derive_regcapital = '0'
        else:
            derive_regcapital = filter(str.isdigit, info)

        # Unit
        if '万' in info:
            derive_danwei = '万'
        else:
            derive_danwei = '不含万'

        # Currency (names shown in English here; the source data holds them in Chinese)
        if 'USD' in info:
            derive_currency = 'USD'
        elif 'HKD' in info:
            derive_currency = 'HKD'
        elif 'Afghani' in info:
            derive_currency = 'Afghani'
        elif 'AUD' in info:
            derive_currency = 'AUD'
        elif 'GBP' in info:
            derive_currency = 'GBP'
        elif 'Canadian dollar' in info:
            derive_currency = 'Canadian dollar'
        elif 'JPY' in info:
            derive_currency = 'JPY'
        elif 'Franc' in info:
            derive_currency = 'Franc'
        elif 'Euro' in info:
            derive_currency = 'Euro'
        elif 'Singapore' in info:
            derive_currency = 'Singapore dollar'
        else:
            derive_currency = 'RMB'

        time_58infor_data = p2.create_time()
        old_data.append(time_58infor_data)
        old_data.append(derive_regcapital)
        old_data.append(derive_danwei)
        old_data.append(derive_currency)
        # print len(old_data)
        for i in range(len(old_data)):
            if not old_data[i]:
                old_data[i] = ''
            # The replace() arguments are garbled in the source; stripping double quotes is an assumption
            data2 = old_data[i].replace('"', '')
            item[i + 1] = data2
        print item[1]
        # Insert into the table in the test environment
        project2.insert(item=item, db='local_db')

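The author's class_mysql module is not shown, so the read, clean, insert flow cannot be run as published. The sketch below reproduces only the shape of that flow in Python 3 with the standard sqlite3 module and an in-memory database. Everything in it is an assumption made for illustration: the table names source_rows and cleaned_rows, the two-column layout, the sample rows, and the simplified clean() stand-in.

```python
import sqlite3
from datetime import datetime


def clean(info):
    """Simplified stand-in for the cleaning step: (number, unit, currency)."""
    number = "".join(filter(str.isdigit, info)) or "0"
    unit = "万" if "万" in info else "不含万"
    currency = "USD" if "美元" in info else "RMB"
    return number, unit, currency


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source_rows (id INTEGER, regcapital TEXT)")
conn.execute(
    "CREATE TABLE cleaned_rows (id INTEGER, regcapital TEXT, create_time TEXT,"
    " derive_regcapital TEXT, derive_danwei TEXT, derive_currency TEXT)"
)
conn.executemany(
    "INSERT INTO source_rows VALUES (?, ?)",
    [(1, "2000万人民币"), (2, None), (3, "300美元")],
)

# One processing timestamp for the whole run
now = datetime.now().strftime("%Y-%m-%d %H:%M:%S")

rows = conn.execute("SELECT id, regcapital FROM source_rows").fetchall()
for row_id, regcapital in rows:
    info = regcapital or ""  # NULL becomes an empty string
    # Old fields, then the timestamp, then the 3 cleaned fields
    conn.execute(
        "INSERT INTO cleaned_rows VALUES (?, ?, ?, ?, ?, ?)",
        (row_id, info, now) + clean(info),
    )
conn.commit()

result = conn.execute(
    "SELECT id, derive_regcapital, derive_danwei, derive_currency"
    " FROM cleaned_rows ORDER BY id"
).fetchall()
print(result)
```

Bound parameters (the ? placeholders) remove the need to strip quote characters from values before inserting them.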
VI. Running the code

6.1 Read the original table's data and the fields created for the new table

(Figure: original table data read from the database and the fields created for the new table)

6.2 Insert into the new table and perform the first data cleaning

The part in the red box is the cleaned part; the other data has been desensitized.

(Figure: inserting into the new table and performing the first data cleaning)

6.3 Cleaning results in the data table

(Figure: cleaning results in the data table)

VII. Incremental data processing

New data enters the database every day. After the initial full run, we therefore have to use the timestamp field in the table to pick out the data added yesterday, then clean and insert it. That is left for the next blog post.
The preliminary plan is to use the functions below as parameters to determine the increment: create_time is the time at which the crawler script runs, and yesterday is yesterday's date. Restricting the WHERE condition with it selects the data that entered the database yesterday. Win7 supports scheduled tasks for running this.

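    import datetime
    from datetime import datetime as dt

    # A literal % is escaped by doubling it (%%)
    # Mainly used to build the SQL condition "where create_time like %s%%" % yesterday

    # Current time at which the script runs
    # (the function bodies are garbled in the source; only the format string and the "1)" survive,
    # so the calls below are a reconstruction)
    def create_time():
        create_time = dt.now().strftime('%Y-%m-%d %H:%M:%S')
        return create_time

    def yesterday():
        yestoday = datetime.date.today() - datetime.timedelta(days=1)
        return yestoday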
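One way the yesterday filter could be built is sketched below, in Python 3. It is not the author's implementation: the function names and the table name in the usage line are hypothetical. It passes the LIKE pattern as a bound parameter rather than formatting it into the SQL string, using the %s placeholder style of MySQL drivers such as MySQLdb and PyMySQL, which also avoids the %% escaping.

```python
import datetime


def yesterday_prefix(today=None):
    """Return yesterday's date as 'YYYY-MM-DD'; today can be injected for testing."""
    today = today or datetime.date.today()
    return (today - datetime.timedelta(days=1)).strftime("%Y-%m-%d")


def incremental_query(table, today=None):
    """Build (sql, params) selecting the rows whose create_time falls on yesterday."""
    # The table name cannot be a bound parameter, so it must come from trusted code
    sql = "SELECT * FROM {} WHERE create_time LIKE %s".format(table)
    params = (yesterday_prefix(today) + "%",)
    return sql, params


sql, params = incremental_query("etl1_table", datetime.date(2017, 3, 1))
print(sql)     # SELECT * FROM etl1_table WHERE create_time LIKE %s
print(params)  # ('2017-02-28%',)
```

With a DB-API cursor the pair would be passed as cursor.execute(sql, params).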
