Python processing instances of cluttered text data

Source: Internet
Author: User

First, the operating environment

1, Python version 2.7.13 blog code is this version
2. System environment: Win7 64-bit system

Second, the need to deal with the messy text data

Some of the data are as follows, the first field is the original field, followed by 3 is the field to be purged, from the Database aggregation field observation, at first glance the data comparison law, similar (currency amount million) so, I think with SQL write conditional judgment, unified conversion to ' million yuan ' units, with SQL script string interception can be done, But later found that the data is not regular, the condition to judge too much cleaning quality is not necessarily, some front is not the left parenthesis, some fields there is no currency, some numbers are not integers, some do not have words, so if stored into numbers and ' Million RMB ' unit two fields write SQL script complicated, MySQL I didn't find a function to extract numbers from text, regular expressions are often used in where conditions like, if anyone knows MySQL has a function like extracting numbers from text, you can tell me ha, so you don't have to pay so much effort to use kettle a Tools, tools ingenious the best.

In combination with Python's experience, Python has many functions for string filtering, which is used in the code later to filter the text.


First-time Partial cleaning data

On the macroscopic logic of data processing

Get the data, do not rush to write code, first think about the cleaning logic, this is critical, the direction of the more effective, the rest of the time is the code to implement the logic and debugging code process.

3.1 Thinking process does not write code:

I want to achieve the final data cleaning is to convert the money field into "Amount + units + each currency" combination form or "Amount + unit + Unified renminbi currency" (currency exchange rate conversion), two or three steps can be

3.1.1 Split three fields, numbers, units, currencies

(The unit is divided into million and does not contain million, the currency is divided into renminbi and specific foreign currency)

3.1.2 Unified unit to million units

The first step in the unit is not million of the number of parts/10000, is the number of thousands of parts remain unchanged

3.1.3 The currency into renminbi

The currency is the first two fields of the renminbi are unchanged, not the number of parts into the number * exchange rate of foreign currencies to the renminbi, the unit is still the second step unified ' million '

3.2 Stage look at the steps of the cleaning effect data list:

Starting with this result, we take steps to disassemble, first comb the cleaning logic part

3.2.1 First cleaning expected effect split three field number unit currency:

① field value = "2000 RMB", first time cleaning
2000 不含万 人民币
② field value = "20 million RMB", first time cleaning
2000 万 人民币
③ field value = "20 million yuan foreign currency", first time cleaning
2000 万 外币

3.2.2 Second cleaning the desired effect is to unify the unit into million:

#二次处理条件case when unit = ' then amount ' else amount/10000 end as second amount

① field value = "2000 RMB"
0.2 万 人民币
② field value = "20 million RMB"
2000 万 人民币
③ field value = "20 million yuan foreign currency"
2000 万 外币

Note: If the above requirements are cleaned, if you want to change the unit to RMB, do the following three times cleaning

3.2.3 Third cleaning expected effect: The unit currency is unified to million + RMB

If the final demand is converted into currency unified renminbi, then we will be on the basis of two cleaning, then write the conditions on the good,

#三次处理条件case when currency = ' renminbi ' then amount else amount * currency and RMB conversion rate end as third amount

① field value = "2000 RMB"
0.2 万 人民币
② field value = "20 million RMB"
2000 万 人民币
③ field value = "20 million yuan foreign currency"
2000*外币兑换人民币汇率 万 人民币

IV. macro-logical thinking on specific code

Currency and units These two are 2 cases, very well written

4.1. Currency section

This is a simple condition, if the value of the currency appears in the character, let the new field be equal to the value of the currency.

4.2. Units (million units)

This condition is also simple, million characters appear in the character unit this variable = ' million ' does not appear to let the unit variable equals ' does not contain million ', so write is to facilitate the next two times the number of processing time to write the conditions to judge.

4.3, the number of parts to ensure that after cleaning and the original value of the same logic to do some judgment

Ensure that after cleaning and the original value logically the same means that if there is such a field 3.0001 million after cleaning into 3.0001 million yuan is also correct.

filter(str.isdigit,字段的值)This code I first know that the text can be taken out, the same field group by aggregation after the field of the decimal point, the value is removed no longer with a decimal point, such as ' 200100 ', the filter(str.isdigit,‘20.01万’) number taken out is 2001, obviously this number is not correct, Therefore, it is necessary to consider the situation with or without a decimal point, the same as the original field

Four, first cleaning the main code, do not read the database data first

Extract outliers from the database 10 or so test, info is the value of the Regcapital field

   #带小数点的以小数点分割 remove the part of the decimal point before and after stitching if '. ' in info and int (filter (Str.isdigit,info.split ('. ') [1])) >0:derive_regcapital=filter (Str.isdigit,info.split ('. ') [0]) +'.' +filter (Str.isdigit,info.split ('. ') [1]) elif '. ' in info and int (filter (Str.isdigit,info.split ('. ') [1])) ==0:derive_regcapital = Filter (Str.isdigit, Info.split ('. ') [0]) elif filter (str.isdigit,info) = = ": derive_regcapital= ' 0 ' else:derive_regcapital=filter (str.isdigit,i NFO) #单位 with million and does not contain million for unified if ' million ' in info:derive_danwei= ' else:derive_danwei= ' without million ' #币种 first clean foreign currency reserved foreign currency field aggregates large number of data discoveries In the case of foreign currencies, there are some cases where the new foreign currency appears if the update operation for the data is available if ' USD ' in info:derive_currency= ' USD ' Elif ' HKD ' in Info:de rive_currency = ' HKD ' elif ' afghani ' in info:derive_currency = ' Afghani ' elif ' AUD ' in info:derive_currency = ' AUD ' elif ' GBP ' in info:derive_currency = ' GBP ' elif ' Canadian dollar ' in info:derive_currency = ' Canadian dollar ' Eli F ' JPY ' in info:derive_currency = ' yen ' ElIf ' HKD ' in info:derive_currency = ' HKD ' elif ' franc ' in info:derive_currency = ' franc ' elif ' Euro ' in info: derive_currency = ' euro ' elif ' singapore ' in info:derive_currency = ' singapore dollar ' else:derive_currency = ' RMB '

Five, all code: Read the database data for the full amount of cleaning

The fourth step I was to test some of the data, verify that the code is correct, at this time, the logic should be further expanded from the macro, the info variable into the database all the values, for the full amount of cleaning

#coding: Utf-8from class_mysql import mysqlproject=mysql (' S_58infor_data ', [],0,conn_type= ' local ') p2=mysql (' etl1_ 58infor_data ', [],24,conn_type= ' local ') field_list=p2.select_fields (db= ' local_db ', table= ' Etl1_58infor_data ') Print Field_listproject2=mysql (' Etl1_58infor_data ', field_list=field_list,field_num=26,conn_type= ' local ') #以上部分 I don't understand. Because I have two sets of database environments, test and production # different database connections and network segments, so you have to pass different parameters to switch between database and data connection if a set of environments connects to a database, data processing requires frequent testing to make it easy for you to call data_tuple= Project.select (db= ' local_db ', id=0) #data_tuple is the class that I instantiate my own write operations database to read the full field of database data, the return value is an immutable object tuple tuple, cleaning need to keep the old table all fields, Add 3 Clean data fields Data_tuple=project.select (db= ' local_db ', id=0) #遍历元组 use a dictionary to store values for each field into a table that adds 3 cleaning fields etl1_58infor_datafor Data in data_tuple:item={} #old_data不取最后一个字段 because that field I want to use the current processing time #这样可以计算数据总量运行的时间 to adjust the time of two cleaning to and Kettle timed task docking # tuple into the list The reason for the conversion is that the tuple is immutable if there is a null value traversal in the data converted to a string error old_data=list (Data[:-1]) if Data[-2]:if len (data[-2]) >0:info= Data[-2].encode (' utf-8 ') else:info= ' if '. ' in info and int (filter ('. ') (Str.isdigit,info.split (') ') [1])) >0:derive_regcapital=filter (Str.isdigit,info.split ('. ') [0]) +'.' +filter (Str.isdigit,info.split ('. ') [1]) Elif '. ' in info and int (filter (Str.isdigit,info.split ('. ') [1])) ==0:derive_regcapital = Filter (Str.isdigit, Info.split ('. ') [0]) Elif filter (str.isdigit,info) = = ": derive_regcapital= ' 0 ' else:derive_regcapital=filter (str.isdigit,info) if ' Million ' in info:derive_danwei= ' else:derive_danwei= ' does not contain million ' if ' USD ' in info:derive_currency= ' USD ' Elif ' port Coins ' in info:derive_currency = ' HKD ' elif ' afghani ' in info:derive_currency = ' Afghani ' elif ' AUD ' in Info:der ive_currency = ' AUD ' elif ' GBP ' in info:derive_currency = ' GBP ' elif ' Canadian dollar ' in info:derive_currency = ' Canadian dollar ' El If ' yen ' in info:derive_currency = ' yen ' elif ' HKD ' in info:derive_currency = ' HKD ' elif ' franc ' in Info:de rive_currency = ' franc ' elif ' Euro ' in info:derive_currency = ' euro ' elif ' singapore ' in info:derive_currency = ' singapore dollar ' els E:derive_Currency = ' RMB ' time_58infor_data = P2.create_time () old_data.append (time_58infor_data) old_data.append (derive_r Egcapital) old_data.append (Derive_danwei) old_data.append (derive_currency) #print len (old_data) for I in range (len, old _data): If not Old_data[i]: old_data[i]= ' else:pass data2=old_data[i].replace (' "', ') item[i+1] =data2print item[1] #插入测试环境 table Project2.insert (item=item,db= ' local_db ')

Vi. operation of the code

6.1 Read the Database raw table data and the fields created by the new table


Read the database's original table data and the fields created by the new table

6.2 Inserting a new table and cleaning the data for the first time

The red frame part is the cleaning part, the other data does the desensitization processing


Insert a new table and perform the first data cleansing

6.3 Data Sheet Data Cleansing Results


Data Sheet Data Cleansing Results

VII. Incremental Data processing

Since the data is incrementally entered every day, after the first execution of the initial session, we have to judge by the timestamp field in the table, read the new data yesterday for cleaning and inserting, which is left to the next blog post.
The preliminary plan is to use the following function as a parameter to determine the increment create_time is the time when the crawler script executes, yesterday is yesterday time, in the Where condition to limit, take out yesterday entered the database of data to perform Win7 system support timed task

Import datetimefrom datetime import datetime as dt#% escaped using percent of the # main constructs SQL in condition "where create_time like%s%%"  % yesterday# Write script run current time def create_time (self):    create_time = Dt.now (). Strftime ('%y-%m-%d%h:%m:%s ') return create_timedef Yesterday (self):    yestoday= datetime.date.today ()-datetime.timedelta (Days=1) return Yestoday

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.