Data Processing Based on Python Crawlers (Detailed Walkthrough)
1. First, get to know the following MySQL functions:
Setting a variable, the length() function, the char_length() function, the replace() function, and the max() function.
1.1 Setting a variable: SET @variable_name = value

SET @address = 'China-Shandong Province-Liaocheng City-Xinxian County';
SELECT @address;
1.2 The difference between the length() and char_length() functions

SELECT length('a'), char_length('a'), length('中'), char_length('中');

length() returns the length in bytes, while char_length() returns the number of characters, so the two differ for multi-byte characters such as Chinese.
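As a cross-check, the same byte-versus-character distinction can be reproduced in Python. This is a minimal sketch, assuming the column is stored in UTF-8 (as with MySQL's utf8mb4); the function names are illustrative, not real MySQL APIs:

```python
def mysql_length(s: str) -> int:
    """Byte length, like MySQL length() on a UTF-8 column."""
    return len(s.encode("utf-8"))

def mysql_char_length(s: str) -> int:
    """Character count, like MySQL char_length()."""
    return len(s)

# An ASCII letter is 1 byte in UTF-8; a CJK character such as '中' is 3 bytes.
print(mysql_length("a"), mysql_char_length("a"))   # 1 1
print(mysql_length("中"), mysql_char_length("中"))  # 3 1
```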
1.3 Combining the replace() and length() functions

SET @address = 'China-Shandong Province-Liaocheng City-Xinxian County';
SELECT @address,
       replace(@address, '-', '') AS address_1,
       length(@address) AS len_add1,
       length(replace(@address, '-', '')) AS len_add2,
       length(@address) - length(replace(@address, '-', '')) AS _count;
During ETL field cleansing, how do you decide how many split fields to add to the new table when the field has an obvious delimiter?
Compute the maximum number of delimiters in com_industry; the number of fields to split into is that maximum plus 1. For this table the maximum is 3, so four industry fields can be split out, i.e. four industry levels.
SELECT max(length(com_industry) - length(replace(com_industry, '-', ''))) AS _max_count
FROM etl1_socom_data;
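The separator-counting trick (length minus length-after-replace equals the number of delimiters) can be sketched outside the database as well. The sample values below are invented for illustration; they are not from the real com_industry column:

```python
def sep_count(value: str, sep: str = "-") -> int:
    # length(x) - length(replace(x, sep, '')) equals the number of separators in x
    return len(value) - len(value.replace(sep, ""))

# hypothetical com_industry values
rows = ["manufacturing", "transport-warehousing", "a-b-c-d"]
max_count = max(sep_count(r) for r in rows)
print(max_count)      # 3  (most separators seen in any row)
print(max_count + 1)  # 4  (number of fields to split into)
```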
1.4 Setting a variable and using the substring_index() string-truncation function

SET @address = 'China-Shandong Province-Liaocheng City-Xinxian County';
SELECT substring_index(@address, '-', 1) AS china,
       substring_index(substring_index(@address, '-', 2), '-', -1) AS province,
       substring_index(substring_index(@address, '-', 3), '-', -1) AS city,
       substring_index(@address, '-', -1) AS district;
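substring_index(str, delim, n) returns everything before the n-th delimiter, or everything after it counting from the right when n is negative; nesting a positive and a negative call extracts the middle segments. A small Python equivalent, covering only the simple cases used here, may make the nesting easier to follow:

```python
def substring_index(s: str, delim: str, count: int) -> str:
    """Mimic MySQL SUBSTRING_INDEX for the cases used in this article."""
    parts = s.split(delim)
    if count > 0:
        return delim.join(parts[:count])   # everything before the count-th delimiter
    return delim.join(parts[count:])       # everything after it, from the right

addr = "China-Shandong Province-Liaocheng City-Xinxian County"
print(substring_index(addr, "-", 1))                            # China
print(substring_index(substring_index(addr, "-", 2), "-", -1))  # Shandong Province
print(substring_index(substring_index(addr, "-", 3), "-", -1))  # Liaocheng City
print(substring_index(addr, "-", -1))                           # Xinxian County
```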
1.5 The conditional function CASE WHEN

CASE WHEN condition THEN value ELSE value END AS field_name

SELECT CASE WHEN 89 > 101 THEN 'greater than' ELSE 'less than' END AS betl1_socom_data;
II. Kettle transformation: etl1 cleaning
First, create the target table (the steps are demonstrated in the video).
A BTREE index is recommended to improve query efficiency.
2.1 Kettle file name: trans_etl1_socom_data
2.2 Controls included: table input >>> table output
2.3 Data flow direction: s_socom_data >>> etl1_socom_data
(Figure: Kettle transformation 1)
2.4 Table input: SQL script for preliminary cleansing of the com_district and com_industry fields

SELECT a.*,
       CASE WHEN com_district LIKE '%ye' OR com_district LIKE '%zhi' OR com_district LIKE '%yu'
            THEN NULL ELSE com_district END AS com_district1,
       CASE WHEN com_district LIKE '%ye' OR com_district LIKE '%zhi' OR com_district LIKE '%yu'
            THEN concat(com_district, '-', com_industry) ELSE com_industry END AS com_industry_total,
       replace(com_addr, 'Address:', '') AS com_addr1,
       replace(com_phone, 'Tel:', '') AS com_phone1,
       replace(com_fax, 'Fax:', '') AS com_fax1,
       replace(com_mobile, 'Mobile:', '') AS com_mobile1,
       replace(com_url, 'URL:', '') AS com_url1,
       replace(com_email, 'Email:', '') AS com_email1,
       replace(com_contactor, 'Contact:', '') AS com_contactor1,
       replace(com_emploies_nums, 'Number of employees:', '') AS com_emploies_nums1,
       replace(com_reg_capital, 'Registered capital: wan', '') AS com_reg_capital1,
       replace(com_type, 'Economic type:', '') AS com_type1,
       replace(com_product, 'Company products:', '') AS com_product1,
       replace(com_desc, 'Company profile:', '') AS com_desc1
FROM s_socom_data AS a
2.5 Table output

Points to note in the table output settings:
① For incremental crawler runs, do not select the "truncate table" option.
② Database connection: in the table output step, select the database that contains the target table.
③ Field mapping: make sure the number of fields in the data stream matches the number of fields in the physical table.
III. Kettle transformation: etl2 cleaning
First, add four fields to the table (the steps are demonstrated in the video).
A BTREE index is recommended to improve query efficiency.
The new com_industry field generated by etl1 is split into fields and cleaned.
3.1 Kettle file name: trans_etl2_socom_data
3.2 Controls included: table input >>> table output
3.3 Data flow direction: etl1_socom_data >>> etl2_socom_data
Note:
① For incremental crawler runs, do not select the "truncate table" option.
② Database connection: in the table output step, select the database that contains the target table.
③ Field mapping: make sure the number of fields in the data stream matches the number of fields in the physical table.
(Figure: Kettle transformation 2)
3.4 Table input: the SQL script splits com_industry and finishes cleaning all fields. For lack of time, the registered-capital field is not covered in detail; just adjust the code as needed.

SELECT a.*,
       CASE
            # an empty industry value is set to NULL
            WHEN length(com_industry) = 0 THEN NULL
            # otherwise take everything before the first separator
            ELSE substring_index(com_industry, '-', 1)
       END AS com_industry1,
       CASE
            WHEN length(com_industry) - length(replace(com_industry, '-', '')) = 0 THEN NULL
            # values such as 'transportation, warehousing and postal services-' have an
            # empty level 2, which is also set to NULL
            WHEN length(com_industry) - length(replace(com_industry, '-', '')) = 1
                 AND length(substring_index(com_industry, '-', -1)) = 0 THEN NULL
            WHEN length(com_industry) - length(replace(com_industry, '-', '')) = 1
                 THEN substring_index(com_industry, '-', -1)
            ELSE substring_index(substring_index(com_industry, '-', 2), '-', -1)
       END AS com_industry2,
       CASE
            WHEN length(com_industry) - length(replace(com_industry, '-', '')) <= 1 THEN NULL
            WHEN length(com_industry) - length(replace(com_industry, '-', '')) = 2
                 THEN substring_index(com_industry, '-', -1)
            ELSE substring_index(substring_index(com_industry, '-', 3), '-', -1)
       END AS com_industry3,
       CASE
            WHEN length(com_industry) - length(replace(com_industry, '-', '')) <= 2 THEN NULL
            ELSE substring_index(com_industry, '-', -1)
       END AS com_industry4
FROM etl1_socom_data AS a
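The CASE logic above boils down to one rule: split on '-', pad the result to four levels with NULLs, and treat empty segments as missing. A simplified Python sketch of that rule (it treats any empty segment as NULL, which matches the script's handling of a trailing separator):

```python
def split_industry(com_industry: str):
    """Split 'a-b-c-d' into up to four industry levels, padding with None (NULL)."""
    if not com_industry:
        return [None, None, None, None]
    # empty segments (e.g. after a trailing '-') become None, like the SQL's NULL
    parts = [p if p else None for p in com_industry.split("-")]
    return (parts + [None] * 4)[:4]

print(split_industry("a-b-c-d"))  # ['a', 'b', 'c', 'd']
print(split_industry("a-b"))      # ['a', 'b', None, None]
print(split_industry("a-"))       # ['a', None, None, None]
print(split_industry(""))         # [None, None, None, None]
```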
IV. Quality check of the cleaning results
4.1 Check whether the crawler data source is consistent with the website data
If crawling and data processing are done by the same person, this step can be omitted. If the data comes from an upstream crawler, do this check first; otherwise the cleaning may be wasted effort. The crawler is generally required to store the requested URL so that data processing can later verify data quality against the source page.
4.2 Compare row counts between the crawler source table and each ETL cleansing table
Note: the SQL scripts apply no filtering or aggregation, so the three tables must contain equal row counts.
4.2.1 SQL query (the tables below are in the same database; if they are not, prefix each table name with the name of its database).
This check is not recommended for very large data volumes.

SELECT count(1) FROM s_socom_data
UNION ALL
SELECT count(1) FROM etl1_socom_data
UNION ALL
SELECT count(1) FROM etl2_socom_data;
4.2.2 Compare the total output row counts of the table output steps after each Kettle transformation completes.
(Figure: total output row counts of the Kettle table output steps)
4.3 Inspect the ETL cleaning quality
Once the first two checks pass, the person responsible for the ETL cleaning self-checks by writing scripts against the cleaned fields. For the socom website, the cleaning mainly concerns regions and industries; the other fields only had redundant label text stripped, so a script check is sufficient.
Find the page_url and compare against the website data for verification.
The WHERE clause below is used to inspect the cleaning status of a given field.
SELECT * FROM etl2_socom_data
WHERE com_district IS NULL
  AND length(com_industry) - length(replace(com_industry, '-', '')) = 3;
Compare the data on the page http://www.socom.cn/company/7320798.html with the corresponding final cleaned row in the etl2_socom_data table.
(Figure: website page data)
(Figure: etl2_socom_data table data)
The cleaning is now complete.
The article above on data processing based on Python crawlers (detailed walkthrough) is all I have to share; I hope it gives you a useful reference, and thank you for supporting the house of helpers.