Python crawler data processing

Source: Internet
Author: User

First, understand the following functions

Setting variables, length(), char_length(), replace(), max()
1.1. Setting a variable: set @variable_name = value

set @address='中国-山东省-聊城市-莘县';
select @address;

1.2. Difference between length() and char_length()

select length('a'), char_length('a'), length('中'), char_length('中');
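In MySQL, length() returns the length in bytes and char_length() returns the length in characters, so the two differ for multi-byte characters. A minimal Python sketch of the same distinction, assuming the text is stored as UTF-8; the helper names mysql_length and mysql_char_length are made up for illustration:

```python
def mysql_length(s: str) -> int:
    # Byte length, assuming UTF-8 storage (an assumption about the column charset)
    return len(s.encode("utf-8"))


def mysql_char_length(s: str) -> int:
    # Character length: Python's len() already counts code points
    return len(s)


print(mysql_length("a"), mysql_char_length("a"),
      mysql_length("中"), mysql_char_length("中"))
# prints: 1 1 3 1
```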

1.3. Combining replace() and length()

set @address='中国-山东省-聊城市-莘县';
select @address,
  replace(@address,'-','') as address_1,
  length(@address) as len_add1,
  length(replace(@address,'-','')) as len_add2,
  length(@address)-length(replace(@address,'-','')) as _count;
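The length difference before and after removing the delimiter equals the number of delimiters, because each '-' occupies exactly one byte. A rough Python equivalent, assuming UTF-8; count_delimiters is a hypothetical helper, and the division by the delimiter's byte length is an addition here so that multi-byte delimiters would also work:

```python
def count_delimiters(value: str, delim: str = "-") -> int:
    # Same idea as length(x) - length(replace(x, '-', '')) in the SQL above
    before = len(value.encode("utf-8"))
    after = len(value.replace(delim, "").encode("utf-8"))
    return (before - after) // len(delim.encode("utf-8"))


address = "中国-山东省-聊城市-莘县"
print(count_delimiters(address))
# prints: 3
```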

During ETL cleaning, when a field has an obvious delimiter, how do you decide how many split fields to add to the new data table?

Compute the maximum number of '-' delimiters in com_industry; the number of split fields needed is that maximum plus 1. For this table the maximum is 3, so the field can be split into 4 industry fields, that is, 4 industry levels.

select max(length(com_industry)-length(replace(com_industry,'-',''))) as _max_count
from etl1_socom_data;
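The "maximum delimiter count plus 1" rule can be sketched in Python as follows. The sample values are made up for illustration and are not taken from the real table; None stands in for SQL NULL, which max() ignores:

```python
def fields_needed(values, delim="-"):
    # Largest number of delimiters in any non-null value, plus one
    max_count = max((v.count(delim) for v in values if v is not None), default=0)
    return max_count + 1


sample = ["A-B-C-D", "A-B", "A", None]  # hypothetical com_industry values
print(fields_needed(sample))
# prints: 4
```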

1.4. Using the substring_index() string-extraction function

set @address='中国-山东省-聊城市-莘县';
select substring_index(@address,'-',1) as china,
  substring_index(substring_index(@address,'-',2),'-',-1) as province,
  substring_index(substring_index(@address,'-',3),'-',-1) as city,
  substring_index(@address,'-',-1) as district;
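substring_index(str, delim, count) returns everything before the count-th delimiter when count is positive, and everything after the count-th delimiter from the right when count is negative. A small Python sketch of that behavior, assuming a non-empty delimiter; this is an illustration, not MySQL's implementation:

```python
def substring_index(s: str, delim: str, count: int) -> str:
    # Sketch of MySQL's substring_index(); assumes delim is not empty
    if count == 0:
        return ""
    parts = s.split(delim)
    if count > 0:
        return delim.join(parts[:count])   # left of the count-th delimiter
    return delim.join(parts[count:])       # right of the count-th delimiter from the end


address = "中国-山东省-聊城市-莘县"
country = substring_index(address, "-", 1)
province = substring_index(substring_index(address, "-", 2), "-", -1)
city = substring_index(substring_index(address, "-", 3), "-", -1)
district = substring_index(address, "-", -1)
print(country, province, city, district)
# prints: 中国 山东省 聊城市 莘县
```

Nesting a positive count inside a count of -1 is what picks out the n-th segment.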

1.5. The case conditional expression
Syntax: case when condition then value else value end as field_name

select case when 89>101 then 'greater' else 'less' end as b;

Second, Kettle transformation: ETL1 cleaning

First create the table; the steps are shown in the video.
Index the fields; where no index algorithm is mentioned, use the BTREE algorithm to improve query efficiency.

2.1. Kettle file name: trans_etl1_socom_data
2.2. Steps used: Table input >>> Table output
2.3. Data flow: s_socom_data >>> etl1_socom_data


Kettle transformation 1

2.4. Table input: SQL script for preliminary cleaning of the com_district and com_industry fields

select a.*,
  case when com_district like '%industry' or com_district like '%weaving' or com_district like '%bred'
    then null else com_district end as com_district1,
  case when com_district like '%industry' or com_district like '%weaving' or com_district like '%bred'
    then concat(com_district,'-',com_industry) else com_industry end as com_industry_total,
  replace(com_addr,'Address:','') as com_addr1,
  replace(com_phone,'Phone:','') as com_phone1,
  replace(com_fax,'Fax:','') as com_fax1,
  replace(com_mobile,'Mobile:','') as com_mobile1,
  replace(com_url,'URL:','') as com_url1,
  replace(com_email,'Email:','') as com_email1,
  replace(com_contactor,'Contact:','') as com_contactor1,
  replace(com_emploies_nums,'Number of employees:','') as com_emploies_nums1,
  replace(com_reg_capital,'Registered capital: million','') as com_reg_capital1,
  replace(com_type,'Economic type:','') as com_type1,
  replace(com_product,'Company product:','') as com_product1,
  replace(com_desc,'Company profile:','') as com_desc1
from s_socom_data as a
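The script does two things per row: when com_district actually holds an industry name (it ends with one of the listed suffixes), the value is moved in front of com_industry and the district is set to null; and the label prefix is stripped from each contact field. A hedged Python sketch of that logic for a single row follows. The function etl1_clean, the suffix tuple, the two sample labels, and the sample row are illustrative assumptions, and the comparison is case-sensitive, unlike like under a case-insensitive collation:

```python
SUFFIXES = ("industry", "weaving", "bred")                 # assumed suffix patterns
LABELS = {"com_addr": "Address:", "com_phone": "Phone:"}   # subset of columns, for brevity


def etl1_clean(row: dict) -> dict:
    out = dict(row)
    district = row.get("com_district")
    industry = row.get("com_industry")
    misplaced = district is not None and district.endswith(SUFFIXES)

    out["com_district1"] = None if misplaced else district
    if misplaced:
        # concat() returns NULL if any argument is NULL
        out["com_industry_total"] = None if industry is None else f"{district}-{industry}"
    else:
        out["com_industry_total"] = industry

    for col, label in LABELS.items():
        value = row.get(col)
        out[col + "1"] = None if value is None else value.replace(label, "")
    return out


row = {"com_district": "Textile industry", "com_industry": "Cotton",
       "com_addr": "Address:1 Main St", "com_phone": "Phone:123"}  # made-up row
cleaned = etl1_clean(row)
print(cleaned["com_district1"], cleaned["com_industry_total"], cleaned["com_addr1"])
```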


2.5. Table output

Table output settings

Precautions:
① For incremental crawler loads, do not tick the Truncate table option.
② Database connection: select the target database in the Table output step.
③ Field mapping: make sure the fields in the data stream match the fields in the physical table, both in number and in correspondence.

Third, Kettle transformation: ETL2 cleaning

First create the table with 4 added fields; the steps are demonstrated in the video.
Index the fields; where no index algorithm is mentioned, use the BTREE algorithm to improve query efficiency.

This step mainly splits the new com_industry field generated by ETL1.
3.1. Kettle file name: trans_etl2_socom_data
3.2. Steps used: Table input >>> Table output
3.3. Data flow: etl1_socom_data >>> etl2_socom_data
Precautions:
① For incremental crawler loads, do not tick the Truncate table option.
② Database connection: select the target database in the Table output step.
③ Field mapping: make sure the fields in the data stream match the fields in the physical table, both in number and in correspondence.


Kettle transformation 2

3.4. SQL script that completes the cleaning of the com_industry field. Because of time constraints, the registered-capital field is not split in detail; adjust the code if you need that.

select a.*,
  case
    # an industry value of '' is set to null
    when length(com_industry)=0 then null
    # otherwise take the part before the first '-' delimiter
    else substring_index(com_industry,'-',1)
  end as com_industry1,
  case
    when length(com_industry)-length(replace(com_industry,'-',''))=0 then null
    # for a value such as 'Transportation, warehousing and the postal industry-', industry2 is also set to null
    when length(com_industry)-length(replace(com_industry,'-',''))=1
      and length(substring_index(com_industry,'-',-1))=0 then null
    when length(com_industry)-length(replace(com_industry,'-',''))=1 then substring_index(com_industry,'-',-1)
    else substring_index(substring_index(com_industry,'-',2),'-',-1)
  end as com_industry2,
  case
    when length(com_industry)-length(replace(com_industry,'-',''))<=1 then null
    when length(com_industry)-length(replace(com_industry,'-',''))=2 then substring_index(com_industry,'-',-1)
    else substring_index(substring_index(com_industry,'-',3),'-',-1)
  end as com_industry3,
  case
    when length(com_industry)-length(replace(com_industry,'-',''))<=2 then null
    else substring_index(com_industry,'-',-1)
  end as com_industry4
from etl1_socom_data as a
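The four case expressions above amount to: split on '-', fill up to four levels, and use null for levels that are missing or for an empty value after a trailing delimiter. A Python sketch of the same branching, written to follow the script's conditions one by one; split_industry is a hypothetical name and the test values are invented:

```python
def split_industry(value):
    """Return (industry1, industry2, industry3, industry4), with None for SQL NULL."""
    if value is None:
        return (None, None, None, None)
    parts = value.split("-")
    n = len(parts) - 1  # number of '-' delimiters

    industry1 = None if len(value) == 0 else parts[0]

    if n == 0:
        industry2 = None
    elif n == 1:
        # A trailing delimiter such as 'X-' leaves an empty second level
        industry2 = None if parts[-1] == "" else parts[-1]
    else:
        industry2 = parts[1]

    if n <= 1:
        industry3 = None
    elif n == 2:
        industry3 = parts[-1]
    else:
        industry3 = parts[2]

    industry4 = None if n <= 2 else parts[-1]
    return (industry1, industry2, industry3, industry4)


print(split_industry("A-B-C-D"))
# prints: ('A', 'B', 'C', 'D')
```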
Fourth, quality inspection of the cleaning results

4.1 Check that the crawler source data matches the website data

If you do both the crawling and the data processing yourself, you have in effect already checked this while scraping, and this step can be skipped. If you receive data from an upstream crawler colleague, check this first; otherwise the cleaning effort may be wasted. It is generally worth asking the crawler colleague to store the request URL so that data quality can be verified.

4.2 Count the rows in the crawler source table and in each ETL cleaning table

Note: the SQL scripts do no aggregation or filtering, so the 3 tables should contain the same number of rows.

4.2.1 SQL query. Here all the tables are in the same database; if yours are not, prefix each table name after from with its database name.
Not recommended when the data volume is large.

select count(1) from s_socom_data
union all
select count(1) from etl1_socom_data
union all
select count(1) from etl2_socom_data
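The row-count comparison can also be automated. The sketch below runs the same union all query from Python and checks that all counts are equal. It uses an in-memory SQLite database with three toy tables purely as a stand-in so the example is self-contained; against the real MySQL database you would pass a connection from your MySQL driver instead. row_counts is a hypothetical helper:

```python
import sqlite3

TABLES = ["s_socom_data", "etl1_socom_data", "etl2_socom_data"]


def row_counts(conn, tables):
    # Builds: select count(1) from t1 union all select count(1) from t2 ...
    query = " union all ".join(f"select count(1) from {t}" for t in tables)
    return [r[0] for r in conn.execute(query)]


# Stand-in data: three tiny tables with the same number of rows
conn = sqlite3.connect(":memory:")
for t in TABLES:
    conn.execute(f"create table {t} (id integer)")
    conn.executemany(f"insert into {t} values (?)", [(1,), (2,), (3,)])

counts = row_counts(conn, TABLES)
all_equal = len(set(counts)) == 1
print(counts, all_equal)
```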

4.2.2 Compare the Table output totals after running the Kettle transformations


Kettle Table output total data volume

4.3 Check ETL cleaning quality

Once the first two steps are confirmed correct, whoever is responsible for the ETL cleaning starts a self-check by writing a script against the fields that were cleaned. For the SOCOM site the cleaning mainly concerns region and industry; the other fields only had extra label text removed with replace. So use a script to pick rows,
then find the page_url and verify against the website data.

Write the where clause so that the cleaning of a given field is easy to inspect.

select * from etl2_socom_data
where com_district is null
  and length(com_industry)-length(replace(com_industry,'-',''))=3

Compare the data on the page http://www.socom.cn/company/7320798.html with the final cleaned data in the etl2_socom_data table.


Site page data
etl2_socom_data table data

Cleaning work completed.

