First, understand the following functions
Setting a variable with set, the length() and char_length() functions, the replace() function, and the max() function
1.1. Setting a variable: set @variable_name = value
set @address = '中国-山东省-聊城市-莘县';
select @address;
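1.2. Difference between the length() and char_length() functions
length() returns the length in bytes, while char_length() returns the length in characters.
select length('a'), char_length('a'), length('中'), char_length('中');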
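The same distinction can be mirrored outside the database. This is a minimal Python sketch, assuming a UTF-8 character set (where '中' takes 3 bytes); the helper names mysql_length and mysql_char_length are hypothetical and only illustrate the idea.

```python
def mysql_length(s: str, encoding: str = "utf-8") -> int:
    """Byte count, like MySQL length() under a UTF-8 character set (assumption)."""
    return len(s.encode(encoding))


def mysql_char_length(s: str) -> int:
    """Character count, like MySQL char_length()."""
    return len(s)


for text in ("a", "中"):
    # Print the value, its byte length, and its character length.
    print(text, mysql_length(text), mysql_char_length(text))
```

For single-byte characters the two counts agree; they only diverge for multi-byte text such as Chinese.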
1.3. Combining the replace() and length() functions
Removing the delimiter and comparing the lengths before and after gives the number of delimiters in the value.
set @address = '中国-山东省-聊城市-莘县';
select @address,
  replace(@address, '-', '') as address_1,
  length(@address) as len_add1,
  length(replace(@address, '-', '')) as len_add2,
  length(@address) - length(replace(@address, '-', '')) as _count;
During ETL cleaning, when a field contains an obvious delimiter, how do you decide how many split fields to add to the new table?
Compute the maximum number of delimiters in com_industry; the number of fields to split into is that maximum plus 1. For this table the maximum is 3, so the field can be split into 4 industry fields, one per industry level.
select max(length(com_industry) - length(replace(com_industry, '-', ''))) as _max_count from etl1_socom_data;
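The "maximum plus 1" rule can be sketched in Python as follows. The sample values are hypothetical placeholders, not rows from etl1_socom_data, and delimiter_count is an illustrative helper name.

```python
def delimiter_count(value: str, delimiter: str = "-") -> int:
    """Count delimiters the same way the SQL does: length before minus length after removal.

    This equals the number of occurrences only for a single-character delimiter.
    """
    return len(value) - len(value.replace(delimiter, ""))


# Hypothetical sample values standing in for the com_industry column.
samples = ["a-b", "a-b-c-d", "a", ""]

max_count = max(delimiter_count(v) for v in samples)
fields_needed = max_count + 1  # n delimiters separate n + 1 parts

print(max_count, fields_needed)
```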
1.4. Using the substring_index() string extraction function with a variable
set @address = '中国-山东省-聊城市-莘县';
select substring_index(@address, '-', 1) as country,
  substring_index(substring_index(@address, '-', 2), '-', -1) as province,
  substring_index(substring_index(@address, '-', 3), '-', -1) as city,
  substring_index(@address, '-', -1) as district;
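To make the nesting easier to follow, here is a small Python emulation of substring_index(). It is a sketch of the documented behavior (a positive count keeps everything left of the count-th delimiter, a negative count keeps everything right of it, counting from the end), assumes a non-empty delimiter, and is not MySQL's own implementation.

```python
def substring_index(s: str, delim: str, count: int) -> str:
    """Emulate MySQL substring_index() for a non-empty delimiter (assumption)."""
    if count == 0:
        return ""
    parts = s.split(delim)
    if count > 0:
        # Keep the first `count` parts; if there are fewer, the whole string comes back.
        return delim.join(parts[:count])
    # Negative count: keep the last `-count` parts.
    return delim.join(parts[count:])


address = "中国-山东省-聊城市-莘县"
country = substring_index(address, "-", 1)
province = substring_index(substring_index(address, "-", 2), "-", -1)
city = substring_index(substring_index(address, "-", 3), "-", -1)
district = substring_index(address, "-", -1)
print(country, province, city, district)
```

The inner call trims the string down to the first n parts, and the outer call with -1 then picks the last of those parts, which is the n-th part of the original value.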
1.5. The case conditional function
case when <condition> then <value> else <value> end as <field_name>
select case when 89 > 101 then 'greater than' else 'less than' end as betl1_socom_data;
Second, Kettle transformation for ETL1 cleaning
First create the table; the steps are shown in the video.
Field indexes are not covered in the video; for the index algorithm, use BTREE to improve query efficiency.
2.1. Kettle file name: Trans_etl1_socom_data
2.2. Steps used: table input >>> table output
2.3. Data flow: s_socom_data >>> etl1_socom_data
Kettle transformation 1
2.4. Table input: SQL script for preliminary cleaning of the com_district and com_industry fields
# The string literals below are English renderings of the Chinese text in the source data;
# in a real run they must match the text stored in s_socom_data exactly.
select a.*,
  # A district value that is really an industry name is set to null ...
  case when com_district like '%industry' or com_district like '%weaving' or com_district like '%bred'
    then null else com_district end as com_district1,
  # ... and is prepended to com_industry instead.
  case when com_district like '%industry' or com_district like '%weaving' or com_district like '%bred'
    then concat(com_district, '-', com_industry) else com_industry end as com_industry_total,
  # Strip the label prefix from each remaining field.
  replace(com_addr, 'Address:', '') as com_addr1,
  replace(com_phone, 'Phone:', '') as com_phone1,
  replace(com_fax, 'Fax:', '') as com_fax1,
  replace(com_mobile, 'Mobile:', '') as com_mobile1,
  replace(com_url, 'URL:', '') as com_url1,
  replace(com_email, 'Email:', '') as com_email1,
  replace(com_contactor, 'Contact:', '') as com_contactor1,
  replace(com_emploies_nums, 'Number of employees:', '') as com_emploies_nums1,
  replace(com_reg_capital, 'Registered capital: Million', '') as com_reg_capital1,
  replace(com_type, 'Economic type:', '') as com_type1,
  replace(com_product, 'Company product:', '') as com_product1,
  replace(com_desc, 'Company profile:', '') as com_desc1
from s_socom_data as a
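The two rules in this script (move a misplaced industry name out of com_district, and strip label prefixes) can be sketched in Python for a single row. The function names, the suffix list, and the sample values are all hypothetical; the real suffixes and labels are whatever text appears in s_socom_data.

```python
from typing import Optional, Sequence, Tuple


def clean_district(
    district: Optional[str], industry: Optional[str], suffixes: Sequence[str]
) -> Tuple[Optional[str], Optional[str]]:
    """Return (com_district1, com_industry_total) for one row.

    Note: endswith() is case-sensitive, unlike `like` under a case-insensitive collation.
    """
    if district is not None and district.endswith(tuple(suffixes)):
        # concat() yields null when any argument is null, so mirror that here.
        merged = None if industry is None else f"{district}-{industry}"
        return None, merged
    return district, industry


def strip_label(value: Optional[str], label: str) -> Optional[str]:
    """Mirror replace(value, label, ''), keeping null as None."""
    return None if value is None else value.replace(label, "")


# Hypothetical sample row.
print(clean_district("Textile industry", "Cotton weaving", ("industry",)))
print(strip_label("Address:1 Main St", "Address:"))
```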
2.5. Table output
Table output settings
Notes:
① When incremental crawler loads are involved, do not tick the "Truncate table" option.
② Database connection: select the target database in the table output step.
③ Field mapping: make sure the fields in the data stream correspond one-to-one with the fields in the physical table.
Third, Kettle transformation for ETL2 cleaning
First create the table, adding 4 fields; the steps are demonstrated in the video.
Field indexes are not covered in the video; for the index algorithm, use BTREE to improve query efficiency.
The field splitting mainly targets the new com_industry field produced by ETL1.
3.1. Kettle file name: Trans_etl2_socom_data
3.2. Steps used: table input >>> table output
3.3. Data flow: etl1_socom_data >>> etl2_socom_data
Notes:
① When incremental crawler loads are involved, do not tick the "Truncate table" option.
② Database connection: select the target database in the table output step.
③ Field mapping: make sure the fields in the data stream correspond one-to-one with the fields in the physical table.
Kettle transformation 2
3.4. SQL script that completes the cleaning of com_industry into all of its split fields. Because of time constraints the registered capital field was not split carefully; adjust the code as needed.
select a.*,
  case
    # An empty industry value is set to null
    when length(com_industry) = 0 then null
    # Otherwise take the part before the first '-' delimiter
    else substring_index(com_industry, '-', 1)
  end as com_industry1,
  case
    when length(com_industry) - length(replace(com_industry, '-', '')) = 0 then null
    # For a value such as 'Transportation, warehousing and the postal industry-', industry2 is also set to null
    when length(com_industry) - length(replace(com_industry, '-', '')) = 1
      and length(substring_index(com_industry, '-', -1)) = 0 then null
    when length(com_industry) - length(replace(com_industry, '-', '')) = 1
      then substring_index(com_industry, '-', -1)
    else substring_index(substring_index(com_industry, '-', 2), '-', -1)
  end as com_industry2,
  case
    when length(com_industry) - length(replace(com_industry, '-', '')) <= 1 then null
    when length(com_industry) - length(replace(com_industry, '-', '')) = 2
      then substring_index(com_industry, '-', -1)
    else substring_index(substring_index(com_industry, '-', 3), '-', -1)
  end as com_industry3,
  case
    when length(com_industry) - length(replace(com_industry, '-', '')) <= 2 then null
    else substring_index(com_industry, '-', -1)
  end as com_industry4
from etl1_socom_data as a
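The branching in this script is easier to check when written out for a single value. The following Python sketch mirrors the four case expressions as reconstructed above; split_industry is a hypothetical helper, and the sample values are placeholders rather than real com_industry data.

```python
from typing import Optional, Tuple

Levels = Tuple[Optional[str], Optional[str], Optional[str], Optional[str]]


def split_industry(value: Optional[str]) -> Levels:
    """Split a '-'-delimited industry value into up to 4 levels, None where absent."""
    if value is None:
        return (None, None, None, None)
    parts = value.split("-")
    n = len(parts) - 1  # number of '-' delimiters

    # Level 1: part before the first delimiter, or None for an empty value.
    level1 = parts[0] if value != "" else None

    # Level 2: None without a delimiter or with a trailing one ('x-').
    if n == 0:
        level2 = None
    elif n == 1:
        level2 = parts[-1] or None
    else:
        level2 = parts[1]

    # Level 3: needs at least two delimiters.
    if n <= 1:
        level3 = None
    elif n == 2:
        level3 = parts[-1]
    else:
        level3 = parts[2]

    # Level 4: needs at least three delimiters; takes the last part.
    level4 = None if n <= 2 else parts[-1]
    return (level1, level2, level3, level4)


print(split_industry("a-b-c-d"))
print(split_industry("a-"))
```

As in the SQL, a value with more than three delimiters keeps its last part as level 4, so anything between the third and the last part is dropped.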
Fourth, quality check of the cleaning results
4.1. Check that the crawler source data matches the website data
If you handle both the crawling and the data processing yourself, you will already have checked this while crawling, so this step can be skipped. If you receive data from an upstream crawler colleague, do this check first; otherwise the cleaning effort may be wasted. It is generally worth requiring the crawler colleague to store the request URL so that data quality can be verified.
4.2. Count the rows in the crawler source table and in each ETL cleaning table
Note: the SQL scripts do no aggregation or filtering, so the row counts of the 3 tables should be equal.
4.2.1. SQL query. The tables below are all in the same database; if yours are not, prefix each table name after from with its database name.
This approach is not recommended when the data volume is large.
select count(1) from s_socom_data
union all
select count(1) from etl1_socom_data
union all
select count(1) from etl2_socom_data;
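The same comparison can be automated. The sketch below uses Python's built-in sqlite3 module with an in-memory database purely as a stand-in for the MySQL database, and fills the three tables with hypothetical rows; only the table names come from this article.

```python
import sqlite3

TABLES = ("s_socom_data", "etl1_socom_data", "etl2_socom_data")


def row_counts(conn: sqlite3.Connection, tables=TABLES) -> dict:
    """Return {table: row count}. Table names come from the fixed tuple above."""
    counts = {}
    for table in tables:
        counts[table] = conn.execute(f"select count(1) from {table}").fetchone()[0]
    return counts


# Stand-in data: an in-memory SQLite database with 3 hypothetical rows per table.
conn = sqlite3.connect(":memory:")
for table in TABLES:
    conn.execute(f"create table {table} (id integer)")
    conn.executemany(f"insert into {table} values (?)", [(1,), (2,), (3,)])

counts = row_counts(conn)
all_equal = len(set(counts.values())) == 1  # True when every table has the same count
print(counts, all_equal)
```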
4.2.2. Compare the totals reported by the table output steps after running the Kettle transformations
Kettle table output: total row count
4.3. Review the ETL cleaning quality
Once the first two steps check out, whoever does the data processing should self-check the ETL cleaning. Start from the fields that were cleaned and write a script against the data to check them. For the socom website, the cleaning mainly affected the region and industry fields; the other fields only had their extra label text removed with replace. So check with a script, then take the page_url and verify the row against the website data.
Write the where clause so that the cleaning of a particular field is easy to inspect:
select * from etl2_socom_data where com_district is null and length(com_industry) - length(replace(com_industry, '-', '')) = 3;
http://www.socom.cn/company/7320798.html
Compare the data on this page with the final cleaned data in the etl2_socom_data table:
Website page data
etl2_socom_data table data
Cleaning work completed.