Data Processing Based on Python Crawlers (Detailed Description)


I. First, understand the following functions:

Setting variables, the length() function, the char_length() function, the replace() function, and the max() function.

1.1 Setting a variable: SET @variable_name = value

SET @address = 'China-Shandong Province-Liaocheng City-Xinxian County';
SELECT @address;

1.2 The difference between the length() and char_length() functions

SELECT length('A'), char_length('A'), length('中'), char_length('中');
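length() counts bytes while char_length() counts characters, so for 'A' both return 1, while for the multi-byte character '中' char_length() returns 1 but length() returns 3 under the utf8 character set (2 under gbk).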

1.3 Combining the replace() and length() functions

SET @address = 'China-Shandong Province-Liaocheng City-Xinxian County';
SELECT @address,
       replace(@address, '-', '') AS address_1,
       length(@address) AS len_add1,
       length(replace(@address, '-', '')) AS len_add2,
       length(@address) - length(replace(@address, '-', '')) AS _count;
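For this address, _count evaluates to 3: stripping the three '-' delimiters shortens the string by three bytes, so the length difference equals the number of delimiters.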

During ETL field cleansing, how do you decide how many split fields to add to the new data table when a field contains an obvious delimiter?

Calculate the maximum number of delimiters in com_industry; the number of fields to split into is that maximum plus 1. For this table the maximum is 3, so four industry fields can be split out, i.e. four industry levels.

SELECT max(length(com_industry) - length(replace(com_industry, '-', ''))) AS _max_count
FROM etl1_socom_data;

1.4 Usage of the substring_index() string-truncation function

SET @address = 'China-Shandong Province-Liaocheng City-Xinxian County';
SELECT substring_index(@address, '-', 1) AS china,
       substring_index(substring_index(@address, '-', 2), '-', -1) AS province,
       substring_index(substring_index(@address, '-', 3), '-', -1) AS city,
       substring_index(@address, '-', -1) AS district;
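substring_index(str, delim, n) returns everything before the nth delimiter, or everything after it (counting from the right) when n is negative; nesting a positive call inside one with n = -1 isolates a single segment. The query above therefore returns 'China', 'Shandong Province', 'Liaocheng City', and 'Xinxian County'.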

1.5 The conditional function CASE WHEN

Syntax: CASE WHEN condition THEN value ELSE value END AS field_name

SELECT CASE WHEN 89 > 101 THEN 'greater than' ELSE 'less than' END AS b
FROM etl1_socom_data;
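Since 89 is not greater than 101, this query returns 'less than' for every row of etl1_socom_data.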

II. Kettle transformation: etl1 cleaning

First, create a table; the demonstration steps are shown in the video.

We recommend a BTREE index to improve query efficiency.
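The table creation itself is only shown in the video. As a rough sketch (the column names are taken from the script in section 2.4; all types, the id column, and the choice of indexed column are assumptions), the table might be created like this:

CREATE TABLE etl1_socom_data (
    id INT UNSIGNED NOT NULL AUTO_INCREMENT,
    page_url VARCHAR(255),      -- requested URL, kept for later quality checks
    com_district VARCHAR(100),  -- cleansed district (com_district1 in the script)
    com_industry VARCHAR(255),  -- merged industry string (com_industry_total in the script)
    com_addr VARCHAR(255),      -- cleansed address; remaining columns omitted for brevity
    PRIMARY KEY (id),
    KEY idx_com_industry (com_industry) USING BTREE  -- BTREE index, as recommended above
) ENGINE=InnoDB DEFAULT CHARSET=utf8;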

2.1 Kettle file name: trans_etl1_socom_data

2.2 Controls used: Table input >>> Table output

2.3 Data stream direction: s_socom_data >>> etl1_socom_data

(Figure: Kettle transformation 1)

2.4 Table input: SQL script for preliminary cleansing of the com_district and com_industry fields

SELECT a.*,
       CASE WHEN com_district LIKE '%ye' OR com_district LIKE '%zhi' OR com_district LIKE '%yu'
            THEN NULL ELSE com_district END AS com_district1,
       CASE WHEN com_district LIKE '%ye' OR com_district LIKE '%zhi' OR com_district LIKE '%yu'
            THEN concat(com_district, '-', com_industry) ELSE com_industry END AS com_industry_total,
       replace(com_addr, 'Address:', '') AS com_addr1,
       replace(com_phone, 'Tel:', '') AS com_phone1,
       replace(com_fax, 'Fax:', '') AS com_fax1,
       replace(com_mobile, 'Mobile:', '') AS com_mobile1,
       replace(com_url, 'URL:', '') AS com_url1,
       replace(com_email, 'Email:', '') AS com_email1,
       replace(com_contactor, 'Contact:', '') AS com_contactor1,
       replace(com_emploies_nums, 'Employee count:', '') AS com_emploies_nums1,
       replace(com_reg_capital, 'Registered capital: wan', '') AS com_reg_capital1,
       replace(com_type, 'Economic type:', '') AS com_type1,
       replace(com_product, 'Company products:', '') AS com_product1,
       replace(com_desc, 'Company profile:', '') AS com_desc1
FROM s_socom_data AS a;

(The '%ye', '%zhi', '%yu' patterns and the label strings inside replace() are transliterations and translations of the Chinese text in the original post; substitute the actual strings from your source data.)

2.5 Table output

Considerations for the Table output settings

Note:

① For incremental crawler runs, do not select the Truncate table option; otherwise each run would wipe the previously loaded rows.

② Database connection: in the Table output step, select the database where the target table is located.

③ Field mapping: make sure the number of fields in the data stream matches the physical table.

III. Kettle transformation: etl2 cleaning

First, four fields are added to the table; the demonstration steps are shown in the video.

Again, a BTREE index is recommended to improve query efficiency.

This step splits and cleanses the new com_industry field generated by etl1.
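The column additions are only shown in the video. As a minimal sketch (assuming the output table is etl2_socom_data and that VARCHAR(100) is wide enough for an industry level), the four columns could be added like this:

ALTER TABLE etl2_socom_data
    ADD COLUMN com_industry1 VARCHAR(100),  -- industry level 1
    ADD COLUMN com_industry2 VARCHAR(100),  -- industry level 2
    ADD COLUMN com_industry3 VARCHAR(100),  -- industry level 3
    ADD COLUMN com_industry4 VARCHAR(100);  -- industry level 4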

3.1 Kettle file name: trans_etl2_socom_data

3.2 Controls used: Table input >>> Table output

3.3 Data stream direction: etl1_socom_data >>> etl2_socom_data

Note: the same three considerations from section 2.5 apply here (do not select Truncate table for incremental runs, choose the correct database connection, and keep the field mapping consistent with the physical table).

(Figure: Kettle transformation 2)

3.4 Table input: the SQL script splits com_industry and cleanses all fields. For lack of time, cleansing of the registered-capital field is not covered in detail; adjust the code as needed.

SELECT a.*,
       # An empty industry value is set to NULL; otherwise take the part before the first delimiter
       CASE WHEN length(com_industry) = 0 THEN NULL
            ELSE substring_index(com_industry, '-', 1)
       END AS com_industry1,
       CASE WHEN length(com_industry) - length(replace(com_industry, '-', '')) = 0 THEN NULL
            # Values such as 'Transportation, warehousing, and postal services-' set industry 2 to NULL as well
            WHEN length(com_industry) - length(replace(com_industry, '-', '')) = 1
                 AND length(substring_index(com_industry, '-', -1)) = 0 THEN NULL
            WHEN length(com_industry) - length(replace(com_industry, '-', '')) = 1
                 THEN substring_index(com_industry, '-', -1)
            ELSE substring_index(substring_index(com_industry, '-', 2), '-', -1)
       END AS com_industry2,
       CASE WHEN length(com_industry) - length(replace(com_industry, '-', '')) <= 1 THEN NULL
            WHEN length(com_industry) - length(replace(com_industry, '-', '')) = 2
                 THEN substring_index(com_industry, '-', -1)
            ELSE substring_index(substring_index(com_industry, '-', 3), '-', -1)
       END AS com_industry3,
       CASE WHEN length(com_industry) - length(replace(com_industry, '-', '')) <= 2 THEN NULL
            ELSE substring_index(com_industry, '-', -1)
       END AS com_industry4
FROM etl1_socom_data AS a;

IV. Quality check of the cleaning results

4.1 Check whether the crawler data source is consistent with the website data

If crawling and data processing are done by the same person, this step can be omitted. If the data comes from an upstream crawler, verify this first; otherwise the cleaning work may be wasted. For this reason, crawlers are generally required to store the requested URL so that data quality can be checked against the source pages later.
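As a minimal self-check sketch (assuming the source table stores the URL in a page_url column, the name used in section 4.3), you can verify that every row carries a usable URL:

SELECT count(*) AS total_rows,
       count(page_url) AS rows_with_url,          -- count() skips NULL urls
       count(DISTINCT page_url) AS distinct_urls  -- fewer distinct urls than rows means duplicates
FROM s_socom_data;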

4.2 Compare the row counts of the crawler source table and each etl cleansing table

Note: because the SQL scripts apply no filtering or aggregation, the row counts of the three tables must be equal.

4.2.1 SQL query: the following assumes the tables are in the same database. If not, prefix each table name with its database in the FROM clause.

This query is not recommended when the data volume is large.

SELECT count(1) FROM s_socom_data
UNION ALL
SELECT count(1) FROM etl1_socom_data
UNION ALL
SELECT count(1) FROM etl2_socom_data;

4.2.2 Compare the total output row counts after the Kettle transformations complete

(Figure: total output row counts of the Kettle Table output steps)

4.3 Review the etl cleaning quality

Once the first two checks pass, the person responsible for data processing starts a self-check of the ETL cleaning by writing scripts against the cleansed fields. For the socom website, the cleaning mainly targets the region and industry fields; the other fields only have redundant label text replaced, so a script check is sufficient.

Use page_url to locate the corresponding website page for verification.

Vary the WHERE clause to check the cleaning status of a given field, for example:

SELECT * FROM etl2_socom_data
WHERE com_district IS NULL
  AND length(com_industry) - length(replace(com_industry, '-', '')) = 3;

Compare the data on http://www.socom.cn/company/7320798.html against the final cleaned data in the etl2_socom_data table.

(Figure: website page data)

(Figure: etl2_socom_data table data)

Cleaning is complete.

The above is the detailed description of data processing based on Python crawlers. I hope it provides a useful reference and that you will continue to support this site.
