Python crawler example: crawling data from a city public transit network site

Source: Internet
Author: User


Crawled site: http://beijing.8684.cn/

(1) Environment configuration; the code is as follows:

# -*- coding: utf-8 -*-
import requests                # import requests
from bs4 import BeautifulSoup  # import BeautifulSoup from bs4
import os

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0'}
all_url = 'http://beijing.8684.cn'  # start URL
start_html = requests.get(all_url, headers=headers)
# print(start_html.text)
Soup = BeautifulSoup(start_html.text, 'lxml')  # parse the HTML document with the lxml parser

(2) Analysis of the crawled site

1. There are three types of bus lines in Beijing:

This article crawls the lines that begin with a number. Press F12 to open the developer tools, click "Elements", then click "1". You will find that the links are stored in <div class="bus_kt_r1">, so you only need to extract the href of each <a> inside that div:

Code:

all_a = Soup.find('div', class_='bus_kt_r1').find_all('a')
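To see this extraction pattern work without hitting the site, here is a minimal offline sketch; the HTML fragment below is invented to mirror the structure described above (a <div class="bus_kt_r1"> containing <a> links), and it uses the built-in html.parser so lxml is not required:

```python
from bs4 import BeautifulSoup

# Invented HTML fragment mirroring the structure of the start page
html_doc = '''
<div class="bus_kt_r1">
  <a href="/list1">1</a>
  <a href="/list2">2</a>
  <a href="/list3">3</a>
</div>
'''

soup = BeautifulSoup(html_doc, 'html.parser')  # html.parser avoids the lxml dependency
all_a = soup.find('div', class_='bus_kt_r1').find_all('a')
hrefs = [a['href'] for a in all_a]
print(hrefs)  # ['/list1', '/list2', '/list3']
```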

2. Going one level down, each link sits inside <div id="con_site_1" class="site_list">. The href of each <a> is the line URL, and its text is the line name. Code:

href = a['href']  # get the href attribute of tag a
html = all_url + href
second_html = requests.get(html, headers=headers)
# print(second_html.text)
Soup2 = BeautifulSoup(second_html.text, 'lxml')
# the div with both an id and a class could not be matched directly, so take a detour
all_a2 = Soup2.find('div', class_='cc_content').find_all('div')[-1].find_all('a')
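The comment above notes trouble matching the div that carries both an id and a class. In current BeautifulSoup versions, matching by id alone does work, as this offline sketch shows (the HTML fragment is invented to mirror the page structure):

```python
from bs4 import BeautifulSoup

# Invented fragment shaped like the line-list page described above
html_doc = '<div id="con_site_1" class="site_list"><a href="/x1">Line 1</a><a href="/x2">Line 2</a></div>'

soup = BeautifulSoup(html_doc, 'html.parser')
# match the div by its id attribute directly
links = soup.find('div', id='con_site_1').find_all('a')
names = [a.get_text() for a in links]
print(names)  # ['Line 1', 'Line 2']
```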

3. Open a line link and you will see the specific station information. Analyzing the document structure of that page shows that the basic information of a line is stored in <div class="bus_i_content">, while station information is stored in <div class="bus_line_top"> and <div class="bus_line_site">. Extraction code:

title1 = a2.get_text()      # get the text of tag a
href1 = a2['href']          # get the href attribute of tag a
# print(title1, href1)
html_bus = all_url + href1  # construct the line page URL
thrid_html = requests.get(html_bus, headers=headers)
Soup3 = BeautifulSoup(thrid_html.text, 'lxml')
bus_name = Soup3.find('div', class_='bus_i_t1').find('h1').get_text()         # line name
bus_type = Soup3.find('div', class_='bus_i_t1').find('a').get_text()          # line type
bus_time = Soup3.find_all('p', class_='bus_i_t4')[0].get_text()               # running time
bus_cost = Soup3.find_all('p', class_='bus_i_t4')[1].get_text()               # fare
bus_company = Soup3.find_all('p', class_='bus_i_t4')[2].find('a').get_text()  # bus company
bus_update = Soup3.find_all('p', class_='bus_i_t4')[3].get_text()             # update time
bus_label = Soup3.find('div', class_='bus_label')
if bus_label:
    bus_length = bus_label.get_text()  # line mileage
else:
    bus_length = []
# print(bus_name, bus_type, bus_time, bus_cost, bus_company, bus_update)
all_line = Soup3.find_all('div', class_='bus_line_top')   # line introduction
all_site = Soup3.find_all('div', class_='bus_line_site')  # bus stations
line_x = all_line[0].find('div', class_='bus_line_txt').get_text()[:-9] + all_line[0].find_all('span')[-1].get_text()
sites_x = all_site[0].find_all('a')
sites_x_list = []  # stations on the outbound line
for site_x in sites_x:
    sites_x_list.append(site_x.get_text())
line_num = len(all_line)
if line_num == 2:  # a loop line also returns two lists, but one of them is empty
    line_y = all_line[1].find('div', class_='bus_line_txt').get_text()[:-9] + all_line[1].find_all('span')[-1].get_text()
    sites_y = all_site[1].find_all('a')
    sites_y_list = []  # stations on the return line
    for site_y in sites_y:
        sites_y_list.append(site_y.get_text())
else:
    line_y, sites_y_list = [], []
information = [bus_name, bus_type, bus_time, bus_cost, bus_company, bus_update, bus_length, line_x, sites_x_list, line_y, sites_y_list]
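The information list above is purely positional, so downstream code has to remember which index holds which field. As an optional refinement (not part of the original script), the same fields could be zipped into a dict keyed by name; the sample values here are invented stand-ins for the scraped data:

```python
# Hypothetical sample values standing in for the scraped fields
bus_name, bus_type, bus_time = 'Line 1', 'Urban route', '5:00-23:00'

# Pair each field name with its value so lookups are by key, not index
fields = ['name', 'type', 'time']
information = dict(zip(fields, [bus_name, bus_type, bus_time]))
print(information['name'])  # Line 1
```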

At this point, we have parsed the information of one line together with its outbound and return stations. To crawl all the bus lines of the city, you only need to wrap this logic in a loop.

Complete code:

# -*- coding: utf-8 -*-
# Python 3.5
import requests                # import requests
from bs4 import BeautifulSoup  # import BeautifulSoup from bs4
import os

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0'}
all_url = 'http://beijing.8684.cn'  # start URL
start_html = requests.get(all_url, headers=headers)
# print(start_html.text)
Soup = BeautifulSoup(start_html.text, 'lxml')
all_a = Soup.find('div', class_='bus_kt_r1').find_all('a')
Network_list = []
for a in all_a:
    href = a['href']  # get the href attribute of tag a
    html = all_url + href
    second_html = requests.get(html, headers=headers)
    # print(second_html.text)
    Soup2 = BeautifulSoup(second_html.text, 'lxml')
    # the div with both an id and a class could not be matched directly, so take a detour
    all_a2 = Soup2.find('div', class_='cc_content').find_all('div')[-1].find_all('a')
    for a2 in all_a2:
        title1 = a2.get_text()  # get the text of tag a
        href1 = a2['href']      # get the href attribute of tag a
        # print(title1, href1)
        html_bus = all_url + href1
        thrid_html = requests.get(html_bus, headers=headers)
        Soup3 = BeautifulSoup(thrid_html.text, 'lxml')
        bus_name = Soup3.find('div', class_='bus_i_t1').find('h1').get_text()
        bus_type = Soup3.find('div', class_='bus_i_t1').find('a').get_text()
        bus_time = Soup3.find_all('p', class_='bus_i_t4')[0].get_text()
        bus_cost = Soup3.find_all('p', class_='bus_i_t4')[1].get_text()
        bus_company = Soup3.find_all('p', class_='bus_i_t4')[2].find('a').get_text()
        bus_update = Soup3.find_all('p', class_='bus_i_t4')[3].get_text()
        bus_label = Soup3.find('div', class_='bus_label')
        if bus_label:
            bus_length = bus_label.get_text()
        else:
            bus_length = []
        # print(bus_name, bus_type, bus_time, bus_cost, bus_company, bus_update)
        all_line = Soup3.find_all('div', class_='bus_line_top')
        all_site = Soup3.find_all('div', class_='bus_line_site')
        line_x = all_line[0].find('div', class_='bus_line_txt').get_text()[:-9] + all_line[0].find_all('span')[-1].get_text()
        sites_x = all_site[0].find_all('a')
        sites_x_list = []
        for site_x in sites_x:
            sites_x_list.append(site_x.get_text())
        line_num = len(all_line)
        if line_num == 2:  # a loop line also returns two lists, but one of them is empty
            line_y = all_line[1].find('div', class_='bus_line_txt').get_text()[:-9] + all_line[1].find_all('span')[-1].get_text()
            sites_y = all_site[1].find_all('a')
            sites_y_list = []
            for site_y in sites_y:
                sites_y_list.append(site_y.get_text())
        else:
            line_y, sites_y_list = [], []
        information = [bus_name, bus_type, bus_time, bus_cost, bus_company, bus_update, bus_length, line_x, sites_x_list, line_y, sites_y_list]
        Network_list.append(information)

# define the save function and write the result to a txt file
def text_save(content, filename, mode='a'):
    # try to save a list variable to a txt file
    file = open(filename, mode)
    for i in range(len(content)):
        file.write(str(content[i]) + '\n')
    file.close()

# output the processed data
text_save(Network_list, 'Network_bus.txt')

Finally, the public transit network information of the whole city is written out. This time it is saved to a txt file; you could also save it to a database such as MySQL or MongoDB. I will not cover that here; if you are interested, try it yourself. The result after running the program is attached below:
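The paragraph above suggests MySQL or MongoDB as a destination. As a minimal sketch of the database route, here is the same idea with Python's built-in sqlite3; the table name, column names, and sample row are invented for illustration, and an in-memory database is used so the sketch runs anywhere:

```python
import sqlite3

# Hypothetical rows shaped like the `information` lists built by the crawler
Network_list = [
    ('Line 1', 'Urban route', '5:00-23:00', '2 yuan', 'Bus Group', '2017-01-01'),
]

conn = sqlite3.connect(':memory:')  # use a file path instead to persist the data
conn.execute('''CREATE TABLE bus_lines
                (name TEXT, type TEXT, time TEXT, cost TEXT, company TEXT, updated TEXT)''')
conn.executemany('INSERT INTO bus_lines VALUES (?, ?, ?, ?, ?, ?)', Network_list)
conn.commit()
count = conn.execute('SELECT COUNT(*) FROM bus_lines').fetchone()[0]
print(count)  # 1
```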

That is the complete Python crawler example for crawling data from a city public transit network site. I hope it serves as a useful reference.
