Python crawler example: crawling stop data for a city's public transit network
Crawled site: http://beijing.8684.cn/
(1) Environment setup. Here is the code:
# -*- coding: utf-8 -*-
import requests  # HTTP client
from bs4 import BeautifulSoup  # HTML parser
import os

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0'}
all_url = 'http://beijing.8684.cn'  # start URL
start_html = requests.get(all_url, headers=headers)
# print(start_html.text)
Soup = BeautifulSoup(start_html.text, 'lxml')  # parse the HTML document with the lxml parser
(2) Analyzing the site to crawl
1. There are three types of bus lines in Beijing:
This article crawls the lines whose names begin with a digit. Press F12 to open the developer tools, select the "Elements" tab, and click the "1" link on the page. You will find that the links are stored in <div class="bus_kt_r1">
So you only need to extract the href of each <a> tag in that div:
Code:
all_a = Soup.find('div', class_='bus_kt_r1').find_all('a')
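To make the selection step concrete without touching the live site, here is a rough standard-library sketch of the same idea: collect every <a> inside the div whose class is bus_kt_r1. It uses html.parser rather than BeautifulSoup, and the sample markup and the /list1, /list2 paths are made up for illustration; the real page structure may differ.

```python
from html.parser import HTMLParser


class DivLinkCollector(HTMLParser):
    """Collect [href, text] for every <a> inside <div class=target_class>."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.div_depth = 0    # 0 means we are outside the target div
        self.in_link = False  # True while inside an <a> within the target div
        self.links = []       # list of [href, text]

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'div':
            classes = (attrs.get('class') or '').split()
            if self.div_depth > 0:
                self.div_depth += 1   # nested div inside the target div
            elif self.target_class in classes:
                self.div_depth = 1    # entered the target div
        elif tag == 'a' and self.div_depth > 0:
            self.in_link = True
            self.links.append([attrs.get('href'), ''])

    def handle_endtag(self, tag):
        if tag == 'div' and self.div_depth > 0:
            self.div_depth -= 1
        elif tag == 'a':
            self.in_link = False

    def handle_data(self, data):
        if self.in_link:
            self.links[-1][1] += data.strip()


# Made-up sample markup (assumption); the real page will differ.
sample = '''
<div class="bus_kt_r1"><a href="/list1">1</a><a href="/list2">2</a></div>
<div class="other"><a href="/ignored">x</a></div>
'''
collector = DivLinkCollector('bus_kt_r1')
collector.feed(sample)
print(collector.links)
```

Links outside the target div are ignored, which is the same effect as chaining find and find_all in the BeautifulSoup line above.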
2. Follow one of these links. On the next page, every line link is an <a> tag inside <div id="con_site_1" class="site_list">
The href is the line URL and the tag text is the line name. Code:
href = a['href']  # take the href attribute of the <a> tag
html = all_url + href
second_html = requests.get(html, headers=headers)
# print(second_html.text)
Soup2 = BeautifulSoup(second_html.text, 'lxml')
# For some reason the div that has both an id and a class could not be selected directly,
# so take the last div inside cc_content instead.
all_a2 = Soup2.find('div', class_='cc_content').find_all('div')[-1].find_all('a')
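The code above builds each page URL by string concatenation, which only works if the base has no trailing slash and every href starts with one. As a hedged alternative, urllib.parse.urljoin from the standard library resolves both cases; the helper name build_url and the /list1 path below are illustrative assumptions, not taken from the site.

```python
from urllib.parse import urljoin


def build_url(base, href):
    """Resolve href against base, whether or not base ends with '/'."""
    return urljoin(base, href)


# '/list1' is a made-up path used only for illustration.
print(build_url('http://beijing.8684.cn', '/list1'))
print(build_url('http://beijing.8684.cn/', '/list1'))
```

Both calls resolve to the same URL, whereas plain concatenation with the second base would produce a double slash.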
3. Open a line link and you will see the detailed stop information. After analyzing the document structure of that page, you will find that the basic information about the line is stored in <div class="bus_i_content">
while the bus stop information is stored in <div class="bus_line_top"> and <div class="bus_line_site">
Extraction code:
title1 = a2.get_text()  # take the text of the <a> tag
href1 = a2['href']  # take the href attribute of the <a> tag
# print(title1, href1)
html_bus = all_url + href1  # build the URL of the line page
thrid_html = requests.get(html_bus, headers=headers)
Soup3 = BeautifulSoup(thrid_html.text, 'lxml')
bus_name = Soup3.find('div', class_='bus_i_t1').find('h1').get_text()  # line name
bus_type = Soup3.find('div', class_='bus_i_t1').find('a').get_text()  # line type
bus_time = Soup3.find_all('p', class_='bus_i_t4')[0].get_text()  # operating hours
bus_cost = Soup3.find_all('p', class_='bus_i_t4')[1].get_text()  # fare
bus_company = Soup3.find_all('p', class_='bus_i_t4')[2].find('a').get_text()  # bus company
bus_update = Soup3.find_all('p', class_='bus_i_t4')[3].get_text()  # last update time
bus_label = Soup3.find('div', class_='bus_label')
if bus_label:
    bus_length = bus_label.get_text()  # line mileage
else:
    bus_length = []
# print(bus_name, bus_type, bus_time, bus_cost, bus_company, bus_update)
all_line = Soup3.find_all('div', class_='bus_line_top')  # line description
all_site = Soup3.find_all('div', class_='bus_line_site')  # bus stops
line_x = all_line[0].find('div', class_='bus_line_txt').get_text()[:-9] + all_line[0].find_all('span')[-1].get_text()
sites_x = all_site[0].find_all('a')
sites_x_list = []  # stops in the outbound direction
for site_x in sites_x:
    sites_x_list.append(site_x.get_text())
line_num = len(all_line)
if line_num == 2:  # a loop line also returns two lists, but one of them is empty
    line_y = all_line[1].find('div', class_='bus_line_txt').get_text()[:-9] + all_line[1].find_all('span')[-1].get_text()
    sites_y = all_site[1].find_all('a')
    sites_y_list = []  # stops in the return direction
    for site_y in sites_y:
        sites_y_list.append(site_y.get_text())
else:
    line_y, sites_y_list = [], []
information = [bus_name, bus_type, bus_time, bus_cost, bus_company, bus_update, bus_length, line_x, sites_x_list, line_y, sites_y_list]
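The information list holds eleven values whose meaning depends on position. One possible way to make each record self-describing is a small dataclass, sketched below. The class name BusLine and its field names are my own assumptions that mirror the variables above, the sample values are placeholders, and dataclasses require Python 3.7 or later.

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class BusLine:
    """One crawled line; fields mirror the variables in the extraction code."""
    name: str
    line_type: str
    run_time: str
    fare: str
    company: str
    updated: str
    length: Optional[str] = None                             # None when the page has no mileage
    outbound_desc: str = ''
    outbound_stops: List[str] = field(default_factory=list)
    return_desc: Optional[str] = None                        # None when only one direction exists
    return_stops: List[str] = field(default_factory=list)


# Placeholder values only, not real crawled data.
line = BusLine(name='Line A', line_type='city', run_time='05:00-23:00',
               fare='2', company='Example Co.', updated='2017-01-01',
               outbound_desc='A to B', outbound_stops=['A', 'B'])
print(asdict(line)['outbound_stops'])
```

asdict turns a record into a plain dictionary, which is convenient for writing JSON or inserting into a database later.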
At this point we have parsed the information for one line together with its outbound and return stops. To crawl the stops of the whole city's bus network, you only need to wrap these steps in loops.
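As a rough outline of that loop structure, the sketch below separates the two-level iteration from the fetching and parsing. The function crawl_lines and its two callbacks are hypothetical names introduced here; the pause between requests and the per-line error handling are my own additions, not part of the original script.

```python
import time


def crawl_lines(index_urls, list_lines, parse_line, delay=0.0):
    """Two-level loop: for each index page, parse every line it links to.

    list_lines(index_url) -> iterable of line URLs  (hypothetical callback)
    parse_line(line_url)  -> one record             (hypothetical callback)
    """
    network, failed = [], []
    for index_url in index_urls:
        for line_url in list_lines(index_url):
            try:
                network.append(parse_line(line_url))
            except Exception as exc:
                # One broken page should not abort the whole crawl.
                failed.append((line_url, repr(exc)))
            time.sleep(delay)  # pause between requests to go easy on the site
    return network, failed
```

In a real run, list_lines and parse_line would wrap the requests and BeautifulSoup steps shown earlier, and delay might be set to a second or so.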
Complete code:
# -*- coding: utf-8 -*-
# Python 3.5
import requests  # HTTP client
from bs4 import BeautifulSoup  # HTML parser
import os

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0'}
all_url = 'http://beijing.8684.cn'  # start URL
start_html = requests.get(all_url, headers=headers)
# print(start_html.text)
Soup = BeautifulSoup(start_html.text, 'lxml')
all_a = Soup.find('div', class_='bus_kt_r1').find_all('a')
Network_list = []
for a in all_a:
    href = a['href']  # take the href attribute of the <a> tag
    html = all_url + href
    second_html = requests.get(html, headers=headers)
    # print(second_html.text)
    Soup2 = BeautifulSoup(second_html.text, 'lxml')
    # For some reason the div that has both an id and a class could not be selected directly,
    # so take the last div inside cc_content instead.
    all_a2 = Soup2.find('div', class_='cc_content').find_all('div')[-1].find_all('a')
    for a2 in all_a2:
        title1 = a2.get_text()  # take the text of the <a> tag
        href1 = a2['href']  # take the href attribute of the <a> tag
        # print(title1, href1)
        html_bus = all_url + href1
        thrid_html = requests.get(html_bus, headers=headers)
        Soup3 = BeautifulSoup(thrid_html.text, 'lxml')
        bus_name = Soup3.find('div', class_='bus_i_t1').find('h1').get_text()
        bus_type = Soup3.find('div', class_='bus_i_t1').find('a').get_text()
        bus_time = Soup3.find_all('p', class_='bus_i_t4')[0].get_text()
        bus_cost = Soup3.find_all('p', class_='bus_i_t4')[1].get_text()
        bus_company = Soup3.find_all('p', class_='bus_i_t4')[2].find('a').get_text()
        bus_update = Soup3.find_all('p', class_='bus_i_t4')[3].get_text()
        bus_label = Soup3.find('div', class_='bus_label')
        if bus_label:
            bus_length = bus_label.get_text()
        else:
            bus_length = []
        # print(bus_name, bus_type, bus_time, bus_cost, bus_company, bus_update)
        all_line = Soup3.find_all('div', class_='bus_line_top')
        all_site = Soup3.find_all('div', class_='bus_line_site')
        line_x = all_line[0].find('div', class_='bus_line_txt').get_text()[:-9] + all_line[0].find_all('span')[-1].get_text()
        sites_x = all_site[0].find_all('a')
        sites_x_list = []
        for site_x in sites_x:
            sites_x_list.append(site_x.get_text())
        line_num = len(all_line)
        if line_num == 2:  # a loop line also returns two lists, but one of them is empty
            line_y = all_line[1].find('div', class_='bus_line_txt').get_text()[:-9] + all_line[1].find_all('span')[-1].get_text()
            sites_y = all_site[1].find_all('a')
            sites_y_list = []
            for site_y in sites_y:
                sites_y_list.append(site_y.get_text())
        else:
            line_y, sites_y_list = [], []
        information = [bus_name, bus_type, bus_time, bus_cost, bus_company, bus_update, bus_length, line_x, sites_x_list, line_y, sites_y_list]
        Network_list.append(information)


# Define a save function that writes the results to a txt file
def text_save(content, filename, mode='a'):
    # Write each item of the list on its own line; utf-8 keeps the Chinese stop names intact.
    with open(filename, mode, encoding='utf-8') as file:
        for item in content:
            file.write(str(item) + '\n')


# Output the processed data
text_save(Network_list, 'Network_bus.txt')
Finally, the stop information for the whole city's public transit network is written out. Here it is saved to a txt file, but it could also be saved to a database such as MySQL or MongoDB. I will not cover that here; try it if you are interested.
[Figure: output after the program is run]
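For readers who want to try the database route, here is a minimal sketch that uses SQLite from the standard library in place of MySQL or MongoDB, so nothing needs to be installed. The table name bus_line, the column names, and the function save_lines are all assumptions of mine. List-valued fields such as the stop lists are stored as JSON text.

```python
import json
import sqlite3

# Hypothetical column names, in the same order as the `information` list.
COLUMNS = ['name', 'line_type', 'run_time', 'fare', 'company', 'updated',
           'length', 'outbound_desc', 'outbound_stops', 'return_desc', 'return_stops']


def save_lines(conn, network_list):
    """Insert every 11-item record into the bus_line table; return the row count."""
    conn.execute('CREATE TABLE IF NOT EXISTS bus_line (%s)'
                 % ', '.join(col + ' TEXT' for col in COLUMNS))
    rows = [
        # Lists (stop names, or the empty-list placeholders) become JSON strings.
        [json.dumps(v, ensure_ascii=False) if isinstance(v, list) else v for v in info]
        for info in network_list
    ]
    with conn:  # commits on success, rolls back on error
        conn.executemany('INSERT INTO bus_line VALUES (%s)'
                         % ', '.join('?' * len(COLUMNS)), rows)
    return len(rows)


# Placeholder record, not real crawled data.
demo = [['Line A', 'city', '05:00-23:00', '2', 'Example Co.', '2017-01-01',
         [], 'A to B', ['A', 'B'], [], []]]
conn = sqlite3.connect(':memory:')  # use a file path instead to keep the data
print(save_lines(conn, demo))
```

With the full script, save_lines(conn, Network_list) would take the place of the text_save call. The stop lists can be read back with json.loads.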