Python implements parallel crawling of 400,000 house price records for an entire site (can be restricted to a single city)
Preface
This crawler collects house price information as an exercise in whole-site crawling and in processing datasets of 100,000+ records.
The most direct effect of a larger data volume is stricter requirements on the program logic, and the data structures must be chosen carefully to suit Python's characteristics. A redundant piece of logic, an unnecessarily frequent I/O request, or one extra level of loop nesting may only cost 1~2 seconds on a small dataset, but as the data size grows that 1~2 second difference can stretch into 1~2 hours.
Therefore, when crawling a site with a large amount of data, you can reduce the time cost from two angles:

1) Optimize the program logic and choose appropriate data structures, following Pythonic habits. For example, merging strings with join() uses less memory than repeated "+" concatenation (see the sketch after this list).

2) Match the execution model to the workload: use multithreading for I/O-bound work and multiprocessing for CPU-bound work, so that tasks run in parallel and execution efficiency improves.
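As a quick, hedged illustration of point 1), the sketch below (not from the original article; the list of 100,000 single-character strings is an arbitrary choice) times the two string-building approaches with the standard timeit module. Results vary across CPython versions because of an in-place concatenation optimization, but join() avoids building intermediate strings, which is the memory point the article makes.

import timeit

def concat_plus(parts):
    s = ''
    for p in parts:        # every '+' may create a brand-new string object
        s = s + p
    return s

def concat_join(parts):
    return ''.join(parts)  # single pass, one final allocation

parts = ['x'] * 100000
print('+    :', timeit.timeit(lambda: concat_plus(parts), number=10))
print('join :', timeit.timeit(lambda: concat_join(parts), number=10))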
1. Obtain the index
Encapsulate the request and set the timeout.
# Obtain a list page
def get_page(url):
    headers = {
        'User-Agent': r'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                      r'Chrome/45.0.2454.85 Safari/537.36 115Browser/6.0.3',
        'Referer': r'http://bj.fangjia.com/ershoufang/',
        'Host': r'bj.fangjia.com',
        'Connection': 'keep-alive'
    }
    timeout = 60
    socket.setdefaulttimeout(timeout)  # set the global socket timeout
    req = request.Request(url, headers=headers)
    response = request.urlopen(req).read()
    page = response.decode('utf-8')
    return page
Level 1 location: region information
Level 2 location: section information (for each region, obtain its sections and store them in a dict as key-value pairs)
Storing them in a dict makes target lookups fast. -> {'Chaoyang': {'Gongti', 'Anzhen', 'Jianxiangqiao', ...}}
Level 3 location: subway information (to search for listings near each subway line)
Add each location's subway information to the dict. -> {'Chaoyang': {'Gongti': {'Line 5', 'Line 10', 'Line 13'}, 'Anzhen': {...}, 'Jianxiangqiao': {...}, ...}}
Corresponding url: http://bj.fangjia.com/ershoufang/--r-%E6%9C%9D%E9%98%B3%7Cw-5%E5%8F%B7%E7%BA%BF%7Cb-%E6%83%A0%E6%96%B0%E8%A5%BF%E8%A1%97
Decoded url: http://bj.fangjia.com/ershoufang/--r-Chaoyang|w-Line 5|b-Huixin West Street
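To make the lookup structure concrete, here is a minimal sketch of the three-level index; the place names are transliterations of the article's example and the line assignments are purely illustrative.

search_dict = {
    'Chaoyang': {
        'Gongti': {'Line 5', 'Line 10', 'Line 13'},
        'Anzhen': {'Line 5'},
        'Jianxiangqiao': {'Line 10'},
    }
}

# Two chained key lookups reach the target directly, with no list scanning.
print(search_dict['Chaoyang']['Gongti'])    # {'Line 5', 'Line 10', 'Line 13'}
print('Anzhen' in search_dict['Chaoyang'])  # True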
Given this url parameter pattern, the target url can be obtained in two ways:

1) Obtain the target url by traversing the index paths
# Retrieve the listing url list (nested dictionary traversal)
def get_info_list(search_dict, layer, tmp_list, search_list):
    layer += 1  # current dictionary depth
    for i in range(len(search_dict)):
        tmp_key = list(search_dict.keys())[i]  # key at the current level
        tmp_list.append(tmp_key)               # add the key to the index being built
        tmp_value = search_dict[tmp_key]
        if isinstance(tmp_value, str):         # the value is a url
            tmp_list.append(tmp_value)         # append the url to the index
            search_list.append(copy.deepcopy(tmp_list))  # record the index plus url
            tmp_list = tmp_list[:layer]        # truncate the index back to this level
        elif tmp_value == '':                  # the value is empty: skip it
            layer -= 2                         # step back out of this level
            tmp_list = tmp_list[:layer]        # truncate the index accordingly
        else:                                  # the value is another dict: recurse
            get_info_list(tmp_value, layer, tmp_list, search_list)
            tmp_list = tmp_list[:layer]
    return search_list
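A hedged usage sketch of the traversal: feeding in a tiny hand-made dict whose leaves are already url strings (the shape the real crawl produces after the three filtering passes) yields one [region, section, line, url] entry per leaf. The names and the url are placeholders, not real site data.

demo_dict = {
    'Chaoyang': {
        'Gongti': {
            'Line 5': 'http://bj.fangjia.com/ershoufang/--r-chaoyang|w-line5|b-gongti',  # placeholder url
        }
    }
}
print(get_info_list(demo_dict, layer=-1, tmp_list=[], search_list=[]))
# expected: [['Chaoyang', 'Gongti', 'Line 5', 'http://bj.fangjia.com/ershoufang/--r-chaoyang|w-line5|b-gongti']]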
2) Assemble the url from the dict information

{'Chaoyang': {'Gongti': {'Line 5'}}}

Parameters:
r-Chaoyang
b-Gongti
w-Line 5

Assembled url: http://bj.fangjia.com/ershoufang/--r-Chaoyang|w-Line 5|b-Gongti
# Build the combined url
def get_compose_url(compose_tmp_url, tag_args, key_args):
    compose_tmp_url_list = [compose_tmp_url,
                            '|' if tag_args != 'r-' else '',
                            tag_args,
                            parse.quote(key_args)]
    compose_url = ''.join(compose_tmp_url_list)
    return compose_url
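A hedged usage sketch of get_compose_url(): chaining one call per filter level reproduces the url pattern shown earlier. Passing base_url + '--' as the starting string is my assumption about how the '--' separator gets in; in the full program the partially built url is simply fed back in as compose_tmp_url at the next level.

url = get_compose_url(base_url + '--', 'r-', '朝阳')   # region level: no '|' prefix
url = get_compose_url(url, 'w-', '5号线')              # subway-line level
url = get_compose_url(url, 'b-', '惠新西街')           # section level
print(url)
# http://bj.fangjia.com/ershoufang/--r-%E6%9C%9D%E9%98%B3|w-5%E5%8F%B7%E7%BA%BF|b-%E6%83%A0%E6%96%B0%E8%A5%BF%E8%A1%97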
2. Obtain the maximum number of pages on the index page
# For each index url, read the page count and assemble one url per page
def get_info_pn_list(search_list):
    fin_search_list = []
    for i in range(len(search_list)):
        print('>>> capturing %s' % search_list[i][:3])
        search_url = search_list[i][3]
        try:
            page = get_page(search_url)
        except:
            print('get page timeout')
            continue
        soup = BS(page, 'lxml')
        # obtain the maximum number of pages
        pn_num = soup.select('span[class="mr5"]')[0].get_text()
        rule = re.compile(r'\d+')
        max_pn = int(rule.findall(pn_num)[1])
        # assemble the per-page urls
        for pn in range(1, max_pn + 1):
            print('************************** crawling page %s **************************' % pn)
            pn_rule = re.compile('[|]')
            fin_url = pn_rule.sub(r'|e-%s|' % pn, search_url, 1)
            tmp_url_list = copy.deepcopy(search_list[i][:3])
            tmp_url_list.append(fin_url)
            fin_search_list.append(tmp_url_list)
    return fin_search_list
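To show what the page-number substitution above does: the first '|' in the index url is replaced by '|e-<page>|', so each page of results gets its own url. The sample url below is a simplified placeholder.

import re

search_url = 'http://bj.fangjia.com/ershoufang/--r-chaoyang|w-line5|b-gongti'  # placeholder
fin_url = re.compile('[|]').sub(r'|e-%s|' % 2, search_url, 1)
print(fin_url)  # http://bj.fangjia.com/ershoufang/--r-chaoyang|e-2|w-line5|b-gongti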
3. Capture the listing information tags

These are the tags we want to capture:

['Region', 'Section', 'Subway', 'Title', 'Location', 'Area', 'Layout', 'Floor', 'Total price', 'Unit price (per sq. m)']
# Obtain the tag information for each listing
def get_info(fin_search_list, process_i):
    print('process %s start' % process_i)
    fin_info_list = []
    for i in range(len(fin_search_list)):
        url = fin_search_list[i][3]
        try:
            page = get_page(url)
        except:
            print('get tag timeout')
            continue
        soup = BS(page, 'lxml')
        title_list = soup.select('a[class="h_name"]')
        address_list = soup.select('span[class="address"]')
        attr_list = soup.select('span[class="attribute"]')
        # select() cannot match some attribute values (those containing spaces); use find_all(attrs={}) instead
        price_list = soup.find_all(attrs={"class": "xq_aprice xq_esf_width"})
        for num in range(20):  # 20 listings per index page
            tag_tmp_list = []
            try:
                title = title_list[num].attrs["title"]
                print('************************** fetching %s **************************' % title)
                address = re.sub('\n', '', address_list[num].get_text())
                area = re.search(r'\d+[\u4E00-\u9FA5]{2}', attr_list[num].get_text()).group(0)
                layout = re.search(r'\d[^0-9]\d.', attr_list[num].get_text()).group(0)
                floor = re.search(r'\d/\d', attr_list[num].get_text()).group(0)
                price = re.search(r'\d+[\u4E00-\u9FA5]', price_list[num].get_text()).group(0)
                unit_price = re.search(r'\d+[\u4E00-\u9FA5]/.', price_list[num].get_text()).group(0)
                tag_tmp_list = copy.deepcopy(fin_search_list[i][:3])
                for tag in [title, address, area, layout, floor, price, unit_price]:
                    tag_tmp_list.append(tag)
                fin_info_list.append(tag_tmp_list)
            except:
                print('[capture failed]')
                continue
    print('process %s finished' % process_i)
    return fin_info_list
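A hedged sketch of what the regular expressions above extract; the two sample strings are invented for illustration and are not the site's actual markup.

import re

attr_text = '89平米 2室1厅 6/12层'   # invented sample of the attribute text
price_text = '300万 33708元/平米'    # invented sample of the price text

print(re.search(r'\d+[\u4E00-\u9FA5]{2}', attr_text).group(0))   # 89平米 (area)
print(re.search(r'\d[^0-9]\d.', attr_text).group(0))             # 2室1厅 (layout)
print(re.search(r'\d/\d', attr_text).group(0))                   # 6/1 (floor: first digits on either side of '/')
print(re.search(r'\d+[\u4E00-\u9FA5]', price_text).group(0))     # 300万 (total price)
print(re.search(r'\d+[\u4E00-\u9FA5]/.', price_text).group(0))   # 33708元/平 (unit price)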
4. Allocate tasks and crawl in parallel

Slice the task list, set up a process pool, and crawl the slices in parallel.
# Assign tasks to processes
def assignment_search_list(fin_search_list, project_num):
    # project_num is the number of tasks per chunk; the smaller it is, the more chunks are dispatched
    assignment_list = []
    fin_search_list_len = len(fin_search_list)
    for i in range(0, fin_search_list_len, project_num):
        start = i
        end = i + project_num
        assignment_list.append(fin_search_list[start:end])  # take a slice of the task list
    return assignment_list
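A quick illustration of the slicing: with project_num = 3, a list of 7 tasks is split into chunks of at most 3.

print(assignment_search_list(list(range(7)), 3))  # [[0, 1, 2], [3, 4, 5], [6]]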
p = Pool(4)  # set up the process pool
assignment_list = assignment_search_list(fin_info_pn_list, 3)  # split the tasks for multiprocessing
result = []  # list of async results
for i in range(len(assignment_list)):
    result.append(p.apply_async(get_info, args=(assignment_list[i], i)))
p.close()
p.join()
for result_i in range(len(result)):
    fin_info_result_list = result[result_i].get()
    fin_save_list.extend(fin_info_result_list)  # merge the lists returned by each process
By crawling in parallel with a process pool, the run time drops to roughly one third of the single-process time, about 3 hours in total.

The machine has 4 cores; testing showed that a chunk size of 3 tasks per process is the most efficient on this computer.
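As a hedged generalisation of that tuning, the pool size and chunk size can be derived from the machine's core count instead of being hard-coded; the helper below is hypothetical and not part of the original script, which settled on 4 processes and chunks of 3 by testing.

import math
from multiprocessing import cpu_count

def pick_chunk_size(num_tasks, workers=None):
    # Hypothetical helper: choose a chunk size so each worker gets roughly one chunk.
    workers = workers or cpu_count()
    return max(1, math.ceil(num_tasks / workers))

print(cpu_count(), pick_chunk_size(12))  # e.g. 4 and 3 on a 4-core machine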
5. Store the captured results in Excel for later data processing and visualization
# Store the capture results
def save_excel(fin_info_list, file_name):
    tag_name = ['Region', 'Section', 'Subway', 'Title', 'Location', 'Area', 'Layout', 'Floor', 'Total price', 'Unit price (per sq. m)']
    book = xlsxwriter.Workbook(r'C:\Users\Administrator\Desktop\%s.xls' % file_name)  # saved to the desktop by default
    tmp = book.add_worksheet()
    row_num = len(fin_info_list)
    for i in range(1, row_num + 2):
        if i == 1:
            tag_pos = 'A%s' % i
            tmp.write_row(tag_pos, tag_name)   # header row
        else:
            con_pos = 'A%s' % i
            content = fin_info_list[i - 2]     # offset by 2 because row 1 holds the header
            tmp.write_row(con_pos, content)
    book.close()
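A one-line usage sketch; 'house_prices' is an arbitrary example name, producing house_prices.xls on the desktop under the path assumed above.

save_excel(fin_save_list, 'house_prices')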
The full source code is attached below.
#! -*- coding: utf-8 -*-
# Function: house price survey
# Author: taobz

from urllib import parse, request
from bs4 import BeautifulSoup as BS
from multiprocessing import Pool
import re
import lxml
import datetime
import cProfile
import socket
import copy
import xlsxwriter

starttime = datetime.datetime.now()

base_url = r'http://bj.fangjia.com/ershoufang/'

test_search_dict = {'Changping': {'Huoying': {'Line 13': 'http://bj.fangjia.com/ershoufang/--r-%E6%98%8C%E5%B9%B3|w-13%E5%8F%B7%E7%BA%BF|b-%E9%9C%8D%E8%90%A5'}}}

search_list = []  # list of listing index entries
tmp_list = []     # cache for the index currently being built
layer = -1


# Obtain a list page
def get_page(url):
    headers = {
        'User-Agent': r'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                      r'Chrome/45.0.2454.85 Safari/537.36 115Browser/6.0.3',
        'Referer': r'http://bj.fangjia.com/ershoufang/',
        'Host': r'bj.fangjia.com',
        'Connection': 'keep-alive'
    }
    timeout = 60
    socket.setdefaulttimeout(timeout)  # set the global socket timeout
    req = request.Request(url, headers=headers)
    response = request.urlopen(req).read()
    page = response.decode('utf-8')
    return page


# Obtain the filter-keyword dict (region / section / subway links)
def get_search(page, key):
    soup = BS(page, 'lxml')
    search_list = soup.find_all(href=re.compile(key), target='')
    search_dict = {}
    for i in range(len(search_list)):
        soup = BS(str(search_list[i]), 'lxml')
        key = soup.select('a')[0].get_text()
        value = soup.a.attrs['href']
        search_dict[key] = value
    return search_dict


# Retrieve the listing url list (nested dictionary traversal)
def get_info_list(search_dict, layer, tmp_list, search_list):
    layer += 1  # current dictionary depth
    for i in range(len(search_dict)):
        tmp_key = list(search_dict.keys())[i]  # key at the current level
        tmp_list.append(tmp_key)               # add the key to the index being built
        tmp_value = search_dict[tmp_key]
        if isinstance(tmp_value, str):         # the value is a url
            tmp_list.append(tmp_value)         # append the url to the index
            search_list.append(copy.deepcopy(tmp_list))  # record the index plus url
            tmp_list = tmp_list[:layer]        # truncate the index back to this level
        elif tmp_value == '':                  # the value is empty: skip it
            layer -= 2                         # step back out of this level
            tmp_list = tmp_list[:layer]        # truncate the index accordingly
        else:                                  # the value is another dict: recurse
            get_info_list(tmp_value, layer, tmp_list, search_list)
            tmp_list = tmp_list[:layer]
    return search_list


# For each index url, read the page count and assemble one url per page
def get_info_pn_list(search_list):
    fin_search_list = []
    for i in range(len(search_list)):
        print('>>> capturing %s' % search_list[i][:3])
        search_url = search_list[i][3]
        try:
            page = get_page(search_url)
        except:
            print('get page timeout')
            continue
        soup = BS(page, 'lxml')
        # obtain the maximum number of pages
        pn_num = soup.select('span[class="mr5"]')[0].get_text()
        rule = re.compile(r'\d+')
        max_pn = int(rule.findall(pn_num)[1])
        # assemble the per-page urls
        for pn in range(1, max_pn + 1):
            print('************************** crawling page %s **************************' % pn)
            pn_rule = re.compile('[|]')
            fin_url = pn_rule.sub(r'|e-%s|' % pn, search_url, 1)
            tmp_url_list = copy.deepcopy(search_list[i][:3])
            tmp_url_list.append(fin_url)
            fin_search_list.append(tmp_url_list)
    return fin_search_list


# Obtain the tag information for each listing
def get_info(fin_search_list, process_i):
    print('process %s start' % process_i)
    fin_info_list = []
    for i in range(len(fin_search_list)):
        url = fin_search_list[i][3]
        try:
            page = get_page(url)
        except:
            print('get tag timeout')
            continue
        soup = BS(page, 'lxml')
        title_list = soup.select('a[class="h_name"]')
        address_list = soup.select('span[class="address"]')
        attr_list = soup.select('span[class="attribute"]')
        # select() cannot match some attribute values (those containing spaces); use find_all(attrs={}) instead
        price_list = soup.find_all(attrs={"class": "xq_aprice xq_esf_width"})
        for num in range(20):  # 20 listings per index page
            tag_tmp_list = []
            try:
                title = title_list[num].attrs["title"]
                print('************************** fetching %s **************************' % title)
                address = re.sub('\n', '', address_list[num].get_text())
                area = re.search(r'\d+[\u4E00-\u9FA5]{2}', attr_list[num].get_text()).group(0)
                layout = re.search(r'\d[^0-9]\d.', attr_list[num].get_text()).group(0)
                floor = re.search(r'\d/\d', attr_list[num].get_text()).group(0)
                price = re.search(r'\d+[\u4E00-\u9FA5]', price_list[num].get_text()).group(0)
                unit_price = re.search(r'\d+[\u4E00-\u9FA5]/.', price_list[num].get_text()).group(0)
                tag_tmp_list = copy.deepcopy(fin_search_list[i][:3])
                for tag in [title, address, area, layout, floor, price, unit_price]:
                    tag_tmp_list.append(tag)
                fin_info_list.append(tag_tmp_list)
            except:
                print('[capture failed]')
                continue
    print('process %s finished' % process_i)
    return fin_info_list


# Assign tasks to processes
def assignment_search_list(fin_search_list, project_num):
    # project_num is the number of tasks per chunk; the smaller it is, the more chunks are dispatched
    assignment_list = []
    fin_search_list_len = len(fin_search_list)
    for i in range(0, fin_search_list_len, project_num):
        start = i
        end = i + project_num
        assignment_list.append(fin_search_list[start:end])  # take a slice of the task list
    return assignment_list


# Store the capture results
def save_excel(fin_info_list, file_name):
    tag_name = ['Region', 'Section', 'Subway', 'Title', 'Location', 'Area', 'Layout', 'Floor', 'Total price', 'Unit price (per sq. m)']
    book = xlsxwriter.Workbook(r'C:\Users\Administrator\Desktop\%s.xls' % file_name)  # saved to the desktop by default
    tmp = book.add_worksheet()
    row_num = len(fin_info_list)
    for i in range(1, row_num + 2):
        if i == 1:
            tag_pos = 'A%s' % i
            tmp.write_row(tag_pos, tag_name)   # header row
        else:
            con_pos = 'A%s' % i
            content = fin_info_list[i - 2]     # offset by 2 because row 1 holds the header
            tmp.write_row(con_pos, content)
    book.close()


if __name__ == '__main__':
    file_name = input('The crawl results will be saved to the desktop; enter a file name: ')
    fin_save_list = []  # list that collects the captured information
    # level-1 filter: regions
    page = get_page(base_url)
    search_dict = get_search(page, 'r-')
    # level-2 filter: sections
    for k in search_dict:
        print('************************** level 1: capturing [%s] **************************' % k)
        url = search_dict[k]
        second_page = get_page(url)
        second_search_dict = get_search(second_page, 'b-')
        search_dict[k] = second_search_dict
    # level-3 filter: subway lines
    for k in search_dict:
        second_dict = search_dict[k]
        for s_k in second_dict:
            print('************************** level 2: capturing [%s] **************************' % s_k)
            url = second_dict[s_k]
            third_page = get_page(url)
            third_search_dict = get_search(third_page, 'w-')
            print('%s > %s' % (k, s_k))
            second_dict[s_k] = third_search_dict
    fin_info_list = get_info_list(search_dict, layer, tmp_list, search_list)
    fin_info_pn_list = get_info_pn_list(fin_info_list)
    p = Pool(4)  # set up the process pool
    assignment_list = assignment_search_list(fin_info_pn_list, 2)  # split the tasks for multiprocessing
    result = []  # list of async results
    for i in range(len(assignment_list)):
        result.append(p.apply_async(get_info, args=(assignment_list[i], i)))
    p.close()
    p.join()
    for result_i in range(len(result)):
        fin_info_result_list = result[result_i].get()
        fin_save_list.extend(fin_info_result_list)  # merge the lists returned by each process
    save_excel(fin_save_list, file_name)
    endtime = datetime.datetime.now()
    time = (endtime - starttime).seconds
    print('total time: %s' % time)
Summary:
The larger the crawl, the stricter the demands on the program logic and the more fluent your Python needs to be. Keep learning to write more Pythonic code.
That is all for this article. I hope it helps you in your study or work.