A few words up front:
This crawler targets house price listings. The goal is to practice handling more than 100,000 records and crawling an entire site.
The most immediate effect of a larger data volume is that it raises the bar for the logic of each function and forces you to choose data structures carefully for Python's characteristics. With a small crawl, redundant function logic, frequent I/O requests, or deeply nested loops cost only an extra second or two; as the data scale grows, that 1~2 s difference can stretch into 1~2 h.
Therefore, to crawl a whole site's worth of data, you can reduce the time cost of fetching information from two directions:
1) Optimize function logic and choose appropriate data structures, following Pythonic conventions. For example, joining many string fragments with str.join() is cheaper in memory than repeated "+" concatenation (see the sketch after this list).
2) Depending on whether the work is I/O-bound or CPU-bound, choose multithreading or multiprocessing to run tasks in parallel and improve throughput.
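A minimal sketch of the first point, with illustrative names: repeated "+" builds a new intermediate string on every iteration, while "".join() assembles the result in one pass.

fragments = [str(i) for i in range(100000)]

def concat_plus(parts):
    result = ''
    for p in parts:
        result = result + p  # each iteration allocates a new string
    return result

def concat_join(parts):
    return ''.join(parts)  # one pass over all fragments

assert concat_plus(fragments) == concat_join(fragments)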
I. Get the index
Wrap the request with urllib.request and set a socket timeout.
# get a list page
def get_page(url):
    headers = {
        'User-Agent': r'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                      r'Chrome/45.0.2454.85 Safari/537.36',
        'Referer': r'http://bj.fangjia.com/ershoufang/',
        'Host': r'bj.fangjia.com',
        'Connection': 'keep-alive'
    }
    timeout = 60  # seconds; the original value was lost in formatting
    socket.setdefaulttimeout(timeout)  # set timeout
    req = request.Request(url, headers=headers)
    response = request.urlopen(req).read()
    page = response.decode('utf-8')
    return page
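Note that socket.setdefaulttimeout() changes the timeout for every socket in the process. As a sketch of an alternative (not part of the original code), urlopen also accepts a per-call timeout:

from urllib import request

def get_page_alt(url, timeout=30):  # the timeout value here is an arbitrary assumption
    headers = {'User-Agent': 'Mozilla/5.0', 'Connection': 'keep-alive'}
    req = request.Request(url, headers=headers)
    # per-call timeout, so the global socket default is left untouched
    response = request.urlopen(req, timeout=timeout).read()
    return response.decode('utf-8')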
First-level location: district information
Second-level location: plate information (fetched per district and stored in a dict as key-value pairs)
Storing it in a dict makes it quick to look up the target you want. -> {'Chaoyang': {'Gongti', 'Anzhen', 'Jianxiang', ...}}
Third-level location: subway information (to search for listings around each subway line)
Add the subway-line information to the dict as well. -> {'Chaoyang': {'Gongti': {'Line 5', 'Line 10', 'Line 13'}, 'Anzhen', 'Jianxiang', ...}}
The corresponding URL: http://bj.fangjia.com/ershoufang/--r-%e6%9c%9d%e9%98%b3%7cw-5%e5%8f%b7%e7%ba%bf%7cb-%e6%83%a0%e6%96%b0%e8%a5%bf%e8%a1%97
Decoded URL: http://bj.fangjia.com/ershoufang/--r-Chaoyang|w-Line 5|b-Huixinxijie
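The percent-encoded segments above are just UTF-8-encoded Chinese keywords; a small sketch of how urllib.parse maps between the two forms:

from urllib import parse

print(parse.quote('朝阳'))                   # %E6%9C%9D%E9%98%B3 (the r- segment in the URL above)
print(parse.unquote('%E6%9C%9D%E9%98%B3'))  # 朝阳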
Depending on the URL's parameter pattern, there are two ways to get the destination URL:
1) Obtain the destination URL from the index path
# get the listing index list (nested dictionary traversal)
def get_info_list(search_dict, layer, tmp_list, search_list):
    layer += 1  # move down one dictionary level
    for i in range(len(search_dict)):
        tmp_key = list(search_dict.keys())[i]  # extract the key at the current level
        tmp_list.append(tmp_key)  # add the current key to tmp_list as an index
        tmp_value = search_dict[tmp_key]
        if isinstance(tmp_value, str):  # the value is a URL
            tmp_list.append(tmp_value)  # add the URL to tmp_list
            search_list.append(copy.deepcopy(tmp_list))  # add the tmp_list index + URL to search_list
            tmp_list = tmp_list[:layer]  # trim the index back to the current level
        elif tmp_value == '':  # the value is empty, skip it
            layer -= 2  # bounce back up the key levels
            tmp_list = tmp_list[:layer]  # trim the index back to the current level
        else:
            get_info_list(tmp_value, layer, tmp_list, search_list)  # the value is another dict, recurse into it
            tmp_list = tmp_list[:layer]
    return search_list
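A usage sketch of the traversal as reconstructed above, on a tiny hand-written index (keys romanized; the real dictionaries are built from the site):

demo_dict = {
    'Chaoyang': {
        'Gongti': {
            'Line 5': 'http://bj.fangjia.com/ershoufang/--r-...|w-...|b-...'
        }
    }
}
flat = get_info_list(demo_dict, layer=-1, tmp_list=[], search_list=[])
# flat -> [['Chaoyang', 'Gongti', 'Line 5', 'http://bj.fangjia.com/ershoufang/--r-...|w-...|b-...']]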
2) Assemble the URL from the dict information
{'Chaoyang': {'Gongti': {'Line 5'}}}
Parameters:
--r-Chaoyang
--b-Gongti
--w-Line 5
Assembled parameters: http://bj.fangjia.com/ershoufang/--r-Chaoyang|w-Line 5|b-Gongti
# build a combined URL from the parameters
def get_compose_url(compose_tmp_url, tag_args, key_args):
    compose_tmp_url_list = [compose_tmp_url, '|' if tag_args != 'r-' else '', tag_args, parse.quote(key_args), ]
    compose_url = ''.join(compose_tmp_url_list)
    return compose_url
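A sketch of how the parameters come together with the function above (the way compose_tmp_url is bootstrapped with '--' is an assumption; that part of the original code is not shown):

url = r'http://bj.fangjia.com/ershoufang/--'
for tag, key in [('r-', '朝阳'), ('w-', '5号线'), ('b-', '工体')]:
    url = get_compose_url(url, tag, key)
print(url)
# http://bj.fangjia.com/ershoufang/--r-%E6%9C%9D%E9%98%B3|w-5%E5%8F%B7%E7%BA%BF|b-%E5%B7%A5%E4%BD%93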
II. Get the maximum number of index pages
# get the URL list for every page number under the current index
def get_info_pn_list(search_list):
    fin_search_list = []
    for i in range(len(search_list)):
        print('>>> crawling %s' % search_list[i][:3])
        search_url = search_list[i][3]
        try:
            page = get_page(search_url)
        except:
            print('get page timeout')
            continue
        soup = BS(page, 'lxml')
        # get the maximum number of pages
        pn_num = soup.select('span[class="mr5"]')[0].get_text()
        rule = re.compile(r'\d+')
        max_pn = int(rule.findall(pn_num)[1])
        # assemble the URLs
        for pn in range(1, max_pn + 1):
            print('************************ crawling page %s ************************' % pn)
            pn_rule = re.compile('[|]')
            fin_url = pn_rule.sub(r'|e-%s|' % pn, search_url, 1)
            tmp_url_list = copy.deepcopy(search_list[i][:3])
            tmp_url_list.append(fin_url)
            fin_search_list.append(tmp_url_list)
    return fin_search_list
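The page number is injected by replacing the first '|' in the index URL with '|e-<page>|'. The substitution in isolation, using a URL that follows the pattern shown earlier:

import re

search_url = 'http://bj.fangjia.com/ershoufang/--r-%E6%9C%9D%E9%98%B3|w-5%E5%8F%B7%E7%BA%BF|b-%E5%B7%A5%E4%BD%93'
pn_rule = re.compile('[|]')
print(pn_rule.sub(r'|e-%s|' % 2, search_url, 1))
# http://bj.fangjia.com/ershoufang/--r-%E6%9C%9D%E9%98%B3|e-2|w-5%E5%8F%B7%E7%BA%BF|b-%E5%B7%A5%E4%BD%93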
III. Grab the listing information tags
These are the tags we want to crawl:
['Area', 'Plate', 'Subway', 'Title', 'Position', 'Square meters', 'Layout', 'Floor', 'Total price', 'Price per square meter']
# get tag information
def get_info(fin_search_list, process_i):
    print('process %s start' % process_i)
    fin_info_list = []
    for i in range(len(fin_search_list)):
        url = fin_search_list[i][3]
        try:
            page = get_page(url)
        except:
            print('get tag timeout')
            continue
        soup = BS(page, 'lxml')
        title_list = soup.select('a[class="h_name"]')
        address_list = soup.select('span[class="address"]')
        attr_list = soup.select('span[class="attribute"]')
        price_list = soup.find_all(attrs={"class": "xq_aprice xq_esf_width"})  # select() does not recognize some attribute values (those containing spaces), so find_all(attrs={}) is used instead
        for num in range(20):  # 20 listings per page
            tag_tmp_list = []
            try:
                title = title_list[num].attrs["title"]
                print(r'************************ getting %s ************************' % title)
                address = re.sub('\n', '', address_list[num].get_text())
                area = re.search(r'\d+[\u4e00-\u9fa5]{2}', attr_list[num].get_text()).group(0)
                layout = re.search(r'\d[^0-9]\d.', attr_list[num].get_text()).group(0)
                floor = re.search(r'\d/\d', attr_list[num].get_text()).group(0)
                price = re.search(r'\d+[\u4e00-\u9fa5]', price_list[num].get_text()).group(0)
                unit_price = re.search(r'\d+[\u4e00-\u9fa5]/.', price_list[num].get_text()).group(0)
                tag_tmp_list = copy.deepcopy(fin_search_list[i][:3])
                for tag in [title, address, area, layout, floor, price, unit_price]:
                    tag_tmp_list.append(tag)
                fin_info_list.append(tag_tmp_list)
            except:
                print('crawl failed')
                continue
    print('process %s end' % process_i)
    return fin_info_list
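The field regexes pull a number followed by Chinese characters out of the attribute and price text. A worked sketch on made-up sample strings (the real page text may differ):

import re

attr_text = '89平米 2室1厅 6/8层'    # hypothetical attribute text: size, layout, floor
price_text = '450万 50561元/平米'    # hypothetical price text: total price, unit price

area = re.search(r'\d+[\u4e00-\u9fa5]{2}', attr_text).group(0)        # '89平米'
layout = re.search(r'\d[^0-9]\d.', attr_text).group(0)                # '2室1厅'
floor = re.search(r'\d/\d', attr_text).group(0)                       # '6/8'
price = re.search(r'\d+[\u4e00-\u9fa5]', price_text).group(0)         # '450万'
unit_price = re.search(r'\d+[\u4e00-\u9fa5]/.', price_text).group(0)  # '50561元/平'
print(area, layout, floor, price, unit_price)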
IV. Assign tasks and crawl in parallel
Slice the task list, set up a process pool, and crawl in parallel.
# assign tasks
def assignment_search_list(fin_search_list, project_num):  # project_num is the number of tasks per process; the smaller it is, the more processes
    assignment_list = []
    fin_search_list_len = len(fin_search_list)
    for i in range(0, fin_search_list_len, project_num):
        start = i
        end = i + project_num
        assignment_list.append(fin_search_list[start:end])  # take a slice of the list
    return assignment_list
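A quick check of the slicing behaviour, with a toy list in place of the real URL list:

tasks = list(range(10))
print(assignment_search_list(tasks, 3))
# [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]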
p = Pool(4)  # set up the process pool
assignment_list = assignment_search_list(fin_info_pn_list, 3)  # assign tasks for the processes
result = []  # multiprocess result list
for i in range(len(assignment_list)):
    result.append(p.apply_async(get_info, args=(assignment_list[i], i)))
p.close()
p.join()
for result_i in range(len(result)):
    fin_info_result_list = result[result_i].get()
    fin_save_list.extend(fin_info_result_list)  # merge the lists returned by each process
With the process pool, the crawl takes about one third of the single-process time, roughly 3 h in total.
The machine has 4 cores; after testing, a project_num of 3 gave the best throughput on this computer.
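Since the bottleneck here is network I/O rather than CPU, a thread pool is also an option. As a sketch (not the author's original choice), multiprocessing.dummy exposes the same Pool interface backed by threads, so it can be swapped in with minimal changes:

from multiprocessing.dummy import Pool as ThreadPool  # thread-backed, same API as Pool

def crawl_with_threads(assignment_list):
    pool = ThreadPool(4)
    results = [pool.apply_async(get_info, args=(chunk, i))
               for i, chunk in enumerate(assignment_list)]
    pool.close()
    pool.join()
    fin = []
    for r in results:
        fin.extend(r.get())  # merge the lists returned by each worker
    return fin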
V. Store the results in Excel for later visualization
# store the crawl results
def save_excel(fin_info_list, file_name):
    tag_name = ['Area', 'Plate', 'Subway', 'Title', 'Position', 'Square meters', 'Layout', 'Floor', 'Total price', 'Price per square meter']
    book = xlsxwriter.Workbook(r'C:\Users\Administrator\Desktop\%s.xls' % file_name)  # saved to the desktop by default
    tmp = book.add_worksheet()
    row_num = len(fin_info_list)
    for i in range(1, row_num):
        if i == 1:
            tag_pos = 'A%s' % i
            tmp.write_row(tag_pos, tag_name)
        else:
            con_pos = 'A%s' % i
            content = fin_info_list[i - 1]  # -1 because the header occupies the first row
            tmp.write_row(con_pos, content)
    book.close()
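For reference, the XlsxWriter calls used above in isolation; note that XlsxWriter always produces xlsx-format content, so an .xlsx extension is a safer choice than .xls:

import xlsxwriter

book = xlsxwriter.Workbook('demo.xlsx')
sheet = book.add_worksheet()
sheet.write_row('A1', ['Area', 'Plate', 'Subway'])       # header row
sheet.write_row('A2', ['Chaoyang', 'Gongti', 'Line 5'])  # one data row
book.close()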
The full source code is attached below.
#! -*- coding: utf-8 -*-
# Function: house price survey
# Author:

from urllib import parse, request
from bs4 import BeautifulSoup as BS
from multiprocessing import Pool
import re
import lxml
import datetime
import cProfile
import socket
import copy
import xlsxwriter

starttime = datetime.datetime.now()

base_url = r'http://bj.fangjia.com/ershoufang/'

test_search_dict = {'Changping': {'Huoying': {'Line 13': 'http://bj.fangjia.com/ershoufang/--r-%E6%98%8C%E5%B9%B3|w-13%E5%8F%B7%E7%BA%BF|b-%E9%9C%8D%E8%90%A5'}}}

search_list = []  # listing info URL cache list
tmp_list = []     # listing info URL cache list
layer = -1


# get a list page
def get_page(url):
    headers = {
        'User-Agent': r'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                      r'Chrome/45.0.2454.85 Safari/537.36',
        'Referer': r'http://bj.fangjia.com/ershoufang/',
        'Host': r'bj.fangjia.com',
        'Connection': 'keep-alive'
    }
    timeout = 60  # seconds; the original value was lost in formatting
    socket.setdefaulttimeout(timeout)  # set timeout
    req = request.Request(url, headers=headers)
    response = request.urlopen(req).read()
    page = response.decode('utf-8')
    return page


# get the query keyword dict
def get_search(page, key):
    soup = BS(page, 'lxml')
    search_list = soup.find_all(href=re.compile(key), target='')
    search_dict = {}
    for i in range(len(search_list)):
        soup = BS(str(search_list[i]), 'lxml')
        key = soup.select('a')[0].get_text()
        value = soup.a.attrs['href']
        search_dict[key] = value
    return search_dict


# get the listing index list (nested dictionary traversal)
def get_info_list(search_dict, layer, tmp_list, search_list):
    layer += 1  # move down one dictionary level
    for i in range(len(search_dict)):
        tmp_key = list(search_dict.keys())[i]  # extract the key at the current level
        tmp_list.append(tmp_key)  # add the current key to tmp_list as an index
        tmp_value = search_dict[tmp_key]
        if isinstance(tmp_value, str):  # the value is a URL
            tmp_list.append(tmp_value)  # add the URL to tmp_list
            search_list.append(copy.deepcopy(tmp_list))  # add the tmp_list index + URL to search_list
            tmp_list = tmp_list[:layer]  # trim the index back to the current level
        elif tmp_value == '':  # the value is empty, skip it
            layer -= 2  # bounce back up the key levels
            tmp_list = tmp_list[:layer]  # trim the index back to the current level
        else:
            get_info_list(tmp_value, layer, tmp_list, search_list)  # the value is another dict, recurse into it
            tmp_list = tmp_list[:layer]
    return search_list


# get the URL list for every page number under the current index
def get_info_pn_list(search_list):
    fin_search_list = []
    for i in range(len(search_list)):
        print('>>> crawling %s' % search_list[i][:3])
        search_url = search_list[i][3]
        try:
            page = get_page(search_url)
        except:
            print('get page timeout')
            continue
        soup = BS(page, 'lxml')
        # get the maximum number of pages
        pn_num = soup.select('span[class="mr5"]')[0].get_text()
        rule = re.compile(r'\d+')
        max_pn = int(rule.findall(pn_num)[1])
        # assemble the URLs
        for pn in range(1, max_pn + 1):
            print('************************ crawling page %s ************************' % pn)
            pn_rule = re.compile('[|]')
            fin_url = pn_rule.sub(r'|e-%s|' % pn, search_url, 1)
            tmp_url_list = copy.deepcopy(search_list[i][:3])
            tmp_url_list.append(fin_url)
            fin_search_list.append(tmp_url_list)
    return fin_search_list


# get tag information
def get_info(fin_search_list, process_i):
    print('process %s start' % process_i)
    fin_info_list = []
    for i in range(len(fin_search_list)):
        url = fin_search_list[i][3]
        try:
            page = get_page(url)
        except:
            print('get tag timeout')
            continue
        soup = BS(page, 'lxml')
        title_list = soup.select('a[class="h_name"]')
        address_list = soup.select('span[class="address"]')
        attr_list = soup.select('span[class="attribute"]')
        price_list = soup.find_all(attrs={"class": "xq_aprice xq_esf_width"})  # select() does not recognize some attribute values (those containing spaces), so find_all(attrs={}) is used instead
        for num in range(20):  # 20 listings per page
            tag_tmp_list = []
            try:
                title = title_list[num].attrs["title"]
                print(r'************************ getting %s ************************' % title)
                address = re.sub('\n', '', address_list[num].get_text())
                area = re.search(r'\d+[\u4e00-\u9fa5]{2}', attr_list[num].get_text()).group(0)
                layout = re.search(r'\d[^0-9]\d.', attr_list[num].get_text()).group(0)
                floor = re.search(r'\d/\d', attr_list[num].get_text()).group(0)
                price = re.search(r'\d+[\u4e00-\u9fa5]', price_list[num].get_text()).group(0)
                unit_price = re.search(r'\d+[\u4e00-\u9fa5]/.', price_list[num].get_text()).group(0)
                tag_tmp_list = copy.deepcopy(fin_search_list[i][:3])
                for tag in [title, address, area, layout, floor, price, unit_price]:
                    tag_tmp_list.append(tag)
                fin_info_list.append(tag_tmp_list)
            except:
                print('crawl failed')
                continue
    print('process %s end' % process_i)
    return fin_info_list


# assign tasks
def assignment_search_list(fin_search_list, project_num):  # project_num is the number of tasks per process; the smaller it is, the more processes
    assignment_list = []
    fin_search_list_len = len(fin_search_list)
    for i in range(0, fin_search_list_len, project_num):
        start = i
        end = i + project_num
        assignment_list.append(fin_search_list[start:end])  # take a slice of the list
    return assignment_list


# store the crawl results
def save_excel(fin_info_list, file_name):
    tag_name = ['Area', 'Plate', 'Subway', 'Title', 'Position', 'Square meters', 'Layout', 'Floor', 'Total price', 'Price per square meter']
    book = xlsxwriter.Workbook(r'C:\Users\Administrator\Desktop\%s.xls' % file_name)  # saved to the desktop by default
    tmp = book.add_worksheet()
    row_num = len(fin_info_list)
    for i in range(1, row_num):
        if i == 1:
            tag_pos = 'A%s' % i
            tmp.write_row(tag_pos, tag_name)
        else:
            con_pos = 'A%s' % i
            content = fin_info_list[i - 1]  # -1 because the header occupies the first row
            tmp.write_row(con_pos, content)
    book.close()


if __name__ == '__main__':
    file_name = input(r'crawl complete, enter a file name to save: ')
    fin_save_list = []  # crawled information storage list

    # first-level filter
    page = get_page(base_url)
    search_dict = get_search(page, 'r-')

    # second-level filter
    for k in search_dict:
        print(r'************************ first-level crawl: crawling "%s" ************************' % k)
        url = search_dict[k]
        second_page = get_page(url)
        second_search_dict = get_search(second_page, 'b-')
        search_dict[k] = second_search_dict

    # third-level filter
    for k in search_dict:
        second_dict = search_dict[k]
        for s_k in second_dict:
            print(r'************************ second-level crawl: crawling "%s" ************************' % s_k)
            url = second_dict[s_k]
            third_page = get_page(url)
            third_search_dict = get_search(third_page, 'w-')
            print('%s>%s' % (k, s_k))
            second_dict[s_k] = third_search_dict

    fin_info_list = get_info_list(search_dict, layer, tmp_list, search_list)
    fin_info_pn_list = get_info_pn_list(fin_info_list)

    p = Pool(4)  # set up the process pool
    assignment_list = assignment_search_list(fin_info_pn_list, 2)  # assign tasks for the processes
    result = []  # multiprocess result list
    for i in range(len(assignment_list)):
        result.append(p.apply_async(get_info, args=(assignment_list[i], i)))
    p.close()
    p.join()
    for result_i in range(len(result)):
        fin_info_result_list = result[result_i].get()
        fin_save_list.extend(fin_info_result_list)  # merge the lists returned by each process
    save_excel(fin_save_list, file_name)
    endtime = datetime.datetime.now()
    time = (endtime - starttime).seconds
    print('In total: %s s' % time)
Summary:
The larger the crawl, the more rigorous the program logic has to be, and the more fluent your Python needs to be. Writing more Pythonic code is something that takes continued study and practice.
That is the entire content of this article. I hope it can be of some help for your study or work, and thank you for supporting the Yunqi community.