Python: parallel crawling of 400,000 house-price records from an entire site (the target city can be swapped)


Before we start

This crawler targets house-price listings; the goal is to practice handling more than 100,000 records and crawling an entire site.

The most direct effect of a larger data volume is that function logic and data-structure choices matter much more, and Python's characteristics have to be taken into account. In a small crawl, redundant logic, overly frequent I/O requests, or deeply nested loops only cost an extra second or two; as the data size grows, that 1-2 s difference can stretch into 1-2 h.

So, to crawl a whole site's worth of data, there are two angles for cutting the time cost of fetching information:

1) Optimize the function logic and choose appropriate data structures, following Pythonic conventions. For example, joining strings with ''.join() is more memory-efficient than repeated '+' concatenation (see the sketch after this list).

2) Depending on whether the workload is I/O-bound or CPU-bound, choose multithreading or multiprocessing to run tasks in parallel and raise execution efficiency.
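As a quick illustration of point 1), here is a minimal sketch (not from the original script; the fragment list is made up for the example):

    # Build one large string from many fragments (sample data).
    parts = ['row%d\n' % i for i in range(100000)]

    # Repeated '+' concatenation: each step may copy the whole accumulated string.
    slow = ''
    for p in parts:
        slow = slow + p

    # ''.join(): a single pass that allocates the result once.
    fast = ''.join(parts)

    assert slow == fast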

I. Get the index

Wrap the request with custom headers and set a socket timeout:

# Get a list page (uses: from urllib import request; import socket -- see the full source at the end)
def get_page(url):
    headers = {
        'User-Agent': r'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                      r'Chrome/45.0.2454.85 Safari/537.36',
        'Referer': r'http://bj.fangjia.com/ershoufang/',
        'Host': r'bj.fangjia.com',
        'Connection': 'keep-alive'
    }
    timeout = 60  # seconds (assumed value; the original number was lost)
    socket.setdefaulttimeout(timeout)  # set timeout
    req = request.Request(url, headers=headers)
    response = request.urlopen(req).read()
    page = response.decode('utf-8')
    return page

First-level position: region information

Second-level position: plate information (plates are obtained per region and stored in the dict as key/value pairs)

Stored as a dict, the target can be looked up quickly. -> {'Chaoyang': {'Gongti', 'Anzhen', 'Jiangxiang', ...}}

Third-level position: subway information (listings near each subway line)

Add the subway-line information to the dict. -> {'Chaoyang': {'Gongti': {'Line 5', 'Line 10', 'Line 13'}, 'Anzhen', 'Jiangxiang', ...}}
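To make that structure concrete, here is a minimal sketch; the names and the URL are placeholders rather than scraped data:

    # region -> plate -> subway line -> listing-page URL
    search_dict = {
        'Chaoyang': {
            'Gongti': {
                'Line 5': 'http://bj.fangjia.com/ershoufang/--r-...|w-...|b-...',
            },
        },
    }

    # Any listing-page URL is then reachable with constant-time dict lookups:
    url = search_dict['Chaoyang']['Gongti']['Line 5']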

The corresponding URL: http://bj.fangjia.com/ershoufang/--r-%e6%9c%9d%e9%98%b3%7cw-5%e5%8f%b7%e7%ba%bf%7cb-%e6%83%a0%e6%96%b0%e8%a5%bf%e8%a1%97

The decoded URL: http://bj.fangjia.com/ershoufang/--r-朝阳|w-5号线|b-惠新西街 (r-Chaoyang | w-Line 5 | b-Huixinxijie)

Depending on the URL's parameter pattern, there are two ways to get the destination URL:

1) Build the destination URL from the index path

# Get listing index URLs (nested dictionary traversal)
def get_info_list(search_dict, layer, tmp_list, search_list):
    layer += 1  # move one level deeper into the dictionary
    for i in range(len(search_dict)):
        tmp_key = list(search_dict.keys())[i]            # key at the current level
        tmp_list.append(tmp_key)                         # use the key as an index entry
        tmp_value = search_dict[tmp_key]
        if isinstance(tmp_value, str):                   # the value is a URL
            tmp_list.append(tmp_value)                   # append the URL
            search_list.append(copy.deepcopy(tmp_list))  # store the indexed URL in search_list
            tmp_list = tmp_list[:layer]                  # trim the index back to this level
        elif tmp_value == '':                            # skip empty values
            layer -= 2                                   # back out of this key level
            tmp_list = tmp_list[:layer]                  # trim the index back to this level
        else:                                            # the value is another dict: recurse
            get_info_list(tmp_value, layer, tmp_list, search_list)
            tmp_list = tmp_list[:layer]
    return search_list

2) Assemble the URL from the dict information

{'Chaoyang': {'Gongti': {'Line 5'}}}

Parameters:

--r-Chaoyang

--b-Gongti

--w-Line 5

Assembled URL: http://bj.fangjia.com/ershoufang/--r-Chaoyang|w-Line 5|b-Gongti (the Chinese keywords are percent-encoded by parse.quote when the URL is built)

# Build a combined URL from the parameters
def get_compose_url(compose_tmp_url, tag_args, key_args):
    compose_tmp_url_list = [compose_tmp_url,
                            '|' if tag_args != 'r' else '',  # the first parameter (r) takes no leading '|'
                            tag_args,
                            parse.quote(key_args)]           # percent-encode the Chinese keyword
    compose_url = ''.join(compose_tmp_url_list)
    return compose_url

II. Get the maximum number of index pages

# Get the URL of every results page under each index
def get_info_pn_list(search_list):
    fin_search_list = []
    for i in range(len(search_list)):
        print('>>> crawling %s' % search_list[i][:3])
        search_url = search_list[i][3]
        try:
            page = get_page(search_url)
        except:
            print('get page timeout')
            continue
        soup = BS(page, 'lxml')
        # get the maximum page number
        pn_num = soup.select('span[class="mr5"]')[0].get_text()
        rule = re.compile(r'\d+')
        max_pn = int(rule.findall(pn_num)[1])
        # assemble one URL per page
        for pn in range(1, max_pn + 1):
            print('************************ crawling page %s ************************' % pn)
            pn_rule = re.compile('[|]')
            fin_url = pn_rule.sub(r'|e-%s|' % pn, search_url, 1)
            tmp_url_list = copy.deepcopy(search_list[i][:3])
            tmp_url_list.append(fin_url)
            fin_search_list.append(tmp_url_list)
    return fin_search_list

III. Grab the listing tag information

These are the tags we want to crawl:

['area', 'plate', 'subway', 'title', 'location', 'square meters', 'layout', 'floor', 'total price', 'price per square meter']

# Get the tag information
def get_info(fin_search_list, process_i):
    print('process %s start' % process_i)
    fin_info_list = []
    for i in range(len(fin_search_list)):
        url = fin_search_list[i][3]
        try:
            page = get_page(url)
        except:
            print('get tag timeout')
            continue
        soup = BS(page, 'lxml')
        title_list = soup.select('a[class="h_name"]')
        address_list = soup.select('span[class="address"]')
        attr_list = soup.select('span[class="attribute"]')
        # select() misses some attribute values (those containing spaces), so use find_all(attrs={}) instead
        price_list = soup.find_all(attrs={"class": "xq_aprice xq_esf_width"})
        for num in range(20):  # 20 listings per page
            tag_tmp_list = []
            try:
                title = title_list[num].attrs["title"]
                print(r'************************ getting %s ************************' % title)
                address = re.sub('\n', '', address_list[num].get_text())
                area = re.search('\d+[\u4e00-\u9fa5]{2}', attr_list[num].get_text()).group(0)
                layout = re.search('\d[^0-9]\d.', attr_list[num].get_text()).group(0)
                floor = re.search('\d/\d', attr_list[num].get_text()).group(0)
                price = re.search('\d+[\u4e00-\u9fa5]', price_list[num].get_text()).group(0)
                unit_price = re.search('\d+[\u4e00-\u9fa5]/.', price_list[num].get_text()).group(0)
                tag_tmp_list = copy.deepcopy(fin_search_list[i][:3])
                for tag in [title, address, area, layout, floor, price, unit_price]:
                    tag_tmp_list.append(tag)
                fin_info_list.append(tag_tmp_list)
            except:
                print('crawl failed')
                continue
    print('process %s end' % process_i)
    return fin_info_list

IV. Assign tasks and crawl in parallel

Split the task list into chunks, set up a process pool, and crawl in parallel.

# Assign tasks
def assignment_search_list(fin_search_list, project_num):
    # project_num: how many tasks each chunk holds; the smaller it is, the more chunks (and processes) there are
    assignment_list = []
    fin_search_list_len = len(fin_search_list)
    for i in range(0, fin_search_list_len, project_num):
        start = i
        end = i + project_num
        assignment_list.append(fin_search_list[start:end])  # slice out one chunk of the list
    return assignment_list

# in the main block (fin_save_list is initialized there):
p = Pool(4)                                                    # set up the process pool
assignment_list = assignment_search_list(fin_info_pn_list, 3)  # split the tasks for multiprocessing
result = []                                                    # handles for each process's result
for i in range(len(assignment_list)):
    result.append(p.apply_async(get_info, args=(assignment_list[i], i)))
p.close()
p.join()
for result_i in range(len(result)):
    fin_info_result_list = result[result_i].get()
    fin_save_list.extend(fin_info_result_list)                 # merge the list returned by each process
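For reference, the same fan-out can also be written with Pool.starmap, which blocks until every chunk returns and yields the partial results in order. This is an equivalent sketch using the functions defined above, not the code the article runs:

    from multiprocessing import Pool

    if __name__ == '__main__':
        chunks = assignment_search_list(fin_info_pn_list, 3)
        with Pool(4) as p:
            partial_results = p.starmap(get_info, [(chunk, i) for i, chunk in enumerate(chunks)])
        fin_save_list = []
        for part in partial_results:
            fin_save_list.extend(part)  # merge the list returned by each process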

Crawling through the process pool cuts the runtime to about one third of the single-process time; the total run took roughly 3 h.

The machine has 4 cores; after testing, a chunk size (project_num) of 3 ran most efficiently on this computer.
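The best chunk size depends on the machine, so one hypothetical way to pick it (not part of the original script) is to time a small sample of the task list with a few candidate values of project_num:

    import time
    from multiprocessing import Pool

    def time_chunk_size(project_num, sample_tasks):
        # Run get_info over sample_tasks with the given chunk size and return the elapsed seconds.
        start = time.perf_counter()
        chunks = assignment_search_list(sample_tasks, project_num)
        with Pool(4) as p:
            handles = [p.apply_async(get_info, args=(chunk, i)) for i, chunk in enumerate(chunks)]
            for h in handles:
                h.get()
        return time.perf_counter() - start

    # Example (assumes fin_info_pn_list from above):
    # for n in (2, 3, 4, 6):
    #     print(n, time_chunk_size(n, fin_info_pn_list[:60]))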

V. Store the results in Excel for later visualization

# Store the crawl results
def save_excel(fin_info_list, file_name):
    tag_name = ['area', 'plate', 'subway', 'title', 'location', 'square meters', 'layout', 'floor', 'total price', 'price per square meter']
    book = xlsxwriter.Workbook(r'C:\Users\Administrator\Desktop\%s.xls' % file_name)  # saved to the desktop by default
    tmp = book.add_worksheet()
    row_num = len(fin_info_list)
    for i in range(1, row_num):
        if i == 1:
            tag_pos = 'A%s' % i
            tmp.write_row(tag_pos, tag_name)
        else:
            con_pos = 'A%s' % i
            content = fin_info_list[i - 1]  # -1 because the header occupies the first row
            tmp.write_row(con_pos, content)
    book.close()

Attached: the full source code

#! -*- coding: utf-8 -*-
# Function: house price survey

from urllib import parse, request
from bs4 import BeautifulSoup as BS
from multiprocessing import Pool
import re
import lxml
import datetime
import cProfile
import socket
import copy
import xlsxwriter

starttime = datetime.datetime.now()

base_url = r'http://bj.fangjia.com/ershoufang/'

test_search_dict = {'昌平': {'霍营': {'13号线': 'http://bj.fangjia.com/ershoufang/--r-%E6%98%8C%E5%B9%B3|w-13%E5%8F%B7%E7%BA%BF|b-%E9%9C%8D%E8%90%A5'}}}

search_list = []  # listing index URLs
tmp_list = []     # cache for the current index path
layer = -1


# Get a list page
def get_page(url):
    headers = {
        'User-Agent': r'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                      r'Chrome/45.0.2454.85 Safari/537.36',
        'Referer': r'http://bj.fangjia.com/ershoufang/',
        'Host': r'bj.fangjia.com',
        'Connection': 'keep-alive'
    }
    timeout = 60  # seconds (assumed value; the original number was lost)
    socket.setdefaulttimeout(timeout)  # set timeout
    req = request.Request(url, headers=headers)
    response = request.urlopen(req).read()
    page = response.decode('utf-8')
    return page


# Get the dict of query keywords
def get_search(page, key):
    soup = BS(page, 'lxml')
    search_list = soup.find_all(href=re.compile(key), target='')
    search_dict = {}
    for i in range(len(search_list)):
        soup = BS(str(search_list[i]), 'lxml')
        key = soup.select('a')[0].get_text()
        value = soup.a.attrs['href']
        search_dict[key] = value
    return search_dict


# Get listing index URLs (nested dictionary traversal)
def get_info_list(search_dict, layer, tmp_list, search_list):
    layer += 1  # move one level deeper into the dictionary
    for i in range(len(search_dict)):
        tmp_key = list(search_dict.keys())[i]            # key at the current level
        tmp_list.append(tmp_key)                         # use the key as an index entry
        tmp_value = search_dict[tmp_key]
        if isinstance(tmp_value, str):                   # the value is a URL
            tmp_list.append(tmp_value)                   # append the URL
            search_list.append(copy.deepcopy(tmp_list))  # store the indexed URL in search_list
            tmp_list = tmp_list[:layer]                  # trim the index back to this level
        elif tmp_value == '':                            # skip empty values
            layer -= 2                                   # back out of this key level
            tmp_list = tmp_list[:layer]
        else:                                            # the value is another dict: recurse
            get_info_list(tmp_value, layer, tmp_list, search_list)
            tmp_list = tmp_list[:layer]
    return search_list


# Get the URL of every results page under each index
def get_info_pn_list(search_list):
    fin_search_list = []
    for i in range(len(search_list)):
        print('>>> crawling %s' % search_list[i][:3])
        search_url = search_list[i][3]
        try:
            page = get_page(search_url)
        except:
            print('get page timeout')
            continue
        soup = BS(page, 'lxml')
        # get the maximum page number
        pn_num = soup.select('span[class="mr5"]')[0].get_text()
        rule = re.compile(r'\d+')
        max_pn = int(rule.findall(pn_num)[1])
        # assemble one URL per page
        for pn in range(1, max_pn + 1):
            print('************************ crawling page %s ************************' % pn)
            pn_rule = re.compile('[|]')
            fin_url = pn_rule.sub(r'|e-%s|' % pn, search_url, 1)
            tmp_url_list = copy.deepcopy(search_list[i][:3])
            tmp_url_list.append(fin_url)
            fin_search_list.append(tmp_url_list)
    return fin_search_list


# Get the tag information
def get_info(fin_search_list, process_i):
    print('process %s start' % process_i)
    fin_info_list = []
    for i in range(len(fin_search_list)):
        url = fin_search_list[i][3]
        try:
            page = get_page(url)
        except:
            print('get tag timeout')
            continue
        soup = BS(page, 'lxml')
        title_list = soup.select('a[class="h_name"]')
        address_list = soup.select('span[class="address"]')
        attr_list = soup.select('span[class="attribute"]')
        # select() misses some attribute values (those containing spaces), so use find_all(attrs={}) instead
        price_list = soup.find_all(attrs={"class": "xq_aprice xq_esf_width"})
        for num in range(20):  # 20 listings per page
            tag_tmp_list = []
            try:
                title = title_list[num].attrs["title"]
                print(r'************************ getting %s ************************' % title)
                address = re.sub('\n', '', address_list[num].get_text())
                area = re.search('\d+[\u4e00-\u9fa5]{2}', attr_list[num].get_text()).group(0)
                layout = re.search('\d[^0-9]\d.', attr_list[num].get_text()).group(0)
                floor = re.search('\d/\d', attr_list[num].get_text()).group(0)
                price = re.search('\d+[\u4e00-\u9fa5]', price_list[num].get_text()).group(0)
                unit_price = re.search('\d+[\u4e00-\u9fa5]/.', price_list[num].get_text()).group(0)
                tag_tmp_list = copy.deepcopy(fin_search_list[i][:3])
                for tag in [title, address, area, layout, floor, price, unit_price]:
                    tag_tmp_list.append(tag)
                fin_info_list.append(tag_tmp_list)
            except:
                print('crawl failed')
                continue
    print('process %s end' % process_i)
    return fin_info_list


# Assign tasks
def assignment_search_list(fin_search_list, project_num):
    # project_num: how many tasks each chunk holds; the smaller it is, the more chunks there are
    assignment_list = []
    fin_search_list_len = len(fin_search_list)
    for i in range(0, fin_search_list_len, project_num):
        start = i
        end = i + project_num
        assignment_list.append(fin_search_list[start:end])  # slice out one chunk of the list
    return assignment_list


# Store the crawl results
def save_excel(fin_info_list, file_name):
    tag_name = ['area', 'plate', 'subway', 'title', 'location', 'square meters', 'layout', 'floor', 'total price', 'price per square meter']
    book = xlsxwriter.Workbook(r'C:\Users\Administrator\Desktop\%s.xls' % file_name)  # saved to the desktop by default
    tmp = book.add_worksheet()
    row_num = len(fin_info_list)
    for i in range(1, row_num):
        if i == 1:
            tag_pos = 'A%s' % i
            tmp.write_row(tag_pos, tag_name)
        else:
            con_pos = 'A%s' % i
            content = fin_info_list[i - 1]  # -1 because the header occupies the first row
            tmp.write_row(con_pos, content)
    book.close()


if __name__ == '__main__':
    file_name = input(r'Enter a file name to save the crawl results under: ')
    fin_save_list = []  # final list of crawled records

    # first-level filter: regions
    page = get_page(base_url)
    search_dict = get_search(page, 'r')

    # second-level filter: plates
    for k in search_dict:
        print(r'************************ level-1 crawl: crawling "%s" ************************' % k)
        url = search_dict[k]
        second_page = get_page(url)
        second_search_dict = get_search(second_page, 'b')
        search_dict[k] = second_search_dict

    # third-level filter: subway lines
    for k in search_dict:
        second_dict = search_dict[k]
        for s_k in second_dict:
            print(r'************************ level-2 crawl: crawling "%s" ************************' % s_k)
            url = second_dict[s_k]
            third_page = get_page(url)
            third_search_dict = get_search(third_page, 'w')
            print('%s>%s' % (k, s_k))
            second_dict[s_k] = third_search_dict

    fin_info_list = get_info_list(search_dict, layer, tmp_list, search_list)
    fin_info_pn_list = get_info_pn_list(fin_info_list)

    p = Pool(4)                                                    # set up the process pool
    assignment_list = assignment_search_list(fin_info_pn_list, 2)  # split the tasks for multiprocessing
    result = []                                                    # handles for each process's result
    for i in range(len(assignment_list)):
        result.append(p.apply_async(get_info, args=(assignment_list[i], i)))
    p.close()
    p.join()
    for result_i in range(len(result)):
        fin_info_result_list = result[result_i].get()
        fin_save_list.extend(fin_info_result_list)                 # merge the list returned by each process

    save_excel(fin_save_list, file_name)

    endtime = datetime.datetime.now()
    time = (endtime - starttime).seconds
    print('In total: %s s' % time)

Summary:

The larger the crawl, the more rigorous the program logic needs to be, and the more fluent the Python has to be. Writing more Pythonic code is something that takes continued study and practice.

That is the end of this article; I hope it brings some help to your study or work.
