A Python tutorial: crawling 400,000 housing-price listings from an entire site (the crawled city is swappable)

Source: Internet
Author: User
Tags: datetime, extend, join

A few words up front

This crawler targets housing-price information; the goal is to practise handling more than 100,000 records and crawling an entire site.

The most direct way to cope with a larger data volume is to tighten the logic of each function and to choose data structures that suit Python's characteristics. With small crawls, redundant logic, overly frequent I/O requests, or deeply nested loops only cost an extra second or two; as the data size grows, that 1~2 s difference can stretch into 1~2 h.

So, to crawl a whole site's worth of data, you can reduce the time cost of fetching information from two directions:

1) Optimize the function logic and pick appropriate data structures, following Pythonic conventions. For example, joining strings with ''.join() saves memory compared with repeated "+" concatenation (see the short sketch after this list).

2) Depending on whether the work is I/O-bound or CPU-bound, choose multithreading or multiprocessing to run tasks in parallel and improve throughput.
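As a minimal illustration of the first point (this snippet is mine, not from the original code), both loops below build the same string, but the join() version avoids repeatedly copying the growing result:

# Build one large string from many fragments (illustrative only).
fragments = ['row%d' % i for i in range(100000)]

# '+=' may copy the accumulated string on every iteration.
s = ''
for f in fragments:
    s += f

# ''.join() computes the total size once and allocates a single result.
s = ''.join(fragments)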

I. Get the index

Wrap the page request and set a timeout.

# Get a list page
def get_page(url):
    headers = {
        'User-Agent': r'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                      r'Chrome/45.0.2454.85 Safari/537.36 115Browser/6.0.3',
        'Referer': r'http://bj.fangjia.com/ershoufang/',
        'Host': r'bj.fangjia.com',
        'Connection': 'keep-alive'
    }
    timeout = 60
    socket.setdefaulttimeout(timeout)  # set the timeout
    req = request.Request(url, headers=headers)
    response = request.urlopen(req).read()
    page = response.decode('utf-8')
    return page

First level: region information

Second level: plate information (fetched per region and stored in a dict as key-value pairs)

Storing it in a dict lets you look up a target quickly. -> {'Chaoyang': {'Gongti', 'Anzhen', 'Jiangxiang', ...}}

Third level: subway information (to find listings around each subway line)

Add the subway line information to the dict. -> {'Chaoyang': {'Gongti': {'Line 5', 'Line 10', 'Line 13'}, 'Anzhen': {...}, 'Jiangxiang': {...}, ...}}
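A small hedged sketch of why the dict-based index helps (the names and lines below are illustrative sample values, not crawl output): membership tests on dicts and sets are O(1) on average, so checking whether a region, plate, or line has already been seen stays cheap even with thousands of entries.

# Illustrative nested index: region -> plate -> set of subway lines
index = {
    'Chaoyang': {
        'Gongti': {'Line 5', 'Line 10', 'Line 13'},
        'Anzhen': {'Line 5', 'Line 10'},
    }
}
# constant-time lookups at every level
if 'Chaoyang' in index and 'Gongti' in index['Chaoyang']:
    print('Line 10' in index['Chaoyang']['Gongti'])  # True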

The corresponding URL: http://bj.fangjia.com/ershoufang/--r-%e6%9c%9d%e9%98%b3%7cw-5%e5%8f%b7%e7%ba%bf%7cb-%e6%83%a0%e6%96%b0%e8%a5%bf%e8%a1%97

Decoded URL: http://bj.fangjia.com/ershoufang/--r-Chaoyang|w-Line 5|b-Huixinxijie
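The percent-encoded pieces are simply the UTF-8 bytes of the Chinese filter values. A quick hedged check with urllib.parse (the strings below are my own illustration of the mapping):

from urllib import parse

print(parse.quote('朝阳'))                  # %E6%9C%9D%E9%98%B3
print(parse.unquote('%E6%9C%9D%E9%98%B3'))  # 朝阳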

Depending on the URL's parameter pattern, there are two ways to obtain the destination URL:

1) Obtain the destination URL from the index path

# Get listings (nested dictionary traversal)
def get_info_list(search_dict, layer, tmp_list, search_list):
    layer += 1  # descend one dictionary level
    for i in range(len(search_dict)):
        tmp_key = list(search_dict.keys())[i]  # extract the key at the current level
        tmp_list.append(tmp_key)  # add the current key as an index to tmp_list
        tmp_value = search_dict[tmp_key]
        if isinstance(tmp_value, str):  # the value is a URL
            tmp_list.append(tmp_value)  # add the URL to tmp_list
            search_list.append(copy.deepcopy(tmp_list))  # add the indexed URL to search_list
            tmp_list = tmp_list[:layer]  # trim the index back to the current level
        elif tmp_value == '':  # skip empty values
            layer -= 2  # back out of this level
            tmp_list = tmp_list[:layer]  # trim the index back to the current level
        else:
            get_info_list(tmp_value, layer, tmp_list, search_list)  # the value is a dict: recurse
            tmp_list = tmp_list[:layer]
    return search_list
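For intuition, a hedged usage sketch with a made-up one-entry index (assuming the function as reconstructed above): the traversal flattens the nested dict into [region, plate, line, url] rows.

sample = {'Chaoyang': {'Gongti': {'Line 10': 'http://example.com/...'}}}
rows = get_info_list(sample, -1, [], [])
# rows -> [['Chaoyang', 'Gongti', 'Line 10', 'http://example.com/...']]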

2) Assemble the URL from the dict information

{'Chaoyang': {'Gongti': {'Line 5'}}}

Parameters:

--r-Chaoyang

--b-Gongti

--w-Line 5

Assembled parameters: http://bj.fangjia.com/ershoufang/--r-Chaoyang|w-Line 5|b-Gongti

# Create a composite URL from the parameters
def get_compose_url(compose_tmp_url, tag_args, key_args):
    compose_tmp_url_list = [compose_tmp_url, '|' if tag_args != 'r-' else '', tag_args, parse.quote(key_args), ]
    compose_url = ''.join(compose_tmp_url_list)
    return compose_url
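A hedged usage sketch (the base URL fragment and the 'r-'/'w-' tag strings are my assumptions about how the site encodes its filters):

url = r'http://bj.fangjia.com/ershoufang/--'
url = get_compose_url(url, 'r-', '朝阳')    # region filter, no leading '|'
url = get_compose_url(url, 'w-', '5号线')   # subway-line filter, prefixed with '|'
print(url)  # http://bj.fangjia.com/ershoufang/--r-%E6%9C%9D%E9%98%B3|w-5%E5%8F%B7%E7%BA%BF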

II. Get the maximum number of index pages

# Get the list of paginated URLs for the current index page
def get_info_pn_list(search_list):
    fin_search_list = []
    for i in range(len(search_list)):
        print('>>> crawling %s' % search_list[i][:3])
        search_url = search_list[i][3]
        try:
            page = get_page(search_url)
        except:
            print('get page timeout')
            continue
        soup = bs(page, 'lxml')
        # get the maximum number of pages
        pn_num = soup.select('span[class="mr5"]')[0].get_text()
        rule = re.compile(r'\d+')
        max_pn = int(rule.findall(pn_num)[1])
        # assemble the paginated URLs
        for pn in range(1, max_pn + 1):
            print('************************ crawling page %s ************************' % pn)
            pn_rule = re.compile('[|]')
            fin_url = pn_rule.sub(r'|e-%s|' % pn, search_url, 1)
            tmp_url_list = copy.deepcopy(search_list[i][:3])
            tmp_url_list.append(fin_url)
            fin_search_list.append(tmp_url_list)
    return fin_search_list
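The '|e-N|' segment carries the page number; because sub() is called with a count of 1, only the first '|' in the filter string is replaced, so the pagination tag lands right after the region parameter. A small hedged check with an illustrative URL:

import re

search_url = 'http://bj.fangjia.com/ershoufang/--r-A|w-B|b-C'
print(re.compile('[|]').sub('|e-2|', search_url, 1))
# http://bj.fangjia.com/ershoufang/--r-A|e-2|w-B|b-C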

III. Grab the listing tag information

These are the tags we want to crawl:

['Area', 'Plate', 'Subway', 'Title', 'Position', 'Square meters', 'Layout', 'Floor', 'Total price', 'Price per square meter']


36
# Get tag information
def get_info(fin_search_list, process_i):
    print('process %s start' % process_i)
    fin_info_list = []
    for i in range(len(fin_search_list)):
        url = fin_search_list[i][3]
        try:
            page = get_page(url)
        except:
            print('get tag timeout')
            continue
        soup = bs(page, 'lxml')
        title_list = soup.select('a[class="h_name"]')
        address_list = soup.select('span[class="address"]')
        attr_list = soup.select('span[class="attribute"]')
        price_list = soup.find_all(attrs={"class": "xq_aprice xq_esf_width"})  # select() does not recognize some attribute values (e.g. ones containing spaces); use find_all(attrs={}) instead
        for num in range(20):
            tag_tmp_list = []
            try:
                title = title_list[num].attrs["title"]
                print(r'************************ getting %s ************************' % title)
                address = re.sub('\n', '', address_list[num].get_text())
                area = re.search('\d+[\u4e00-\u9fa5]{2}', attr_list[num].get_text()).group(0)
                layout = re.search('\d[^0-9]\d.', attr_list[num].get_text()).group(0)
                floor = re.search('\d/\d', attr_list[num].get_text()).group(0)
                price = re.search('\d+[\u4e00-\u9fa5]', price_list[num].get_text()).group(0)
                unit_price = re.search('\d+[\u4e00-\u9fa5]/.', price_list[num].get_text()).group(0)
                tag_tmp_list = copy.deepcopy(fin_search_list[i][:3])
                for tag in [title, address, area, layout, floor, price, unit_price]:
                    tag_tmp_list.append(tag)
                fin_info_list.append(tag_tmp_list)
            except:
                print('crawl failed')
                continue
    print('process %s end' % process_i)
    return fin_info_list
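A hedged illustration of that select() vs find_all() remark (the HTML fragment is made up): find_all(attrs={...}) can match the exact class string even when it contains a space, and matching on a single class also works.

from bs4 import BeautifulSoup

html = '<span class="xq_aprice xq_esf_width">560万</span>'
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(attrs={"class": "xq_aprice xq_esf_width"})[0].get_text())  # 560万
print(soup.find_all('span', class_='xq_aprice')[0].get_text())                 # 560万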

IV. Assign tasks and crawl in parallel

Split the task list into chunks, set up a process pool, and crawl in parallel.

# Assign tasks
def assignment_search_list(fin_search_list, project_num):  # project_num is the number of tasks per chunk; the smaller it is, the more chunks (and processes) there are
    assignment_list = []
    fin_search_list_len = len(fin_search_list)
    for i in range(0, fin_search_list_len, project_num):
        start = i
        end = i + project_num
        assignment_list.append(fin_search_list[start:end])  # take a slice of the list
    return assignment_list
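A quick hedged check of the chunking with toy data:

print(assignment_search_list(list(range(7)), 3))
# [[0, 1, 2], [3, 4, 5], [6]]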
p = Pool(4)  # set up the process pool
assignment_list = assignment_search_list(fin_info_pn_list, 3)  # distribute the tasks across processes
result = []  # list of per-process results
for i in range(len(assignment_list)):
    result.append(p.apply_async(get_info, args=(assignment_list[i], i)))
p.close()
p.join()
for result_i in range(len(result)):
    fin_info_result_list = result[result_i].get()
    fin_save_list.extend(fin_info_result_list)  # merge the lists returned by each process

With the process pool in place, the crawl time dropped to about one third of the single-process time, for a total of roughly 3 hours.

The machine has 4 cores; after testing, a chunk size of 3 tasks per process gave the best throughput on this machine.
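On a machine with a different core count, the pool size can be derived from the hardware instead of being hard-coded (a small sketch, not part of the original script):

from multiprocessing import Pool, cpu_count

pool_size = cpu_count()  # e.g. 4 on the author's machine
p = Pool(pool_size)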

V. Store the results in Excel for later visualization and processing

# Store the crawl results
def save_excel(fin_info_list, file_name):
    tag_name = ['Area', 'Plate', 'Subway', 'Title', 'Position', 'Square meters', 'Layout', 'Floor', 'Total price', 'Price per square meter']
    book = xlsxwriter.Workbook(r'C:\Users\Administrator\Desktop\%s.xls' % file_name)  # stored on the desktop by default
    tmp = book.add_worksheet()
    row_num = len(fin_info_list)
    for i in range(1, row_num + 2):
        if i == 1:
            tag_pos = 'A%s' % i
            tmp.write_row(tag_pos, tag_name)  # header row
        else:
            con_pos = 'A%s' % i
            content = fin_info_list[i - 2]  # -2 because the header occupies row 1
            tmp.write_row(con_pos, content)
    book.close()
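A hedged usage sketch with one made-up row (the real rows come from the crawl, and the desktop path is the author's hard-coded choice):

sample = [
    ['Chaoyang', 'Gongti', 'Line 10', 'listing title', 'address', '89平米', '2室1厅', '6/20', '560万', '62921元/平米'],
]
save_excel(sample, 'fangjia_test')  # writes a header row plus one data row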

Attached source code

# -*- coding: utf-8 -*-
# Function: house price survey
# Author: ?
from urllib import parse, request
from bs4 import BeautifulSoup as bs
from multiprocessing import Pool
import re
import lxml
import datetime
import cProfile
import socket
import copy
import xlsxwriter

starttime = datetime.datetime.now()

base_url = r'http://bj.fangjia.com/ershoufang/'
test_search_dict = {'Changping': {'Huoying': {'Line 13': 'http://bj.fangjia.com/ershoufang/--r-%E6%98%8C%E5%B9%B3|w-13%E5%8F%B7%E7%BA%BF|b-%E9%9C%8D%E8%90%A5'}}}
search_list = []  # list of listing index information
tmp_list = []     # listing URL cache list
layer = -1

# Get a list page
def get_page(url):
    headers = {
        'User-Agent': r'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                      r'Chrome/45.0.2454.85 Safari/537.36 115Browser/6.0.3',
        'Referer': r'http://bj.fangjia.com/ershoufang/',
        'Host': r'bj.fangjia.com',
        'Connection': 'keep-alive'
    }
    timeout = 60
    socket.setdefaulttimeout(timeout)  # set the timeout
    req = request.Request(url, headers=headers)
    response = request.urlopen(req).read()
    page = response.decode('utf-8')
    return page

# Get the query-keyword dict
def get_search(page, key):
    soup = bs(page, 'lxml')
    search_list = soup.find_all(href=re.compile(key), target='')
    search_dict = {}
    for i in range(len(search_list)):
        soup = bs(str(search_list[i]), 'lxml')
        key = soup.select('a')[0].get_text()
        value = soup.a.attrs['href']
        search_dict[key] = value
    return search_dict

# Get listings (nested dictionary traversal)
def get_info_list(search_dict, layer, tmp_list, search_list):
    layer += 1  # descend one dictionary level
    for i in range(len(search_dict)):
        tmp_key = list(search_dict.keys())[i]  # extract the key at the current level
        tmp_list.append(tmp_key)  # add the current key as an index to tmp_list
        tmp_value = search_dict[tmp_key]
        if isinstance(tmp_value, str):  # the value is a URL
            tmp_list.append(tmp_value)  # add the URL to tmp_list
            search_list.append(copy.deepcopy(tmp_list))  # add the indexed URL to search_list
            tmp_list = tmp_list[:layer]  # trim the index back to the current level
        elif tmp_value == '':  # skip empty values
            layer -= 2  # back out of this level
            tmp_list = tmp_list[:layer]  # trim the index back to the current level
        else:
            get_info_list(tmp_value, layer, tmp_list, search_list)  # the value is a dict: recurse
            tmp_list = tmp_list[:layer]
    return search_list

# Get the list of paginated index URLs
def get_info_pn_list(search_list):
    fin_search_list = []
    for i in range(len(search_list)):
        print('>>> crawling %s' % search_list[i][:3])
        search_url = search_list[i][3]
        try:
            page = get_page(search_url)
        except:
            print('get page timeout')
            continue
        soup = bs(page, 'lxml')
        # get the maximum number of pages
        pn_num = soup.select('span[class="mr5"]')[0].get_text()
        rule = re.compile(r'\d+')
        max_pn = int(rule.findall(pn_num)[1])
        # assemble the paginated URLs
        for pn in range(1, max_pn + 1):
            print('************************ crawling page %s ************************' % pn)
            pn_rule = re.compile('[|]')
            fin_url = pn_rule.sub(r'|e-%s|' % pn, search_url, 1)
            tmp_url_list = copy.deepcopy(search_list[i][:3])
            tmp_url_list.append(fin_url)
            fin_search_list.append(tmp_url_list)
    return fin_search_list

# Get tag information
def get_info(fin_search_list, process_i):
    print('process %s start' % process_i)
    fin_info_list = []
    for i in range(len(fin_search_list)):
        url = fin_search_list[i][3]
        try:
            page = get_page(url)
        except:
            print('get tag timeout')
            continue
        soup = bs(page, 'lxml')
        title_list = soup.select('a[class="h_name"]')
        address_list = soup.select('span[class="address"]')
        attr_list = soup.select('span[class="attribute"]')
        price_list = soup.find_all(attrs={"class": "xq_aprice xq_esf_width"})  # select() does not recognize some attribute values (e.g. ones containing spaces); use find_all(attrs={}) instead
        for num in range(20):
            tag_tmp_list = []
            try:
                title = title_list[num].attrs["title"]
                print(r'************************ getting %s ************************' % title)
                address = re.sub('\n', '', address_list[num].get_text())
                area = re.search('\d+[\u4e00-\u9fa5]{2}', attr_list[num].get_text()).group(0)
                layout = re.search('\d[^0-9]\d.', attr_list[num].get_text()).group(0)
                floor = re.search('\d/\d', attr_list[num].get_text()).group(0)
                price = re.search('\d+[\u4e00-\u9fa5]', price_list[num].get_text()).group(0)
                unit_price = re.search('\d+[\u4e00-\u9fa5]/.', price_list[num].get_text()).group(0)
                tag_tmp_list = copy.deepcopy(fin_search_list[i][:3])
                for tag in [title, address, area, layout, floor, price, unit_price]:
                    tag_tmp_list.append(tag)
                fin_info_list.append(tag_tmp_list)
            except:
                print('crawl failed')
                continue
    print('process %s end' % process_i)
    return fin_info_list

# Assign tasks
def assignment_search_list(fin_search_list, project_num):  # project_num is the number of tasks per chunk; the smaller it is, the more chunks (and processes) there are
    assignment_list = []
    fin_search_list_len = len(fin_search_list)
    for i in range(0, fin_search_list_len, project_num):
        start = i
        end = i + project_num
        assignment_list.append(fin_search_list[start:end])  # take a slice of the list
    return assignment_list

# Store the crawl results
def save_excel(fin_info_list, file_name):
    tag_name = ['Area', 'Plate', 'Subway', 'Title', 'Position', 'Square meters', 'Layout', 'Floor', 'Total price', 'Price per square meter']
    book = xlsxwriter.Workbook(r'C:\Users\Administrator\Desktop\%s.xls' % file_name)  # stored on the desktop by default
    tmp = book.add_worksheet()
    row_num = len(fin_info_list)
    for i in range(1, row_num + 2):
        if i == 1:
            tag_pos = 'A%s' % i
            tmp.write_row(tag_pos, tag_name)  # header row
        else:
            con_pos = 'A%s' % i
            content = fin_info_list[i - 2]  # -2 because the header occupies row 1
            tmp.write_row(con_pos, content)
    book.close()

if __name__ == '__main__':
    file_name = input(r'Crawl complete. Enter a file name to save: ')
    fin_save_list = []  # list for the crawled information
    # first-level filtering
    page = get_page(base_url)
    search_dict = get_search(page, 'r-')
    # second-level filtering
    for k in search_dict:
        print(r'************************ first-level crawl: crawling "%s" ************************' % k)
        url = search_dict[k]
        second_page = get_page(url)
        second_search_dict = get_search(second_page, 'b-')
        search_dict[k] = second_search_dict
    # third-level filtering
    for k in search_dict:
        second_dict = search_dict[k]
        for s_k in second_dict:
            print(r'************************ second-level crawl: crawling "%s" ************************' % s_k)
            url = second_dict[s_k]
            third_page = get_page(url)
            third_search_dict = get_search(third_page, 'w-')
            print('%s>%s' % (k, s_k))
            second_dict[s_k] = third_search_dict
    fin_info_list = get_info_list(search_dict, layer, tmp_list, search_list)
    fin_info_pn_list = get_info_pn_list(fin_info_list)
    p = Pool(4)  # set up the process pool
    assignment_list = assignment_search_list(fin_info_pn_list, 2)  # distribute the tasks across processes
    result = []  # list of per-process results
    for i in range(len(assignment_list)):
        result.append(p.apply_async(get_info, args=(assignment_list[i], i)))
    p.close()
    p.join()
    for result_i in range(len(result)):
        fin_info_result_list = result[result_i].get()
        fin_save_list.extend(fin_info_result_list)  # merge the lists returned by each process
    save_excel(fin_save_list, file_name)
    endtime = datetime.datetime.now()
    time = (endtime - starttime).seconds
    print('In total: %s s' % time)

Summary:

The larger the crawl, the more rigorous the program logic needs to be and the more fluent your Python has to be. Writing more Pythonic code is something that takes continuous practice to master.
