# Function: House price survey
import copy
import datetime
import re
import socket
from urllib import parse, request

from bs4 import BeautifulSoup as BS
from multiprocessing import Pool

import xlsxwriter

starttime = datetime.datetime.now()

base_url = r'http://bj.fangjia.com/ershoufang/'

# Example of the nested dict the three crawl levels build:
# district -> plate -> subway line -> listing index URL.
test_search_dict = {'changping': {'huoying': {'line 13': 'http://bj.fangjia.com/ershoufang/--r-%E6%98%8C%E5%B9%B3|w-13%E5%8F%B7%E7%BA%BF|b-%E9%9C%8D%E8%90%A5'}}}

search_list = []  # accumulated listing index entries
tmp_list = []     # listing URL / key-path cache
layer = -1        # starting depth for the nested-dict walk

# Browser-like headers so the site serves normal pages.
headers = {
    'User-Agent': r'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                  r'Chrome/45.0.2454.85 Safari/537.36 115Browser/6.0.3',
    'Referer': r'http://bj.fangjia.com/ershoufang/',
    'Host': r'bj.fangjia.com',
    'Connection': 'keep-alive',
}


def get_page(url, timeout=60):
    """Fetch *url* with the crawler headers and return the body decoded as UTF-8.

    Raises whatever urllib/socket raises on failure; callers wrap this in
    try/except and skip the entry on timeout.
    """
    socket.setdefaulttimeout(timeout)  # process-wide socket timeout
    req = request.Request(url, headers=headers)
    response = request.urlopen(req).read()
    page = response.decode('utf-8')
    return page
def get_search(page, key):
    """Extract search links from *page*.

    Finds every anchor whose href matches the regex *key* (and has an
    empty ``target`` attribute) and returns a dict mapping the link text
    to its href, e.g. {district name: district URL}.
    """
    soup = BS(str(page), 'lxml')
    anchors = soup.find_all(href=re.compile(key), target='')
    search_dict = {}
    for anchor in anchors:
        # Re-parse each matched fragment so .select()/.a work on it alone.
        item = BS(str(anchor), 'lxml')
        name = item.select('a')[0].get_text()
        search_dict[name] = item.a.attrs['href']
    return search_dict
def get_info_list(search_dict, layer, tmp_list, search_list):
    """Flatten the nested {district: {plate: {line: url}}} dict.

    Walks *search_dict* depth-first.  Every string leaf (a listing index
    URL) produces one entry ``[key1, key2, ..., url]`` appended to
    *search_list*.  *layer* tracks the current depth so *tmp_list* (the
    key-path cache) can be trimmed back when a subtree is finished.
    Returns *search_list*.
    """
    layer += 1  # descend one dictionary level
    for tmp_key in list(search_dict.keys()):
        tmp_list.append(tmp_key)  # current key joins the index path
        tmp_value = search_dict[tmp_key]
        if tmp_value == '':
            # Empty value: nothing below this key — back out of it.
            # NOTE: this must be tested BEFORE isinstance(..., str),
            # because '' is itself a str and would be treated as a URL.
            layer -= 2
            tmp_list = tmp_list[:layer]
        elif isinstance(tmp_value, str):
            tmp_list.append(tmp_value)                   # leaf: the URL
            search_list.append(copy.deepcopy(tmp_list))  # snapshot the full path
            tmp_list = tmp_list[:layer]                  # trim back to this level
        else:
            # Nested dict: recurse, then trim the path back to this level.
            get_info_list(tmp_value, layer, tmp_list, search_list)
            tmp_list = tmp_list[:layer]
    return search_list
def get_info_pn_list(search_list):
    """Expand each listing-index entry into one URL per result page.

    Each element of *search_list* is ``[district, plate, line, url]``.
    For every entry, fetch the index page, read the maximum page number,
    and emit ``[district, plate, line, page_url]`` for each page 1..max.
    Entries whose index page cannot be fetched are skipped.
    """
    fin_search_list = []
    for entry in search_list:
        print('>>> crawling %s' % entry[:3])
        search_url = entry[3]
        try:
            page = get_page(search_url)
        except Exception:
            print('get page timeout')
            continue
        soup = BS(page, 'lxml')
        # The maximum page count lives in a span of class "mr5".
        pn_num = soup.select('span[class="mr5"]')[0].get_text()
        rule = re.compile(r'\d+')
        max_pn = int(rule.findall(pn_num)[1])
        for pn in range(1, max_pn + 1):
            print('************************ crawling page %s ************************' % pn)
            # Insert the page marker |e-N| at the first '|' of the URL.
            pn_rule = re.compile('[|]')
            fin_url = pn_rule.sub(r'|e-%s|' % pn, search_url, 1)
            tmp_url_list = copy.deepcopy(entry[:3])
            tmp_url_list.append(fin_url)
            fin_search_list.append(tmp_url_list)
    return fin_search_list
def get_info(fin_search_list, process_i):
    """Scrape listing details for every page URL assigned to process *process_i*.

    Each element of *fin_search_list* is ``[district, plate, line, page_url]``.
    Returns a list of rows ``[district, plate, line, title, address, area,
    layout, floor, price, unit_price]``.  Pages that time out and listings
    whose fields cannot be parsed are skipped.
    """
    print('process %s start' % process_i)
    fin_info_list = []
    for entry in fin_search_list:
        url = entry[3]
        try:
            page = get_page(url)
        except Exception:
            print('get tag timeout')
            continue
        soup = BS(page, 'lxml')
        title_list = soup.select('a[class="h_name"]')
        address_list = soup.select('span[class="address"]')
        attr_list = soup.select('span[class="attribute"]')
        # select() cannot match class values containing spaces; use
        # find_all(attrs={...}) for the price element instead.
        price_list = soup.find_all(attrs={"class": "xq_aprice xq_esf_width"})
        for num in range(len(title_list)):
            try:
                title = title_list[num].attrs["title"]
                print(r'************************ fetching %s ************************' % title)
                address = re.sub(r'\n', '', address_list[num].get_text())
                attr_text = attr_list[num].get_text()
                price_text = price_list[num].get_text()
                area = re.search(r'\d+[\u4e00-\u9fa5]{2}', attr_text).group(0)
                layout = re.search(r'\d[^0-9]\d.', attr_text).group(0)
                floor = re.search(r'\d/\d', attr_text).group(0)
                price = re.search(r'\d+[\u4e00-\u9fa5]', price_text).group(0)
                unit_price = re.search(r'\d+[\u4e00-\u9fa5]/.', price_text).group(0)
            except (AttributeError, IndexError, KeyError):
                # Malformed listing: a regex failed to match or a field
                # was missing — skip just this listing.
                continue
            tag_tmp_list = copy.deepcopy(entry[:3])
            for tag in [title, address, area, layout, floor, price, unit_price]:
                tag_tmp_list.append(tag)
            fin_info_list.append(tag_tmp_list)
    print('process %s end' % process_i)
    return fin_info_list
def assignment_search_list(fin_search_list, project_num):
    """Split *fin_search_list* into chunks of *project_num* tasks.

    One chunk is handed to each worker process, so a smaller
    *project_num* means more (smaller) process tasks.
    """
    assignment_list = []
    fin_search_list_len = len(fin_search_list)
    for start in range(0, fin_search_list_len, project_num):
        end = start + project_num
        assignment_list.append(fin_search_list[start:end])  # list fragment
    return assignment_list
def save_excel(fin_info_list, file_name):
    """Write the scraped rows to ``%s.xls`` on the desktop.

    Row 1 is the header; each element of *fin_info_list* becomes one
    data row starting at row 2.
    """
    tag_name = ['area', 'plate', 'subway', 'title', 'position', 'square meter',
                'layout', 'floor', 'total price', 'unit price']
    # Saved to the desktop by default.
    book = xlsxwriter.Workbook(r'C:\Users\Administrator\Desktop\%s.xls' % file_name)
    tmp = book.add_worksheet()
    tmp.write_row('A1', tag_name)  # header row
    # Data rows are offset by one because the header occupies row 1.
    for row, content in enumerate(fin_info_list, start=2):
        tmp.write_row('A%s' % row, content)
    book.close()
if __name__ == '__main__':
    file_name = input(r'enter filename to save as: ')
    fin_save_list = []  # merged results from all worker processes

    # Level 1: districts.
    page = get_page(base_url)
    search_dict = get_search(page, 'r')
    for k in list(search_dict.keys()):
        print(r'************************ level 1 crawl: crawling "%s" ************************' % k)
        url = search_dict[k]
        second_page = get_page(url)
        # Level 2: plates within the district.
        second_search_dict = get_search(second_page, 'b')
        search_dict[k] = second_search_dict
        for s_k in list(second_search_dict.keys()):
            print(r'************************ level 2 crawl: crawling "%s" ************************' % s_k)
            third_page = get_page(second_search_dict[s_k])
            # Level 3: subway lines within the plate.
            third_search_dict = get_search(third_page, 'w')
            print('%s>%s' % (k, s_k))
            second_search_dict[s_k] = third_search_dict

    # Flatten the nested dict, then expand to per-page URLs.
    fin_info_list = get_info_list(search_dict, layer, tmp_list, search_list)
    fin_info_pn_list = get_info_pn_list(fin_info_list)

    # Fan the page URLs out over a process pool.
    p = Pool(4)
    assignment_list = assignment_search_list(fin_info_pn_list, 2)
    result = []  # async results, one per task chunk
    for i in range(len(assignment_list)):
        result.append(p.apply_async(get_info, args=(assignment_list[i], i)))
    p.close()
    p.join()
    for res in result:
        fin_save_list.extend(res.get())  # merge each process's rows

    save_excel(fin_save_list, file_name)
    endtime = datetime.datetime.now()
    time = (endtime - starttime).seconds
    print('Total: %s s' % time)