Next, I'll work through an example that reinforces our understanding of regular expressions. Going back to the page we downloaded earlier: in practice we don't need the entire content of the page, so we will improve the program to filter the information on the page and save only the content we need. Open the page in the Chrome browser, right-click, and choose Inspect to examine the markup.
Find the elements we need in the source code of the web page. While developing, we can test the regular expressions at http://tool.oschina.net/regex/.
For the house info: pattern = r'data-el="region">(.+?)</div>'
For the price: pattern = r'<div class="totalPrice"><span>\d+</span>万'
(万 is the Chinese character for "ten thousand", the unit Lianjia uses for total prices.)
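Beyond the online tester, we can also sanity-check both patterns directly in Python. The HTML fragment below is a made-up miniature of a listing, not a real Lianjia response:

import re

sample = ('<a href="#" data-el="region">Example Estate </a>| 2 rooms | 75sqm</div>'
          '<div class="totalPrice"><span>650</span>万</div>')

print(re.findall(r'data-el="region">(.+?)</div>', sample))
# -> ['Example Estate </a>| 2 rooms | 75sqm']
print(re.findall(r'<div class="totalPrice"><span>\d+</span>万', sample))
# -> ['<div class="totalPrice"><span>650</span>万']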
The content extracted by these regular expressions still carries the surrounding HTML, so we can use Python's string slicing to trim each match down to just the data we want.
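To make the slicing offsets concrete, here is a minimal sketch using a made-up price string of the same shape as the real matches (the value 650 is invented for illustration):

# One matched price string, as re.findall() returns it when the pattern has no capture group
s = '<div class="totalPrice"><span>650</span>万'

# The prefix '<div class="totalPrice"><span>' is 30 characters and the
# suffix '</span>万' is 8 characters, so fixed offsets strip both ends
print(s[30:-8])  # -> 650

With that in mind, here is the full source: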
from urllib import request
import re


def htmlSpider(url, startpage, endpage):
    # Processes the URLs: builds one URL per page and sends a request for each
    for page in range(startpage, endpage + 1):
        filename = "Section " + str(page) + " page.html"
        # Combine into a full URL
        fullurl = url + str(page)
        # Call loadPage() to send the request and get the HTML page
        html = loadPage(fullurl, filename)


def loadPage(fullurl, filename):
    # Fetch the page
    response = request.urlopen(fullurl)
    html = response.read().decode('utf-8')
    # print(html)

    # Regex to extract the property information
    info_pattern = r'data-el="region">(.+?)</div>'
    info_list = re.findall(info_pattern, html)
    # print(info_list)

    # Regex to extract the property prices
    price_pattern = r'<div class="totalPrice"><span>\d+</span>万'
    price_list = re.findall(price_pattern, html)
    # print(price_list)

    writePage(price_list, info_list, filename)


def writePage(price_list, info_list, filename):
    """
    Format and print the extracted data; saving the raw
    response to local disk is left commented out below.
    """
    list1 = []
    list2 = []
    for i in price_list:
        # Slice off the fixed-width HTML wrapper, keeping only the digits
        # (万 = ten thousand CNY, the unit Lianjia uses)
        i = '-------------->>>>>price: ' + i[30:-8] + ' 万'
        list1.append(i)
        # print(i[30:-8])
    for j in info_list:
        # Replace the trailing </a> tag with spaces, then splice in a label
        j = j.replace('</a>', ' ' * 10)
        j = j[:10] + ' ' + '---------->>>>>detail information: ' + j[10:] + ' '
        list2.append(j)
        # print(j)

    # Print each (info, price) pair side by side
    for each in zip(list2, list1):
        print(each)

    print("Storing " + filename)
    # with open(filename, 'wb') as f:
    #     f.write(html)  # note: html would need to be passed in to write it here

    print("--" * 30)


if __name__ == "__main__":
    # Enter the start and end pages to download; note the conversion to int
    startpage = int(input("Please enter the start page: "))
    endpage = int(input("Please enter the end page: "))

    url = "https://sh.lianjia.com/ershoufang/"

    htmlSpider(url, startpage, endpage)

    print("Download complete!")
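A small variant of my own, not from the original program: if the price pattern captures the digits in a group, re.findall() returns only the captured text, and the magic slice offsets are no longer needed:

# With a capture group, findall() yields just the digits, e.g. ['650', '480']
price_pattern = r'<div class="totalPrice"><span>(\d+)</span>万'
price_list = re.findall(price_pattern, html)

The slicing approach still works, but it silently breaks if Lianjia ever changes the width of the wrapper markup, whereas the capture group does not care.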
This is the result of running the program. I simply print it to the terminal, but you could also use json.dumps() to save the crawled content locally.
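For completeness, a hedged sketch of what saving with json.dumps() could look like; the save_results() helper and the results.json filename are my own choices, not part of the original post:

import json

def save_results(info_list, price_list, filename='results.json'):
    # Pair each property's info with its price, mirroring the zip() in writePage()
    records = [{'info': info, 'price': price}
               for info, price in zip(info_list, price_list)]
    # ensure_ascii=False keeps Chinese characters readable in the output file
    with open(filename, 'w', encoding='utf-8') as f:
        f.write(json.dumps(records, ensure_ascii=False, indent=2))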
There are actually other ways to extract this data, which will be discussed later.