Python crawler (6)--Regular expression (iii)


Next, I'll work through an example that reinforces our understanding of regular expressions. Going back to the home page we downloaded earlier: in practice we do not need the entire content of the page, so we will improve the program to filter the information on the page and save only the content we need. Open the page in the Chrome browser, right-click, and choose Inspect.
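If you would rather search the raw HTML than click around DevTools, the snippet below is a minimal sketch (assuming the site answers a plain urlopen() without extra headers) that fetches the first listings page and prints a small window around the attribute we plan to match:

    from urllib import request

    # Fetch the first listings page so the raw HTML can be searched for the fields we need.
    url = "https://sh.lianjia.com/ershoufang/"
    html = request.urlopen(url).read().decode('utf-8')

    # Show a small window around the first occurrence of the attribute we plan to match.
    start = html.find('data-el="region"')
    if start != -1:
        print(html[start:start + 300])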


Find what we need in the source code of the web page. To debug the program, we can test the regular expressions we write at http://tool.oschina.net/regex/.

For the house information: pattern = r'data-el="region">(.+?)</div>'

For the price: pattern = r'<div class="totalPrice"><span>\d+</span>万' (万 denotes 10,000 CNY)
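Besides the online tester, the two patterns can be checked quickly with re.findall(); the fragment below is a minimal sketch where the sample markup is made up to mimic the structure of a Lianjia listing card:

    import re

    # A made-up fragment that mimics the structure of a Lianjia listing card.
    sample = ('<div class="houseInfo"><span></span>'
              '<a href="#" data-el="region">Some Estate </a> | 2室1厅 | 75平米</div>'
              '<div class="totalPrice"><span>610</span>万</div>')

    info_pattern = r'data-el="region">(.+?)</div>'
    price_pattern = r'<div class="totalPrice"><span>\d+</span>万'

    print(re.findall(info_pattern, sample))   # ['Some Estate </a> | 2室1厅 | 75平米']
    print(re.findall(price_pattern, sample))  # ['<div class="totalPrice"><span>610</span>万']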

The content we extract with these regexes still contains redundant markup, so we can use slicing to clean up the extracted strings. Here is the source:

    from urllib import request
    import re


    def htmlSpider(url, startpage, endpage):
        # Role: process the URLs and send a request for each page.
        for page in range(startpage, endpage + 1):
            filename = "Section " + str(page) + " page.html"

            # Combine into the full URL.
            fullurl = url + str(page)

            # Call loadPage() to send the request and get the HTML page.
            html = loadPage(fullurl, filename)


    def loadPage(fullurl, filename):
        # Get the page.
        response = request.urlopen(fullurl)
        html = response.read().decode('utf-8')
        # print(html)

        # Regex to get the property information.
        info_pattern = r'data-el="region">(.+?)</div>'
        info_list = re.findall(info_pattern, html)
        # print(info_list)

        # Regex to get the property prices (万 = 10,000 CNY).
        price_pattern = r'<div class="totalPrice"><span>\d+</span>万'
        price_list = re.findall(price_pattern, html)
        # print(price_list)

        writePage(price_list, info_list, filename)


    def writePage(price_list, info_list, filename):
        """
        Save the server's response to the local disk.
        """
        list1 = []
        list2 = []
        for i in price_list:
            # Slice off the surrounding markup, keeping only the number.
            i = '-------------->>>>>price:' + i[30:-8] + '万'
            list1.append(i)
            # print(i[30:-8])
        for j in info_list:
            j = j.replace('</a>', ' ' * 10)
            j = j[:10] + ' ' + '---------->>>>>detail information:' + j[10:] + ' '
            list2.append(j)
            # print(j)

        for each in zip(list2, list1):
            print(each)

        print("is storing " + filename)
        # with open(filename, 'wb') as f:
        #     f.write(html)

        print("--" * 30)


    if __name__ == "__main__":
        # Enter the start and end pages to download; note the conversion to int.
        startpage = int(input("Please enter the start page:"))
        endpage = int(input("Please enter the end page:"))

        url = "https://sh.lianjia.com/ershoufang/"

        htmlSpider(url, startpage, endpage)

        print("Download complete!")
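As a design note, the price pattern could also capture the digits with a group instead of slicing the full match with i[30:-8]; a minimal sketch of that alternative:

    import re

    # Alternative: capture the digits directly instead of slicing the full match.
    price_pattern = r'<div class="totalPrice"><span>(\d+)</span>万'

    html = '<div class="totalPrice"><span>610</span>万</div>'
    print(re.findall(price_pattern, html))  # ['610'] -- only the captured group is returned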

This is the result of running the program. I just print it to the terminal; you could also use json.dumps() to save the crawled content locally.
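For example, a minimal sketch of saving the results with json.dumps() (the record shown is hypothetical, just to illustrate the shape of the data):

    import json

    # Hypothetical results in the shape produced by the crawler above.
    results = [
        {"info": "Some Estate | 2室1厅 | 75平米", "price": "610万"},
    ]

    with open("ershoufang.json", "w", encoding="utf-8") as f:
        f.write(json.dumps(results, ensure_ascii=False, indent=2))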

There are actually other ways to extract this data, which will be discussed later.

